
Easy-to-Deploy API Extraction by Multi-Level Feature Embedding and Transfer Learning

Suyu Ma, Zhenchang Xing, Chunyang Chen, Cheng Chen, Lizhen Qu, Guoqiang Li

Abstract—Application Programming Interfaces (APIs) have been widely discussed on social-technical platforms (e.g., Stack Overflow). Extracting API mentions from such informal software texts is the prerequisite for API-centric search and summarization of programming knowledge. Machine learning based API extraction has demonstrated superior performance over rule-based methods in informal software texts that lack consistent writing forms and annotations. However, machine learning based methods have a significant overhead in preparing training data and effective features. In this paper, we propose a multi-layer neural network based architecture for API extraction. Our architecture automatically learns character-, word- and sentence-level features from the input texts, thus removing the need for manual feature engineering and the dependence on advanced features (e.g., API gazetteers) beyond the input texts. We also propose to adopt transfer learning to adapt a source-library-trained model to a target library, thus reducing the overhead of manual training-data labeling when the software texts of multiple programming languages and libraries need to be processed. We conduct extensive experiments with six libraries of four programming languages which support diverse functionalities and have different API-naming and API-mention characteristics. Our experiments investigate the performance of our neural architecture for API extraction in informal software texts, the importance of different features, and the effectiveness of transfer learning. Our results confirm not only the superior performance of our neural architecture over existing machine learning based methods for API extraction in informal software texts, but also the easy-to-deploy characteristic of our neural architecture.

Index Terms—API extraction, CNN, Word embedding, LSTM, Transfer learning

1 INTRODUCTION

APPLICATION Programming Interfaces (APIs) are a set of definitions, functions and modules for developing software programs. To support the use of APIs and solve usage issues, developers not only create formal API specifications and tutorials (e.g., Java API, Android Developers), but also generate large numbers of informal discussions on APIs (e.g., Stack Overflow questions and answers) [1]. Distinguishing API mentions from general natural language words in API documentation is referred to as API extraction or API recognition in the literature [2]. Fig. 1 illustrates an example of API extraction in natural language sentences. API extraction is crucial for many downstream applications. For traceability recovery across software documents, API extraction lays the foundation for linking code-like terms to specific code elements in an API or API documentation [3], [4]. For entity-centric search, API extraction can be exploited to create a thesaurus of software-specific terms and commonly used morphological forms [5], [6], build an API caveats knowledge graph [7], and search for appropriate APIs for programming tasks [8]. For domain-specific question answering, API extraction can help select answer paragraphs and generate useful answer summaries [9].

• Suyu Ma, Chunyang Chen (corresponding author) and Lizhen Qu are with the Faculty of Information Technology, Monash University, Australia. E-mail: masuyu2015@outlook.com, chunyang.chen@monash.edu, Lizhen.Qu@monash.edu.

• Zhenchang Xing is with the College of Engineering & Computer Science, Australian National University, Australia. E-mail: zhenchang.xing@anu.edu.au.

• Cheng Chen is with PricewaterhouseCoopers Firm, China. E-mail: cc94226@live.com.

• Guoqiang Li (corresponding author) is with the School of Software, Shanghai Jiao Tong University, China. E-mail: li-gq@cs.sjtu.edu.cn.

Manuscript received October 31, 2018; revised August 24, 2019.

Unlike formal API documentation, where API mentions are consistently written and annotated, API mentions in informal software texts usually lack consistent writing forms and annotations [2]. For example, the methods apply and bfill in Fig. 1 are not mentioned by their fully qualified names and are not annotated with a special tag like <code>. Furthermore, an API may have a common-word simple name (e.g., apply, series). Our analysis of API simple names in six libraries of four programming languages reveals that 6% to 66% (median 42%) of the APIs of these libraries have common-word simple names. Such APIs are referred to as polysemous APIs [2], because mentioning them by their simple names without a special tag creates a common-word polysemy issue for API extraction [2].

Polysemous API mentions, together with other informalities of API mentions, render rule-based extraction of API mentions unreliable for informal software texts. Recently, several machine learning based API extraction methods [2], [10] have been proposed. These machine learning based methods have demonstrated superior performance over rule-based methods for API extraction in informal texts. However, a major barrier to deploying such machine learning based methods is the significant overhead required for manual labeling of model training data and manual feature engineering.

API extraction in informal software texts can be regarded as a Named Entity Recognition (NER) task [11] in Natural Language Processing (NLP). An NER task in general English text detects mentions of named entities, such as people, locations and organizations, and it deals with a single language. However, API extraction has to deal with hundreds of libraries and frameworks discussed by developers. Training a reliable machine learning based API extraction model for a


Fig. 1. Illustrating the API Extraction Task

library often requires several hundred manually labeled sentences mentioning this library's APIs [2]. The effort to prepare training data for hundreds of libraries would be prohibitive. Furthermore, it may also be difficult to prepare sufficient high-quality training data for APIs of some less frequently discussed libraries or frameworks.

Another related challenge is to select effective features for a machine learning model to recognize a particular library's APIs. Although developers follow general naming conventions, orthographic features of APIs still vary greatly from one library to another. For example, as reported in Section 2, different libraries have different percentages of polysemous API names. Furthermore, users of some libraries tend to mention APIs with clear orthographic features (e.g., package names, brackets and/or dots), while users of other libraries tend to directly mention API simple names. Functionalities of software libraries also vary greatly, such as graphical user interface, numeric computation, machine learning, data visualization and database access. As such, discussion contexts of one library's APIs, like Pandas (a Python machine learning library), often differ from those of another library's APIs, like JDBC (a Java database access library).

Designers of a machine learning based API extraction model have to manually select the most effective features for different libraries' APIs. This is a challenging task, as there are dozens of features to choose from.¹ Unsupervised word embeddings have been explored for API extraction tasks [2], but there has been no work on exploiting character- and sentence-context embeddings from input texts for API extraction. Furthermore, some advanced features to boost API extraction performance, such as word clusters and API gazetteers, have to be hand-crafted. Without such advanced features, existing machine learning based API extraction methods perform poorly using only orthographic features from the input texts [2].

We define easy deployment as the ability to train the model for different datasets without requiring any manual feature engineering. To make machine learning based API extraction methods easy to deploy in practice, we must reduce the overhead of preparing training data and effective features, and remove the dependence on additional features beyond input texts. In this paper, we design a neural architecture for API extraction in informal software text. Our neural architecture is composed of a character-level convolutional neural network (CNN), word-

1. https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html

level embeddings and a sentence-level Bi-directional Long Short-Term Memory (Bi-LSTM) network for automatically learning character-, word- and sentence-level features from input texts, respectively. This neural architecture can be trained in an end-to-end fashion, thus removing the need for manual feature engineering and the need for additional features beyond input texts, and greatly reducing the amount of new training data needed for adapting a model to different libraries.

Furthermore, our analysis of the API-naming and API-mention characteristics suggests that the characteristics of API names, API mentions and discussion contexts differ across libraries, but also share a certain level of commonality. To exploit such commonalities for easy deployment of API extraction models, we adopt transfer learning [12], [13] to fine-tune a model trained with a source library's API discussion texts to a target library. This helps reduce the amount of training data required for training a high-quality target-library model, compared with training the model from scratch with randomly initialized model parameters. The design of our multi-level neural architecture enables the fine-tuning of different levels of features in transfer learning.

We conduct extensive experiments to evaluate the performance of the proposed neural architecture for API extraction, as well as the effectiveness of transfer learning. Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As discussed in Section 2, these six libraries support diverse functionalities and have different API-naming, API-mention and discussion-context characteristics. We manually label API mentions in 3600 Stack Overflow posts (600 for each library) for the experiments. Our experiments confirm the effectiveness of our neural architecture in learning multi-level features from the input texts, and show that the learned features can support high-quality API extraction in informal software texts without the need for additional hand-crafted features beyond the input texts. Our experiments also confirm the effectiveness of transfer learning [14] in boosting the target-library model performance with much less training data, even in few-shot (about 10 posts) training settings.

This paper makes the following four contributions:

• Our work is the first to consider not only the performance of machine learning based API extraction methods, but also the easy deployment of such methods for the software text of multiple programming languages and libraries.

• We propose a multi-layer neural architecture to automatically learn effective features from the input texts for API extraction, thus removing the need for manual feature engineering as well as the dependence on features beyond the input texts.

• We adopt transfer learning to reduce the overhead of manual labeling of the training data of a subject library. We evaluate the effectiveness of transfer learning across libraries and programming languages, and analyze the factors that affect its effectiveness.

• We conduct extensive experiments to evaluate our architecture as a whole as well as its components. Our results reveal insights into the design of effective mechanisms for API extraction tasks.


TABLE 1: Statistics of Polysemous APIs

Library     APIs   Polysemous APIs   Percentage
Matplotlib  3877   622               16.04%
Pandas      774    426               55.04%
Numpy       2217   917               41.36%
Opengl      850    52                6.12%
React       238    157               65.97%
JDBC        1468   633               43.12%

The remainder of the paper is organized as follows. Section 2 reports our empirical studies of the characteristics of API names, API mentions and discussion contexts. Section 3 defines the problem of API extraction. Section 4 and Section 5 describe our neural architecture for API extraction and the system implementation, respectively. Section 6 reports our experiment results and findings. Section 7 reviews the related work. Section 8 concludes our work and discusses future work.

2 EMPIRICAL STUDIES OF API-NAMING AND API-MENTION CHARACTERISTICS

In this work, we aim to develop a machine learning based API extraction method that is not only effective but also easy to deploy across programming languages and libraries. To understand the challenges in achieving this objective and the potential solution space, we conduct empirical studies of the characteristics of API names, API mentions in informal texts, and the discussion contexts in which APIs are mentioned.

We study six libraries: three Python libraries, Matplotlib (data visualization), Pandas (machine learning) and Numpy (numeric computation); one C library, OpenGL (computer graphics); one JavaScript library, React (graphical user interface); and one Java library, JDBC (database access). These libraries come from four popular programming languages and support very diverse functionalities for computer programming.

First, we crawl API declarations of these libraries from their official websites. When different APIs have the same simple name but different arguments in the same library, we treat such APIs as the same. We examine each API name to determine whether the simple name of an API is a common word (e.g., apply, series, draw) that can be found in a general English dictionary. We find that different libraries have different percentages of APIs with common-word simple names: OpenGL (6%), Matplotlib (16%), Numpy (41%), JDBC (43%), Pandas (55%), React (66%). When these APIs are mentioned by their common-word simple names, neither character- nor word-level features can help to distinguish such polysemous API mentions from common words. We must resort to the discussion contexts of API mentions.
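To make the common-word check concrete, the following is a minimal Python sketch; the word list and function name are this sketch's assumptions, not the authors' tooling.

```python
# A minimal sketch of the common-word check: an API simple name is "polysemous"
# if it appears in a general English dictionary. The word list here is a toy
# stand-in for a real dictionary.
english_words = {"apply", "series", "draw", "plot", "figure", "list"}

def is_polysemous(api_qualified_name):
    simple_name = api_qualified_name.split(".")[-1].lower()
    return simple_name in english_words

print(is_polysemous("pandas.DataFrame.apply"))  # True
print(is_polysemous("numpy.linalg.svd"))        # False
```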

Second, by checking post tags, we randomly sample 200 Stack Overflow posts for each of the six libraries and manually label the API mentions in these posts. We examine three characteristics of API mentions: whether API mentions contain explicit orthographic features (package or module names, parentheses and/or dots), whether API mentions are long tokens (> 10 characters), and whether the context windows (preceding and succeeding 5 words) around the API mentions contain common verbs and nouns (use, call,

TABLE 2: Statistics of API-Mention Characteristics

Library     Orthographic   Long tokens   Common context words
Matplotlib  62.38%         21.56%        20.64%
Pandas      67.11%         32.22%        34.23%
Numpy       65.63%         26.87%        23.53%
Opengl      33.73%         39.36%        20.80%
React       75.56%         20.00%        7.93%
JDBC        26.36%         61.82%        8.11%
Average     55.13%         33.64%        19.21%

function, method). Table 2 shows our analysis results. On average, 55.13% of API mentions contain explicit orthographic features and 33.64% of API mentions are long tokens. Character-level features would be useful for recognizing these API mentions. However, for the significant number of API mentions that do not have such explicit character-level features, we need to resort to word- and/or sentence-context features, for example, the words (e.g., use, call, function, method) that often appear in the context window of an API mention, to recognize API mentions.

Furthermore, we can observe that API-mention characteristics are not tightly coupled with a particular programming language or library. Instead, all six libraries exhibit a certain degree of the three API-mention characteristics, but the specific degrees vary across libraries. Fig. 2 visualizes the top 50 frequently-used words in the discussions of the six libraries. We can observe that discussions of different libraries share common words but at the same time use library-specific words (e.g., dataframe for Pandas, matrix for Numpy, figure for Matplotlib, query for JDBC, render for OpenGL, event for React). The commonalities of API-mention characteristics across libraries indicate the feasibility of transfer learning. For example, orthographic, word and/or sentence-context features learned from a source library could be applicable to a target library. However, due to the variations of API-name, API-mention and discussion-context characteristics across libraries, directly applying the source-library-trained model to the target library may suffer from performance degradation, unless the source and target libraries have very similar characteristics. Therefore, fine-tuning the source-library-trained model with a small amount of target-library text would be necessary.

3 PROBLEM DEFINITION

In this work, we take as input informal software text (e.g., Stack Overflow posts) that discusses the usage and issues of a particular library. We assume that the software text of multiple programming languages and libraries needs to be processed. Given a paragraph of informal software text, our task is to recognize all API mentions (if any) in the paragraph, as illustrated in the example in Fig. 1. API mentions refer to tokens in the paragraph that represent public modules, classes, methods or functions of a particular library. To preserve the integrity of code-like tokens, an API mention is defined as a single token rather than a span of tokens, when the given text is tokenized properly.
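To make this formulation concrete, the following minimal Python illustration shows how each token in a tokenized sentence receives an API or non-API label; the token/label pairs are our own toy example, not the authors' data format.

```python
# Toy illustration of the sequence labeling formulation: one label per token,
# with "API" marking API mentions and "O" marking ordinary words.
sentence = ["You", "can", "use", "apply", "on", "a", "series", "object", "."]
labels   = ["O",   "O",   "O",   "API",   "O",  "O", "API",    "O",      "O"]

# A training example for the model is the (token sequence, label sequence) pair.
assert len(sentence) == len(labels)
for token, label in zip(sentence, labels):
    print(f"{token}\t{label}")
```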

As our input is informal software text, APIs may not be consistently mentioned by their formal full names. Instead, APIs may be mentioned in abbreviations or synonyms, such as pandas's dataframe for pandas.DataFrame, or df.apply for


Fig. 2. Word Clouds of the Top 50 Frequent Words in the Discussions of the Six Libraries

pandas.DataFrame.apply(). APIs may also be mentioned by their simple names, such as apply, series and dataframe, which can also be common English words or computing terms in the text. Furthermore, we do not assume that API mentions will be consistently annotated with special tags. Therefore, our approach takes plain text as input.

A related task to our work is API linking. API extraction methods classify whether a token in text is an API mention or not, while API linking methods link API mentions in text to API entities in a knowledge base [15]. That is, API extraction is the prerequisite for API linking. This work deals with only API extraction.

4 OUR NEURAL ARCHITECTURE

We formulate the task of API extraction in informal software texts as a sequence labeling problem, and present a neural architecture that labels each token in an input text as API or non-API. As shown in Fig. 3, our neural architecture is composed of four main components: 1) a character-level Convolutional Neural Network (CNN) for extracting character-level features of a token (Section 4.1); 2) an unsupervised word embedding model for learning word semantics of a token (Section 4.2); 3) a Bidirectional Long Short-Term Memory network (Bi-LSTM) for extracting sentence-context features (Section 4.3); and 4) a softmax classifier for predicting the API or non-API label of a token (Section 4.4). Our neural model can be trained end-to-end with pairs of input texts and their corresponding API/non-API label sequences. A model trained with one library's text can be transferred to another library's text by fine-tuning

Fig. 3. Our Neural Architecture for API Extraction

source-library-trained components with the target library's text (Section 4.5).

4.1 Extracting Char-Level Features by CNN

API mentions often have morphological features that distinguish them from normal English words. Such morphological features may appear at the beginning of a token (e.g., the first-letter capitalization in System), in the middle (e.g., the underscore in read_csv, the dot in pandas.series, the left


Fig. 4. Our Character Vocabulary

Fig. 5. Character-Level CNN

parenthesis and comma in print(a,32)), or at the end (e.g., the right parenthesis in apply()). Morphological features may also appear in combination, such as the camelcase writing in AbstractAction or a pair of parentheses like plot(). The long length of some tokens, like create_dataset, is one important morphological feature as well. Due to the lack of a universal naming convention across libraries and the wide presence of informal writing forms, informative morphological features of API mentions often vary from one library's text to another.

Robust methods to extract morphological features from tokens must take into account all characters of the token and determine which features are more important for a particular library's APIs [16]. To that end, we use a character-level CNN [17], which extracts local features from N-gram characters of the token using a convolution operation, and then combines them using a max-pooling operation to create a fixed-size character-level embedding of the token [18], [19].

Let $V^{char}$ be the vocabulary of characters for the software texts from which we want to extract API mentions. In this work, $V^{char}$ for all of our models consists of 92 characters, including 26 English letters (both upper and lower case), 10 digits, and 30 other characters (e.g., punctuation and bracket symbols), as listed in Fig. 4. Note that $V^{char}$ can be easily extended for different datasets. Let $E^{char} \in \mathbb{R}^{d^{char} \times |V^{char}|}$ be the character embedding matrix, where $d^{char}$ is the dimension of character embeddings and $|V^{char}|$ is the vocabulary size (92 in this work). As illustrated in Fig. 3, $E^{char}$ can be regarded as a dictionary of character embeddings, in which a $d^{char}$-dimensional column vector corresponds to a particular character. The character embeddings are initialized as one-hot vectors and then learned during the training of the character-level CNN. Given a character $c \in V^{char}$, its embedding $e^c$ can be retrieved by the matrix-vector product $e^c = E^{char} v^c$, where $v^c$ is a one-hot vector of size $|V^{char}|$ which has value 1 at index $c$ and zero in all other positions.

Fig. 5 presents the architecture of our character-level CNN. Given a token $w$ in the input text, let us assume $w$ is composed of $M$ characters $c_1, c_2, \ldots, c_M$. We first obtain a sequence of character embeddings $e^{c_1}, e^{c_2}, \ldots, e^{c_M}$ by looking up the character embedding matrix $E^{char}$. This sequence of character embeddings (with zero-padding at the beginning and the end of the sequence) is the input to our character-level CNN. In our application of the CNN, because each character is represented as a $d^{char}$-dimensional vector, we use convolution filters with widths equal to the dimensionality of the character embeddings (i.e., $d^{char}$). Then we can vary the height $h$ (or window size) of the filter, i.e., the number of adjacent characters considered jointly in the convolution operation.

Let $z_m$ be the concatenation of the character embedding of $c_m$ ($1 \le m \le M$), the $(h-1)/2$ left neighbors of $c_m$, and the $(h-1)/2$ right neighbors of $c_m$. A convolution operation involves a filter $W \in \mathbb{R}^{h \cdot d^{char}}$ (a matrix of $h \times d^{char}$ dimensions) and a bias term $b \in \mathbb{R}$, which is applied repeatedly to each character window of size $h$ in the input sequence $c_1, c_2, \ldots, c_M$:

$$o_m = ReLU(W^T \cdot z_m + b)$$

where $ReLU(x) = \max(0, x)$ is the non-linear activation function. The convolution operations produce an $M$-dimensional feature map for a particular filter. A 1D max-pooling operation is then applied to the feature map to extract a scalar (i.e., a feature vector of length 1) with the maximum value in the $M$ dimensions of the feature map.

The convolution operation extracts local features within each character window of the given token, and by taking the max over all character windows of the token, we extract a global feature for the token. We can apply $N$ filters to extract different features from the same character window. The output of the character-level CNN is an $N$-dimensional feature vector representing the character-level embedding of the given token. We denote this embedding as $e^{char}_w$ for the token $w$.

In our character-level CNN, the matrix $E^{char}$, the filter matrices $W$, and the bias terms $b$ are parameters to be learned. The dimensionality of the character embedding $d^{char}$, the number of filters $N$, and the window size of the filters $h$ are hyper-parameters to be chosen by the user (see Section 5.2 for model configuration).

4.2 Learning Word Semantics by Word Embedding

In informal software texts, the same API is often mentioned in many non-standard abbreviations and synonyms [5]. For example, the Pandas library is often written as pd, and its module DataFrame is often written as df. Furthermore, there is a lack of consistent use of verbs, nouns and prepositions in the discussions [2]. For example, in the sentences "I have decided to use apply ...", "if you run apply on a series ..." and "I tested with apply ...", users refer to the Pandas method apply(), but their descriptions vary greatly.

Such variations result in an out-of-vocabulary (OOV) issue for a machine learning model [20], i.e., variations that have not been seen in the training data. For the easy deployment of a machine learning model for API extraction, it is impractical to address the OOV issue by manually labeling a huge amount of data and developing a comprehensive gazetteer of API names and their common variations [2]. However, without knowledge about the variations of semantically


similar words, the trained model will be restricted to the examples that it sees in the training data.

To address this dilemma, we propose to exploit unsupervised word-embedding models [21], [22], [23], [24] to learn distributed word representations from a large amount of unlabeled text discussing a particular library. Many studies [25], [26], [27] have shown that distributed word representations can capture rich semantics of words, such that semantically similar words will have similar word embeddings.

Let $V^{word}$ be the vocabulary of words for the corpus of software texts to be processed. As illustrated in Fig. 3, word-level embeddings are encoded by column vectors in a word embedding matrix $E^{word} \in \mathbb{R}^{d^{word} \times |V^{word}|}$, where $d^{word}$ is the dimension of word embeddings and $|V^{word}|$ is the vocabulary size. Each column $E^{word}_i \in \mathbb{R}^{d^{word}}$ corresponds to the word-level embedding of the $i$-th word in the vocabulary $V^{word}$. We can obtain a token $w$'s word-level embedding $e^{word}_w$ by looking up $E^{word}$ with the word $w$, i.e., $e^{word}_w = E^{word} v^w$, where $v^w$ is a one-hot vector of size $|V^{word}|$ which has value 1 at index $w$ and zero in all other positions. The matrix $E^{word}$ is to be learned using unsupervised word embedding models (e.g., GloVe [28]), and $d^{word}$ is a hyper-parameter to be chosen by the user (see Section 5.2 for model configuration).

In this work, we adopt the Global Vectors for Word Representation (GloVe) method [28] to learn the matrix $E^{word}$. GloVe is an unsupervised algorithm for learning word representations based on the statistics of word co-occurrences in an unlabeled text corpus. It calculates the word embeddings based on a word co-occurrence matrix $X$. Each row in the word co-occurrence matrix corresponds to a word and each column corresponds to a context. $X_{ij}$ is the frequency of word $i$ co-occurring with word $j$, and $X_i = \sum_k X_{ik}$ ($1 \le k \le |V^{word}|$) is the total number of occurrences of word $i$ in the corpus. The probability of word $j$ occurring in the context of word $i$ is $P_{ij} = P(j|i) = X_{ij}/X_i$. We have $\log(P_{ij}) = \log(X_{ij}) - \log(X_i)$.

GloVe defines $\log(P_{ij}) = e_{w_i}^T e_{w_j}$, where $e_{w_i}$ and $e_{w_j}$ are the word embeddings to be learned for the words $w_i$ and $w_j$. This gives the constraint for each word pair as $\log(X_{ij}) = e_{w_i}^T e_{w_j} + b_i + b_j$, where $b$ is the bias term for $e_w$. The cost function for minimizing the loss of word embeddings is defined as

$$\sum_{i,j=1}^{|V^{word}|} f(X_{ij}) \left( e_{w_i}^T e_{w_j} + b_i + b_j - \log(X_{ij}) \right)^2$$

where $f(X_{ij})$ is a weighting function. That is, GloVe learns word embeddings by a weighted least squares regression model.

4.3 Extracting Sentence-Context Features by Bi-LSTM

In informal software texts, many API mentions cannot be reliably recognized by simply examining a token's character-level features and word semantics. This is because many APIs are named using common English words (e.g., series, apply, plot) or common computing terms (e.g., dataframe, sigmoid, histogram, argmax, zip, list). When such APIs are mentioned by their simple names, this results in a common-word polysemy issue for API extraction [2]. In such situations, we have to disambiguate the API sense of a common word from the normal sense of the word.

To that end, we have to look into the sentence context in which an API is mentioned. For example, by looking into the sentence contexts of the two sentences in Fig. 1, we can determine that the "apply" in the first sentence is an API mention, while the "apply" in the second sentence is not. Note that both the preceding and succeeding context of the token "apply" are useful for disambiguating the API or the normal sense of the token.

We use a Recurrent Neural Network (RNN) to extract sentence-context features for disambiguating the API or the normal sense of a word [29]. An RNN is a class of neural networks where connections between units form directed cycles; it is widely used in the software engineering domain [30], [31], [32], [33]. Due to this nature, it is especially useful for tasks involving sequential inputs [29], like sentences. In our task, we adopt a Bidirectional RNN (Bi-RNN) architecture [34], [35], which is composed of two LSTMs: one takes input from the beginning of the text forward till a particular token, while the other takes input from the end of the text backward till that token.

The input to an RNN is a sequence of vectors. In our task, we obtain the input vector of a token in the input text by concatenating the character-level embedding and the word-level embedding of the token $w$, i.e., $e^{char}_w \oplus e^{word}_w$. An RNN recursively maps an input vector $x_t$ and a hidden state $h_{t-1}$ to a new hidden state $h_t$: $h_t = f(h_{t-1}, x_t)$, where $f$ is a non-linear activation function (e.g., an LSTM unit, used in this work). A hidden state is a vector $e^{sent} \in \mathbb{R}^{d^{sent}}$ summarizing the sentence-level features till the input $x_t$, where $d^{sent}$ is the dimension of the hidden state vector, to be chosen by the user. We denote $e^{sent}_f$ and $e^{sent}_b$ as the hidden states computed by the forward LSTM and the backward LSTM after reading the end of the preceding and the succeeding sentence context of a token, respectively. $e^{sent}_f$ and $e^{sent}_b$ are concatenated into one vector $e^{sent}_w$ as the Bi-LSTM output for the corresponding token $w$ in the input text.

As an input text (e.g., a Stack Overflow post) can be a long text, modeling long-range dependencies in the text is crucial for our task. For example, a mention of a library name at the beginning of a post could be important for detecting a mention of a method of this library later in the post. Therefore, we adopt the LSTM unit [36], [37] in our RNN. The LSTM is designed to cope with the gradient vanishing problem in RNNs. An LSTM unit consists of a memory cell and three gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, some contexts can be cleared by the forget gate. The memory cell and the three gates have weights and bias terms to be learned during model training. The Bi-LSTM can extract the context feature. For instance, in the Pandas library sentence "This can be accomplished quite simply with the DataFrame method apply", based on the context information from the word method, the Bi-LSTM can help our model classify the word apply as an API mention, and the learning can be


transferred across different languages and libraries with transfer learning.
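A minimal PyTorch sketch of this sentence-context extractor follows; the framework and tensor shapes are our illustrative assumptions, with dimensions taken from Section 5.2.

```python
import torch
import torch.nn as nn

# Sketch of Section 4.3: each token's input vector is its char-level embedding
# (40-d) concatenated with its word-level embedding (200-d), per Section 5.2.
d_char, d_word, d_sent = 40, 200, 50
bilstm = nn.LSTM(input_size=d_char + d_word, hidden_size=d_sent,
                 bidirectional=True, batch_first=True)

tokens = 9                                   # a 9-token sentence
e_char = torch.randn(1, tokens, d_char)      # from the character-level CNN
e_word = torch.randn(1, tokens, d_word)      # looked up in E^word
x = torch.cat([e_char, e_word], dim=2)       # e^char_w (+) e^word_w per token

e_sent, _ = bilstm(x)                        # forward/backward states concatenated
print(e_sent.shape)                          # torch.Size([1, 9, 100])
```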

4.4 API Labeling by Softmax Classification

Given the Bi-LSTM's output vector $e^{sent}_w$ for a token $w$ in the input text, we train a binary softmax classifier to predict whether the token is an API mention or a normal English word. That is, we have two classes in this work, i.e., API or non-API. Softmax predicts the probability for the $j$-th class given a token's vector $e^{sent}_w$ by

$$P(j \mid e^{sent}_w) = \frac{\exp(e^{sent}_w W_j^T)}{\sum_{k=1}^{2} \exp(e^{sent}_w W_k^T)}$$

where the vectors $W_k$ ($k = 1$ or $2$) are parameters to be learned.

4.5 Library Adaptation by Transfer Learning

Transfer learning [12], [38] is a machine learning method that stores knowledge gained while solving one problem and applies it to a different but related problem. It helps to reduce the amount of training data required for the target problem. When transfer learning is used in deep learning, it has been shown to be beneficial for narrowing down the scope of possible models on a task by using the weights of a model trained on a different but related task [14]; with shared parameters, transfer learning can help the model adapt the shared knowledge in a similar context [39].

API extraction tasks for different libraries can be regarded as a set of different but related tasks. Therefore, we use transfer learning to adapt a neural model trained with one library's text (the source-library-trained model) to another library's text (the target library). Our neural architecture is composed of four main components: character-level CNN, word embeddings, sentence Bi-LSTM and softmax classifier. We can transfer all or some source-library-trained model components to a target-library model. Without transfer learning, the parameters of a target-library model component will be randomly initialized and then learned using the target-library training data, i.e., trained from scratch. With transfer learning, we use the parameters of source-library-trained model components to initialize the parameters of the corresponding target-library model components. After transferring the model parameters, we can either freeze the transferred parameters or fine-tune them using the target-library training data.

5 SYSTEM IMPLEMENTATION

This section describes the current implementation² of our neural architecture for API extraction.

5.1 Preprocessing and Tokenizing Input Text

Our current implementation takes as input the content of a Stack Overflow post. As we want to recognize API mentions within natural language sentences, we remove stand-alone code snippets in <pre><code> tags. We then remove all HTML tags in the post content to obtain the plain-text

2. https://github.com/JOJO201/API_Extraction

input (see Section 3 for the justification of taking plain text as input). We develop a sentence parser to split the preprocessed post text into sentences by punctuation. Following [2], we develop a software-specific tokenizer to tokenize the sentences. This tokenizer preserves the integrity of code-like tokens. For example, it treats matplotlib.pyplot.imshow() as a single token, instead of the sequence of 7 tokens "matplotlib", ".", "pyplot", ".", "imshow", "(", ")" produced by general English tokenizers.
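A rough regex-based sketch of such a software-specific tokenizer is shown below; this is our approximation, and the authors' tokenizer (which follows [2]) is certainly more elaborate.

```python
import re

# A minimal sketch (not the authors' tokenizer): keep code-like tokens such as
# matplotlib.pyplot.imshow() as single tokens, split the rest on word boundaries.
CODE_TOKEN = r"[A-Za-z_][\w.]*\([^)]*\)|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+"
TOKEN = re.compile(CODE_TOKEN + r"|\w+|[^\w\s]")

print(TOKEN.findall("You can use matplotlib.pyplot.imshow() to show an image."))
# ['You', 'can', 'use', 'matplotlib.pyplot.imshow()', 'to', 'show', 'an', 'image', '.']
```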

5.2 Model Configuration

We now describe the hyper-parameter settings used in our current implementation. These hyper-parameters are tuned using the validation data during model training (see Section 6.1.2 for the description of our dataset). We find that the neural model has very similar performance across the six library datasets with the same hyper-parameters. Therefore, we keep the same hyper-parameters for the six library datasets, which also avoids the difficulty of tuning different hyper-parameters.

5.2.1 Character-level CNN

We set the filter window size h = 3. That is, the convolution operation extracts local features from 3 adjacent characters at a time. The size of our current character vocabulary $V^{char}$ is 92. Thus, we set the dimensionality of the character embedding $d^{char}$ at 92. We initialize the character embeddings with one-hot vectors (1 at one character index and zero in all other dimensions). The character embeddings will be updated through back propagation during the training of the character-level CNN. We experiment with 5 different values of N (the number of filters): 20, 40, 60, 80, 100. With N = 40, the CNN has an acceptable performance on the validation data. With N = 60 and above, the CNN has almost the same performance as N = 40, but it takes more training epochs to reach the acceptable performance. Therefore, we use N = 40. That is, the character-level embedding of a token has dimension 40.

5.2.2 Pre-trained word embeddings

Our experiments involve six libraries: Pandas, Matplotlib, NumPy, OpenGL, React and JDBC. We collect all questions tagged with these six libraries and all answers to these questions in the Stack Overflow Data Dump released on March 18, 2018. We obtain a text corpus of 380,971 posts. We use the same preprocessing and tokenization steps as described in Section 5.1 to preprocess and tokenize the content of these posts. Then we use GloVe [28] to learn word embeddings from this text corpus. We set the word vocabulary size $|V^{word}|$ at 40,000. The training epoch is set at 100 to ensure sufficient training of the word embeddings. We experiment with four different dimensions of word embeddings $d^{word}$: 50, 100, 200, 400. We use $d^{word} = 200$ in our current implementation, as it produces a good balance between training efficiency and the quality of word embeddings for API extraction on the validation data. We also experiment with word embeddings pre-trained on all Stack Overflow posts, which does not significantly affect the API extraction performance but requires much longer time for text preprocessing and word-embedding learning.


5.2.3 Sentence-context Bi-LSTM

For the RNN, we use 50 hidden LSTM units to store the hidden states. The dimension of the hidden state vector is 50. Therefore, the dimension of the output vector $e^{sent}_w$ is 100 (concatenating forward and backward hidden states). In order to mitigate overfitting [40], a dropout layer is added on the output of the Bi-LSTM, with a dropout rate of 0.5.

5.3 Model Training

To train our neural model, we use input texts and their corresponding sequences of API/non-API labels (see Section 6.1.2 for our data labeling process). The optimizer used is Adam [41], which performs well for training RNNs, including Bi-LSTMs [35]. The training epoch is set at 40. The model with the best performance on the validation set is saved for testing. This model's parameters are also saved for the transfer learning experiments.

6 EXPERIMENTS

We conduct a series of experiments to answer the following four research questions:

• RQ1: How well can our neural architecture learn multi-level features from the input texts? Can the learned features support high-quality API extraction compared with existing machine learning based API extraction methods?

• RQ2: What is the impact of the three feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) on the API extraction performance?

• RQ3: How effective is transfer learning for adapting API extraction models across libraries of the same language, with different amounts of target-library training data?

• RQ4: How effective is transfer learning for adapting API extraction models across libraries of different languages, with different amounts of target-library training data?

6.1 Experiment Setup

This section describes the libraries used in our experiments, how we prepare the training and testing data, and the evaluation metrics of API extraction performance.

6.1.1 Studied libraries

Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As reported in Section 2, these six libraries support very diverse functionalities for computer programming and have distinct API-naming and API-mention characteristics. Using these libraries, we can evaluate the effectiveness of our neural architecture for API extraction in very diverse data settings. We can also gain insights into the effectiveness of transfer learning for API extraction and the transferability of our neural architecture in different settings.

6.1.2 Dataset

We collect Stack Overflow posts (questions and their answers) tagged with pandas, numpy, matplotlib, opengl, react or jdbc as our experimental data. We use the Stack Overflow Data Dump released on March 18, 2018. In this data dump,

TABLE 3: Basic Statistics of Our Dataset

Library     Posts   Sentences   API mentions   Tokens
Matplotlib  600     4920        1481           47317
Numpy       600     2786        1552           39321
Pandas      600     3522        1576           42267
Opengl      600     3486        1674           70757
JDBC        600     4205        1184           50861
React       600     3110        1262           42282
Total       3600    22029       8729           292805

380,971 posts are tagged with one of the six studied libraries. Our data collection process follows four criteria. First, the number of posts selected and the number of API mentions in these posts should be of the same order of magnitude. Second, API mentions should exhibit the variations of API writing forms commonly seen on Stack Overflow. Third, the selection should avoid repeatedly selecting frequently mentioned APIs. Fourth, the same post should not appear in the datasets of different libraries (one post may be tagged with two or more studied-library tags). The first criterion ensures the fairness of comparison across libraries, the second and third criteria ensure the representativeness of API mentions in the dataset, and the fourth criterion ensures that there is no repeated data in different datasets.

We finally include 3600 posts (600 for each library) in our dataset, and each post has at least one API mention.³ Table 3 summarizes the basic statistics of our dataset. These posts have in total 22,029 sentences, 292,805 token occurrences and 41,486 unique tokens after data preprocessing and tokenization. We manually label the API mentions in the selected posts; 6421 sentences (34.07%) contain at least one API mention. Our dataset has in total 8729 API mentions. The selected Matplotlib, NumPy, Pandas, OpenGL, JDBC and React posts have 1481, 1552, 1576, 1674, 1184 and 1262 API mentions, which refer to 553, 694, 293, 201, 282 and 83 unique APIs of the six libraries, respectively. In addition, we randomly sample 100 labeled Stack Overflow posts from each library dataset. We examine each API mention to determine whether it is a simple name or not, and find that among the 600 posts there are 1634 API mentions, of which 694 (42.47%) are simple API names. Our dataset not only contains a large number of API mentions in diverse writing forms, but also contains rich discussion context around API mentions. This makes it suitable for the training of our neural architecture, the testing of its performance, and the study of model transferability.

We randomly split the 600 posts of each library into three subsets by a 6:2:2 ratio: 60% as training data, 20% as validation data for tuning model hyper-parameters, and 20% as testing data to answer our research questions. Since cross-validation would cost a great deal of time and our training dataset is unbiased (shown in Section 6.4), cross-validation is not used. Our experiment results also show that the training dataset is enough to train a high-performance neural network. Besides, the performance of the model on a target library can be improved by adding more training data from other libraries. And if more training data is added, there is no need to label a new validation dataset for the target library to re-tune the hyper-parameters, because we find that the

3. https://drive.google.com/drive/folders/1f7ejNVUsew9l9uPCj4Xv5gMzNbqttpo?usp=sharing


TABLE 4: Comparison of CRF Baselines [2] and Our Neural Model for API Extraction

            Basic CRF              Full CRF               Our method
Library     Prec   Recall  F1     Prec   Recall  F1     Prec   Recall  F1
Matplotlib  67.35  47.84   56.62  89.62  61.31   72.82  81.50  83.92   82.70
NumPy       75.83  39.91   52.29  89.21  49.31   63.51  78.24  62.91   69.74
Pandas      62.06  70.49   66.01  97.11  71.21   82.16  82.73  85.30   82.80
Opengl      43.91  91.87   59.42  94.57  70.62   80.85  85.83  83.74   84.77
JDBC        15.83  82.61   26.58  87.32  51.40   64.71  84.69  55.31   66.92
React       16.74  88.58   28.16  97.42  70.11   81.53  87.95  78.17   82.77

neural model has very similar performance with the same hyper-parameters under different dataset settings.
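For concreteness, the 6:2:2 split described above can be sketched as follows; the fixed seed is our addition for reproducibility and is not reported in the paper.

```python
import random

posts = list(range(600))      # stand-in for one library's 600 labeled posts
random.seed(42)               # hypothetical seed; the paper does not report one
random.shuffle(posts)
train, valid, test = posts[:360], posts[360:480], posts[480:]
print(len(train), len(valid), len(test))   # 360 120 120
```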

6.1.3 Evaluation metrics

We use precision, recall and F1-score to evaluate the performance of an API extraction method. Precision measures what percentage of the API mentions recognized by a method are correct. Recall measures what percentage of the API mentions in the testing dataset are recognized correctly by a method. F1-score is the harmonic mean of precision and recall, i.e., $2 \times (precision \times recall)/(precision + recall)$.
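These metrics reduce to simple token-level counts; a minimal sketch follows (our own helper, with "API"/"O" as stand-in labels).

```python
def prf1(gold, pred):
    """Token-level precision/recall/F1 for API extraction (Section 6.1.3).
    gold, pred: sequences of 'API' / 'O' labels for the same tokens."""
    tp = sum(g == p == "API" for g, p in zip(gold, pred))
    fp = sum(g != "API" and p == "API" for g, p in zip(gold, pred))
    fn = sum(g == "API" and p != "API" for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1(["API", "O", "API", "O"], ["API", "API", "O", "O"]))  # (0.5, 0.5, 0.5)
```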

6.2 Performance of Our Neural Model (RQ1)

Motivation. Recently, a linear-chain CRF has been used to solve the API extraction problem in informal text [2]. That work shows that machine-learning based API extraction outperforms several commonly-used rule-based methods [2], [42], [43]. The approach in [2] relies on human-defined orthographic features, two different unsupervised language models (class-based Brown clustering and neural-network based word embedding) and API gazetteer features (an API inventory). In contrast, our neural architecture uses neural networks to automatically learn character-, word- and sentence-level features from the input texts. The first RQ is to confirm the effectiveness of these neural-network feature extractors for API extraction tasks, by comparing the overall performance of our neural architecture with the performance of the linear-chain CRF with human-defined features for API extraction.

Approach. We use the implementation of the CRF model in [2] to build the two CRF baselines. The basic CRF baseline uses only the orthographic features developed in [2]. The full CRF baseline uses all orthographic, word-cluster and API gazetteer features developed in [2]. The basic CRF baseline is easy to deploy because it uses only the orthographic features in the input texts, but not any advanced word-cluster and API gazetteer features. The self-training process is not used in these two baselines, since we have sufficient training data. In contrast, the full CRF baseline uses advanced features, so it is not as easy to deploy as the basic CRF. But the full CRF has much better performance than the basic CRF, as reported in [2]. Comparing our model with these two baselines, we can understand whether our model can achieve a good tradeoff between easy deployment and high performance. Note that as Ye et al. [2] already demonstrate the superior performance of machine-learning based API extraction methods over rule-based methods, we do not consider rule-based baselines for API extraction in this study.

Results. From Table 4, we can see that:

• Although the linear CRF with full features has performance close to our model, the linear CRF with only orthographic features performs poorly in distinguishing API tokens from non-API tokens. The best F1-score of the basic CRF is only 0.66, for Pandas. The F1-score of the basic CRF for Matplotlib, Numpy and OpenGL is 0.52-0.59. For JDBC and React, the F1-score of the basic CRF is below 0.3. Our results are consistent with the results in [2] when advanced word-cluster and API-gazetteer features are ablated. Compared with the basic CRF, the full CRF performs much better. For Pandas, JDBC and React, the full CRF has very close performance to our model. But for Matplotlib, Numpy and OpenGL, our model still outperforms the full CRF by at least four points in F1-score.

• Multi-level feature embedding by our neural architecture is effective in distinguishing API tokens from non-API tokens in the resulting embedding space. All evaluation metrics of our neural architecture are significantly higher than those of the basic CRF. Although our neural architecture performs relatively worse on NumPy and JDBC, its F1-score is still much higher than the F1-score of the basic CRF on NumPy and JDBC. We examine false positives (non-API tokens recognized as API) and false negatives (API tokens recognized as non-API) by our neural architecture on NumPy and JDBC. We find that many false positives and false negatives involve API mentions composed of complex strings, for example, array expressions for Numpy or SQL statements for JDBC. It seems that the neural networks learn some features from such complex strings that may confuse the model and cause the failure to tell apart API tokens from non-API ones. Furthermore, by analysing the classification results for different API mentions, we find that our model has good generalizability on different API functions.

With a linear CRF for API extraction, we cannot achieve a good tradeoff between easy deployment and high performance. In contrast, our neural architecture can extract effective character-, word- and sentence-level features from the input texts, and the extracted features alone can support high-quality API extraction without using any hand-crafted advanced features such as word clusters and API gazetteers.

6.3 Impact of Feature Extractors (RQ2)

Motivation. Although our neural architecture achieves very good performance as a whole, we would like to further investigate how much the different feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) contribute to this good performance, and how different features affect the precision and recall of API extraction.


TABLE 5: The Results of Feature Ablation Experiments

            Ablating char CNN      Ablating Word Embeddings  Ablating Bi-LSTM       All features
            Prec   Recall  F1      Prec   Recall  F1         Prec   Recall  F1      Prec   Recall  F1
Matplotlib  75.70  72.32   73.98   81.75  63.99   71.79      82.59  69.34   75.44   81.50  83.92   82.70
Numpy       81.00  60.40   69.21   79.31  58.47   67.31      80.62  56.67   66.44   78.24  62.91   69.74
Pandas      80.50  83.28   81.82   81.77  68.01   74.25      83.22  75.22   79.65   82.73  85.30   82.80
Opengl      77.58  85.37   81.29   83.07  68.03   75.04      98.52  72.08   83.05   85.83  83.74   84.77
JDBC        68.65  61.25   64.47   64.22  65.62   64.91      99.28  43.12   66.13   84.69  55.31   66.92
React       69.90  85.71   77.01   84.37  75.00   79.41      98.79  65.08   78.47   87.95  78.17   82.77

Approach. We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without the character-level CNN, one without word embeddings, and one without the sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors.

Results. In Table 5, we highlight in bold the largest drop in each metric for a library when ablating a feature, compared with that metric with all features, and we underline the increase in each metric for a library when ablating a feature, compared with that metric with all features. We can see that:

bull The performance of our neural architecture is contributed bythe combined action of all its features Ablating any of thefeatures the F1-score degrades Ablating word embed-dings or Bi-LSTM causes the largest drop in F1-scorefor four libraries while ablating char-CNN causes thelargest drop in F1-score for two libraries Feature ablationhas higher impact on some libraries than others Forexample ablating char-CNN word embeddings and Bi-LSTM all cause significant drop in F1-score for Matplotliband ablating word embeddings causes significant dropin F1-score for Pandas and React In contrast ablating afeature causes relative minor drop in F1-score for NumpyJDBC and React This indicates different levels of featureslearned by our neural architecture can be more distinct forsome libraries but more complementary for others Whendifferent features are more distinct from one anotherachieving good API extraction performance relies moreon all the features

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating the char-CNN causes a drop in precision for five libraries (all except NumPy). The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating the char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating the char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating the char-CNN, ablating word embeddings or the Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for NumPy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating the Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating the Bi-LSTM, the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. The char-CNN is especially useful for filtering out non-API tokens, while word embeddings and the Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation: As described in Section 6.1.1, we intentionally choose three Python libraries with different functionalities: Pandas, Matplotlib and NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach: We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring the Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfer (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs, involving Matplotlib, are relatively more-distant-libraries transfer.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL) (Prec / Recall / F1, in %)

Data | NP→MPL                | PD→MPL                | MPL (from scratch)
1/1  | 82.64 / 89.29 / 85.84 | 81.02 / 80.06 / 80.58 | 87.67 / 78.27 / 82.70
1/2  | 81.84 / 84.52 / 83.16 | 71.61 / 83.33 / 76.96 | 81.38 / 70.24 / 75.40
1/4  | 71.83 / 75.89 / 73.81 | 67.88 / 77.98 / 72.65 | 81.22 / 55.36 / 65.84
1/8  | 70.56 / 75.60 / 72.98 | 69.66 / 73.81 / 71.71 | 75.00 / 53.57 / 62.50
1/16 | 73.56 / 72.02 / 72.78 | 66.48 / 72.02 / 69.16 | 80.70 / 27.38 / 40.89
1/32 | 72.56 / 78.83 / 73.69 | 71.47 / 69.35 / 70.47 | 97.50 / 11.60 / 20.74
DU   | 72.54 / 66.07 / 69.16 | 76.99 / 54.76 / 64.00 | –

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP) (Prec / Recall / F1, in %)

Data | MPL→NP                | PD→NP                 | NP (from scratch)
1/1  | 86.85 / 77.08 / 81.68 | 77.51 / 67.50 / 72.16 | 78.24 / 62.92 / 69.74
1/2  | 78.80 / 82.08 / 80.41 | 70.13 / 67.51 / 68.79 | 75.13 / 60.42 / 66.97
1/4  | 93.64 / 76.67 / 80.00 | 65.88 / 70.00 / 67.81 | 77.14 / 45.00 / 56.84
1/8  | 73.73 / 78.33 / 75.96 | 65.84 / 66.67 / 66.25 | 71.03 / 42.92 / 53.51
1/16 | 76.19 / 66.67 / 71.11 | 58.33 / 64.17 / 61.14 | 57.07 / 48.75 / 52.58
1/32 | 75.54 / 57.92 / 65.56 | 60.27 / 56.25 / 58.23 | 72.72 / 23.33 / 35.33
DU   | 64.16 / 65.25 / 64.71 | 62.96 / 59.32 / 61.08 | –

We use gradually-reduced target-library training data (1/1, i.e., all data, then 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison. We also use the source-library-trained


TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD) (Prec / Recall / F1, in %)

Data | MPL→PD                | NP→PD                 | PD (from scratch)
1/1  | 84.97 / 89.62 / 87.23 | 86.86 / 87.61 / 87.23 | 80.43 / 85.30 / 82.80
1/2  | 86.18 / 84.43 / 85.30 | 83.97 / 83.00 / 83.48 | 88.01 / 74.06 / 80.43
1/4  | 81.07 / 87.61 / 84.21 | 81.07 / 82.70 / 81.88 | 76.50 / 76.95 / 76.72
1/8  | 87.54 / 80.98 / 84.13 | 85.76 / 76.37 / 80.79 | 69.30 / 83.29 / 75.65
1/16 | 82.04 / 78.96 / 80.47 | 82.45 / 75.79 / 78.98 | 84.21 / 50.72 / 53.31
1/32 | 81.31 / 75.22 / 78.14 | 81.43 / 72.05 / 76.45 | 84.25 / 30.84 / 45.14
DU   | 71.65 / 40.06 / 51.39 | 75.00 / 35.45 / 48.14 | –

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD) over 10 Random Samples (Prec / Recall / F1, in %)

Data | MPL→PD                | NP→PD
1/16 | 83.33 / 77.81 / 80.43 | 83.39 / 75.22 / 79.09
1/32 | 79.50 / 73.78 / 76.53 | 86.15 / 71.15 / 78.30

model directly, without any fine-tuning (i.e., with 0 target-library training data), as a baseline. We also randomly select the 1/16 and 1/32 training data for NumPy → Pandas and Matplotlib → Pandas 10 times and train the model each time. Then we calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score.

Results: Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. The acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas) and DU (Direct Use). The last column of Tables 6-8 is the result of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performing target-library model than training the target-library model from scratch with the same proportion of training data. This is evident as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has an F1-score of 69.74%, whereas the NumPy model transferred from the Matplotlib model has an F1-score of 81.68% (a 17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 of the NumPy training data for fine-tuning, the F1-score (71.11%) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for a difficult dataset. Such performance boosts by transfer learning have also been observed in many studies [14], [44], [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has a better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of the F1-score (e.g., 3% of the F1-score for the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 of the Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has an F1-score of 80.47%.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 of the training data (i.e., about 10 posts) for transfer learning, the F1-score of the obtained target-library model is still improved by a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (the DU row). Furthermore, the averaged results of NP → PD and MPL → PD for the 10 randomly selected 1/16 and 1/32 training data are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows our training data is unbiased. Although the target-library models trained from scratch still have reasonable F1-scores with 1/2 to 1/8 of the training data, they become completely useless with 1/16 or 1/32 of the training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-scores in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16), compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce a much more balanced precision and recall (and thus a much better F1-score) even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with an F1-score > 80% using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences in API-naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning is. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 of the target-library training data, compared with the target-library model trained from scratch using


all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 of the training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 of the training data, compared with the Matplotlib model trained from scratch using all training data.

• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the fewest polysemous APIs (16%), while Pandas and NumPy have many more polysemous APIs. Both MPL → PD and MPL → NP produce high-quality target models even with 1/8 of the target-library training data. The transfer learning seems to be able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 of the training data to obtain a high-quality target model (F1-score 83.16%), and for PD → MPL, the Matplotlib model (F1-score 80.58%) obtained by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.70%). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs" learned from NumPy and Pandas to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.
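To make the fine-tuning procedure concrete, here is a minimal PyTorch-style sketch of within-language transfer with a reduced fraction of target-library training data. The model interface and the (inputs, labels) batch format are assumptions for illustration; the paper does not prescribe this exact code.

```python
import random
import torch

def fine_tune(source_model, target_train, fraction=1/16, epochs=10, lr=1e-3):
    """Fine-tune a source-library-trained model (any nn.Module that maps
    inputs to per-token logits) on a fraction of the target library's
    labeled posts. Training from scratch uses the same loop, but starts
    from randomly initialized parameters instead of the source weights."""
    subset = random.sample(target_train, max(1, int(len(target_train) * fraction)))
    optimizer = torch.optim.Adam(source_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in subset:
            optimizer.zero_grad()
            logits = source_model(*inputs)  # (batch, seq_len, n_labels)
            loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
            loss.backward()
            optimizer.step()
    return source_model
```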

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation: In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach: For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts from each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model. For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational databases). For JavaScript, we choose React (a library for web graphical user interfaces). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.

Results: Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

• Directly reusing the source-language-trained model on the

target-language text produces unacceptable performance. For

TABLE 10: Python → Java (Prec / Recall / F1, in %)

Data | Python→Java           | Java (from scratch)
1/1  | 77.45 / 66.56 / 71.60 | 84.69 / 55.31 / 66.92
1/2  | 72.20 / 62.50 / 67.06 | 77.38 / 53.44 / 63.22
1/4  | 69.26 / 50.00 / 58.08 | 71.22 / 47.19 / 56.77
1/8  | 50.00 / 64.06 / 56.16 | 75.15 / 39.69 / 51.94
1/16 | 55.71 / 48.25 / 52.00 | 75.69 / 34.06 / 46.98
1/32 | 56.99 / 40.00 / 44.83 | 77.89 / 23.12 / 35.66
DU   | 44.44 / 28.75 / 34.91 | –

TABLE 11: Python → JavaScript (Prec / Recall / F1, in %)

Data | Python→JavaScript     | JavaScript (from scratch)
1/1  | 77.45 / 81.75 / 86.19 | 87.95 / 78.17 / 82.77
1/2  | 86.93 / 76.59 / 81.43 | 87.56 / 59.84 / 77.70
1/4  | 86.84 / 68.65 / 74.24 | 83.08 / 66.26 / 73.72
1/8  | 81.48 / 61.11 / 69.84 | 85.98 / 55.95 / 67.88
1/16 | 71.11 / 63.49 / 68.08 | 87.38 / 38.48 / 53.44
1/32 | 66.67 / 52.38 / 58.67 | 65.21 / 35.71 / 46.15
DU   | 51.63 / 25.00 / 33.69 | –

the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still achieve reasonable performance, for example, NP → MPL (F1-score 69.16%), PD → MPL (64.00%), MPL → NP (64.71%). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91%), Python → JavaScript (33.69%), Python → C (35.95%). These results suggest that there is still a certain level of commonality between libraries with similar functionalities and of the same programming language, so that the knowledge learned in the source-library model can be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning exhibits the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 of the target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 of the Java training data boosts the F1-score by 50%, compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 of the target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-


TABLE 12: Python → C (Prec / Recall / F1, in %)

Data | Python→C              | C (from scratch)
1/1  | 89.87 / 85.09 / 87.47 | 85.83 / 83.74 / 84.77
1/2  | 85.35 / 83.73 / 84.54 | 86.28 / 76.69 / 81.21
1/4  | 83.83 / 75.88 / 79.65 | 82.19 / 71.27 / 76.34
1/8  | 74.32 / 73.71 / 74.01 | 78.68 / 68.02 / 72.97
1/16 | 75.81 / 69.45 / 72.60 | 88.52 / 57.10 / 69.42
1/32 | 69.62 / 65.85 / 67.79 | 87.89 / 45.25 / 59.75
DU   | 56.49 / 27.91 / 35.95 | –

language training data. For within-language transfer learning, achieving this result may require as little as 1/8 of the target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model using only half of the target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is API labeling errors in the dataset. In order to reduce errors, the first two authors first independently labeled the same data and then resolved the disagreements by discussion. Furthermore, in our experiments, we have to examine many API extraction results. We only spotted a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invested significant time and effort to prepare the datasets, conduct the experiments and analyze the results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and to informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API-related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across the different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from

Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs such as camelcase writing, special characters (e.g., dots or parentheses) and API annotations.
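As an illustration of such rule-based extraction, the simplified regular expressions below capture the camelcase, dotted-name and parentheses patterns just mentioned; they are our own toy examples, not the rules used in the cited works.

```python
import re

QUALIFIED = re.compile(r'\b\w+(?:\.\w+)+(?:\(\))?')  # e.g., df.apply()
CALL      = re.compile(r'\b\w+\(\)')                 # e.g., apply()
CAMELCASE = re.compile(r'\b[a-z]+(?:[A-Z]\w*)+\b')   # e.g., getItem

def rule_based_api_mentions(text):
    mentions = set()
    for pattern in (QUALIFIED, CALL, CAMELCASE):
        mentions.update(pattern.findall(text))
    return mentions

print(rule_based_api_mentions("You can call df.apply() or use apply on a series."))
# {'df.apply()', 'apply()'} -- the bare simple-name mention 'apply'
# is missed, which is exactly the limitation discussed below.
```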

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (islands) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF-based approach for extracting mentions of software entities, such as programming languages, libraries, computing concepts and APIs, from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features have superior performance over human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture on many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, "Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow."

[2] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, "Learning to extract API mentions from informal natural language discussions," in Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 2016, pp. 389-399.

[3] B. Dagenais and M. P. Robillard, "Recovering traceability links between an API and its learning resources," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 47-57.

[4] S. Subramanian, L. Inozemtseva, and R. Holmes, "Live API documentation," in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 643-652.

[5] C. Chen, Z. Xing, and X. Wang, "Unsupervised software-specific morphological forms inference from informal discussions," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 450-461.

[6] X. Chen, C. Chen, D. Zhang, and Z. Xing, "SEthesaurus: WordNet in software engineering," IEEE Transactions on Software Engineering, 2019.

[7] H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, "Improving API caveats accessibility by mining API caveats knowledge graph," in Software Maintenance and Evolution (ICSME), 2018 IEEE International Conference on. IEEE, 2018.

[8] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, "API method recommendation without worrying about the task-API knowledge gap," in Automated Software Engineering (ASE), 2018 33rd IEEE/ACM International Conference on. IEEE, 2018.

[9] B. Xu, Z. Xing, X. Xia, and D. Lo, "AnswerBot: automated generation of answer summary to developers' technical questions," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 706-716.

[10] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre, "Software-specific named entity recognition in software engineering social content," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 90-101.

[11] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. Association for Computational Linguistics, 2003, pp. 142-147.

[12] R. Caruana, "Learning many related tasks at the same time with backpropagation," in Advances in Neural Information Processing Systems, 1995, pp. 657-664.

[13] A. Arnold, R. Nallapati, and W. W. Cohen, "Exploiting feature hierarchy for transfer learning in named entity recognition," Proceedings of ACL-08: HLT, pp. 245-253, 2008.

[14] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320-3328.

[15] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, "APIReal: an API recognition and linking approach for online developer forums," Empirical Software Engineering, 2018.

[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69-78.

[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.

[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744-755.

[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.

[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.

[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions - incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338-348.

[22] C. Chen and Z. Xing, "SimilarTech: automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834-839.

[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155-1194, 2019.

[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214-224.

[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177-2185.

[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for Twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555-1565.

[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211-225, 2015.

[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.

[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.

[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: a deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.

[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414-421.

[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1-32, 2019.

[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.

[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17-36.

[39] S. J. Pan, Q. Yang et al., "A survey on transfer learning."

[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.

[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. ACM, 2010, pp. 375-384.

[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE'09. 16th Working Conference on. IEEE, 2009, pp. 205-214.

[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.

[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717-1724.

[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.

[47] A. Marcus, J. Maletic et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125-135.

[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59-68.

[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480-483.

[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307-318.

[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392-403.

[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832-841.

[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13-22.

[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476-479.

[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741-2749.

[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211-1220.

[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.

[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649-657.

[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.

[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.

[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Faculty of Information Technology at Monash University. He has research interests in the areas of software engineering, deep learning and human-computer interaction. He is currently focusing on improving the usability and accessibility of mobile applications. He received the BS and MS degrees from Beijing Technology and Business University and the Australian National University in 2016 and 2018, respectively. He will be a PhD student at Monash University in 2020 under the supervision of Chunyang Chen.

Zhenchang Xing is a Senior Lecturer in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012-2016. Before joining NTU, Dr. Xing was a Lee Kuan Yew Research Fellow in the School of Computing, National University of Singapore, from 2009-2012. Dr. Xing's current research is in the interdisciplinary areas of software engineering, human-computer interaction and applied data analytics. Dr. Xing has over 100 publications in peer-refereed journals and conference proceedings and has received several distinguished paper awards from top software engineering conferences. Dr. Xing regularly serves on the organization and program committees of the top software engineering conferences, and he will be the program committee co-chair for ICSME 2020.

Chunyang Chen is a lecturer (Assistant Professor) in the Faculty of Information Technology, Monash University, Australia. His research focuses on software engineering, deep learning and human-computer interaction. He has published over 16 papers in refereed journals or conferences, including Empirical Software Engineering, ICSE, ASE, CSCW, ICSME and SANER. He is a member of IEEE and ACM. He received the ACM SIGSOFT distinguished paper award in ASE 2018, the best paper award in SANER 2016, and the best tool demo in ASE 2016. https://chunyang-chen.github.io/


Cheng Chen received his Bachelor degree in Software Engineering from Northwest University (China) and a Master degree of Computing (major in Artificial Intelligence) from the Australian National University. He did internship projects on Natural Language Processing at CSIRO from 2016 to 2017. He worked as a research assistant supervised by Dr. Zhenchang Xing at ANU in 2018. Cheng currently works at PricewaterhouseCoopers (PwC) as a senior algorithm engineer in Natural Language Processing and Data Analysis. Cheng is interested in Named Entity Extraction, Relation Extraction, Text Summarization and parallel computing. He is working on knowledge engineering and transaction for NLP tasks.

Lizhen Qu is a research fellow in the Dialogue Research lab in the Faculty of Information Technology at Monash University. He has extensive research experience in the areas of natural language processing, multimodal learning, deep learning and cybersecurity. He is currently focusing on information extraction, semantic parsing and multimodal dialogue systems. Prior to joining Monash University, he worked as a research scientist at Data61/CSIRO, where he led and participated in several research and industrial projects, including Deep Learning for Text and Deep Learning for Cyber.

Guoqiang Li is an associate professor in the School of Software, Shanghai Jiao Tong University, and a guest associate professor at Kyushu University. He received the BS, MS and PhD degrees from Taiyuan University of Technology, Shanghai Jiao Tong University, and the Japan Advanced Institute of Science and Technology in 2001, 2005 and 2008, respectively. He worked as a postdoctoral research fellow in the Graduate School of Information Science, Nagoya University, Japan, during 2008-2009, as an assistant professor in the School of Software, Shanghai Jiao Tong University, during 2009-2013, and as an academic visitor in the Department of Computer Science, University of Oxford, during 2015-2016. His research interests include formal verification, programming language theory, and data analytics and intelligence. He has published more than 40 research papers in international journals and mainstream conferences, including TDSC, SPE, TECS, IJFCS, SCIS, CSCW, ICSE, FORMATS and ATVA.



Fig. 1: Illustrating the API Extraction Task

library often requires several hundreds of manually labeled sentences mentioning this library's APIs [2]. The effort to prepare training data for hundreds of libraries would be prohibitive. Furthermore, it may also be difficult to prepare sufficient high-quality training data for the APIs of some less frequently discussed libraries or frameworks.

Another related challenge is to select effective features for a machine learning model to recognize a particular library's APIs. Although developers follow general naming conventions, the orthographic features of APIs still vary greatly from one library to another. For example, as reported in Section 2, different libraries have different percentages of polysemous API names. Furthermore, users of some libraries tend to mention APIs with clear orthographic features (e.g., package names, brackets and/or dots), while users of other libraries tend to directly mention API simple names. The functionalities of software libraries also vary greatly, such as graphical user interfaces, numeric computation, machine learning, data visualization and database access. As such, the discussion contexts of one library's APIs, like Pandas (a Python machine learning library), often differ from those of another library's APIs, like JDBC (a Java database access library).

Designers of a machine learning based API extraction model have to manually select the most effective features for different libraries' APIs. This is a challenging task, as there are dozens of features to choose from.1 Unsupervised word embeddings have been explored for API extraction tasks [2], but there has been no work on exploiting character- and sentence-context embeddings from input texts for API extraction. Furthermore, some advanced features to boost API extraction performance, such as word clusters and API gazetteers, have to be hand-crafted. Without such advanced features, existing machine learning based API extraction methods perform poorly using only orthographic features from the input texts [2].

Easy deployment means that the model can be easily trained for different datasets without requiring any manual feature engineering. To make machine learning based API extraction methods easy to deploy in practice, we must reduce the overhead of preparing training data and effective features, and remove the dependence on additional features beyond the input texts. In this paper, we design a neural architecture for API extraction in informal software text. Our neural architecture is composed of a character-level convolutional neural network (CNN), word-

1. https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html

level embeddings, and a sentence-level Bi-directional Long Short-Term Memory (Bi-LSTM) network for automatically learning character-, word- and sentence-level features from input texts, respectively. This neural architecture can be trained in an end-to-end fashion, thus removing the need for manual feature engineering and the need for additional features beyond input texts, and greatly reducing the amount of new training data needed for adapting a model to different libraries.

Furthermore, our analysis of the API-naming and API-mention characteristics suggests that the characteristics of API names, API mentions and discussion contexts differ across libraries, but they also share a certain level of commonalities. To exploit such commonalities for the easy deployment of an API extraction model, we adopt transfer learning [12], [13] to fine-tune a model trained with a source library's API discussion texts to a target library. This helps reduce the amount of training data required for training a high-quality target-library model, compared with training the model from scratch with randomly initialized model parameters. The design of our multi-level neural architecture enables the fine-tuning of different levels of features in transfer learning.

We conduct extensive experiments to evaluate the performance of the proposed neural architecture for API extraction, as well as the effectiveness of transfer learning. Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As discussed in Section 2, these six libraries support diverse functionalities and have different API-naming, API-mention and discussion-context characteristics. We manually label API mentions in 3600 Stack Overflow posts (600 for each library) for the experiments. Our experiments confirm the effectiveness of our neural architecture in learning multi-level features from the input texts, and show that the learned features can support high-quality API extraction in informal software texts without the need for additional hand-crafted features beyond the input texts. Our experiments also confirm the effectiveness of transfer learning [14] in boosting the target-library model performance with much less training data, even in few-shot (about 10 posts) training settings.

This paper makes the following four contributions:

• Our work is the first one to consider not only the performance of machine learning based API extraction methods, but also the easy deployment of such methods for the software text of multiple programming languages and libraries.

• We propose a multi-layer neural architecture to automatically learn to extract effective features from the input texts for API extraction, thus removing the need for manual feature engineering as well as the dependence on features beyond the input texts.

• We adopt transfer learning to reduce the overhead of manual labeling of the training data of a subject library. We evaluate the effectiveness of transfer learning across libraries and programming languages, and analyze the factors that affect its effectiveness.

• We conduct extensive experiments to evaluate our architecture as a whole as well as its components. Our results reveal insights into the design of effective mechanisms for API extraction tasks.


TABLE 1: Statistics of Polysemous APIs

Library    | APIs | Polysemous APIs | Percentage
Matplotlib | 3877 | 622             | 16.04%
Pandas     | 774  | 426             | 55.04%
NumPy      | 2217 | 917             | 41.36%
OpenGL     | 850  | 52              | 6.12%
React      | 238  | 157             | 65.97%
JDBC       | 1468 | 633             | 43.12%

The remainder of the paper is organized as follows. Section 2 reports our empirical studies of the characteristics of API names, API mentions and discussion contexts. Section 3 defines the problem of API extraction. Section 4 and Section 5 describe our neural architecture for API extraction and the system implementation, respectively. Section 6 reports our experiment results and findings. Section 7 reviews the related work. Section 8 concludes our work and discusses future work.

2 EMPIRICAL STUDIES OF API-NAMING AND API-MENTION CHARACTERISTICS

In this work, we aim to develop a machine learning based API extraction method that is not only effective but also easy to deploy across programming languages and libraries. To understand the challenges in achieving this objective and the potential solution space, we conduct empirical studies of the characteristics of API names, API mentions in informal texts, and the discussion contexts in which APIs are mentioned.

We study six libraries: three Python libraries, Matplotlib (data visualization), Pandas (machine learning) and NumPy (numeric computation); one C library, OpenGL (computer graphics); one JavaScript library, React (graphical user interface); and one Java library, JDBC (database access). These libraries come from four popular programming languages and they support very diverse functionalities for computer programming.

First, we crawl the API declarations of these libraries from their official websites. When different APIs have the same simple name but different arguments in the same library, we treat such APIs as the same. We examine each API name to determine if the simple name of an API is a common word (e.g., apply, series, draw) that can be found in a general English dictionary. We find that different libraries have different percentages of APIs with common-word simple names: OpenGL (6%), Matplotlib (16%), NumPy (41%), JDBC (43%), Pandas (55%), React (66%). When these APIs are mentioned by their common-word simple names, neither character- nor word-level features can help to distinguish such polysemous API mentions from common words. We must resort to the discussion contexts of API mentions.
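A minimal sketch of this common-word check is shown below; the tiny hard-coded word set stands in for a full English dictionary and is purely illustrative.

```python
# An API simple name is "polysemous" if it is also a general English word.
ENGLISH_WORDS = {"apply", "series", "draw", "figure", "query", "event"}  # toy dictionary

def is_common_word_api(qualified_name):
    simple_name = qualified_name.split(".")[-1].lower()
    return simple_name in ENGLISH_WORDS

for api in ["pandas.DataFrame.apply", "numpy.linalg.svd", "matplotlib.pyplot.draw"]:
    print(api, "->", is_common_word_api(api))
# pandas.DataFrame.apply -> True
# numpy.linalg.svd -> False
# matplotlib.pyplot.draw -> True
```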

Second, by checking post tags, we randomly sample 200 Stack Overflow posts for each of the six libraries. We manually label the API mentions in these posts. We examine three characteristics of API mentions: whether API mentions contain explicit orthographic features (package or module names, parentheses and/or dots), whether API mentions are long tokens (> 10 characters), and whether the context windows (the preceding and succeeding 5 words) around the API mentions contain common verbs and nouns (use, call,

TABLE 2: Statistics of API-Mention Characteristics

Library    | Orthographic | Long tokens | Common context words
Matplotlib | 62.38%       | 21.56%      | 20.64%
Pandas     | 67.11%       | 32.22%      | 34.23%
NumPy      | 65.63%       | 26.87%      | 23.53%
OpenGL     | 33.73%       | 39.36%      | 20.80%
React      | 75.56%       | 20.00%      | 7.93%
JDBC       | 26.36%       | 61.82%      | 8.11%
Average    | 55.13%       | 33.64%      | 19.21%

function, method). Table 2 shows our analysis results. On average, 55.13% of API mentions contain explicit orthographic features and 33.64% of API mentions are long tokens. Character-level features would be useful for recognizing these API mentions. However, for the significant amount of API mentions that do not have such explicit character-level features, we need to resort to word- and/or sentence-context features, for example, the words (e.g., use, call, function, method) that often appear in the context window of an API mention, to recognize API mentions.

Furthermore, we can observe that API-mention characteristics are not tightly coupled with a particular programming language or library. Instead, all six libraries exhibit a certain degree of the three API-mention characteristics, but the specific degrees vary across libraries. Fig. 2 visualizes the top 50 frequently-used words in the discussions of the six libraries. We can observe that discussions of different libraries share common words, but at the same time use library-specific words (e.g., dataframe for Pandas, matrix for NumPy, figure for Matplotlib, query for JDBC, render for OpenGL, event for React). The commonalities of API-mention characteristics across libraries indicate the feasibility of transfer learning. For example, orthographic, word and/or sentence-context features learned from a source library could be applicable to a target library. However, due to the variations of API-naming, API-mention and discussion-context characteristics across libraries, directly applying the source-library-trained model to the target library may suffer from performance degradation, unless the source and target libraries have very similar characteristics. Therefore, fine-tuning the source-library-trained model with a small amount of target-library text would be necessary.

3 PROBLEM DEFINITION

In this work, we take as input informal software text (e.g., Stack Overflow posts) that discusses the usage and issues of a particular library. We assume that the software text of multiple programming languages and libraries needs to be processed. Given a paragraph of informal software text, our task is to recognize all API mentions (if any) in the paragraph, as illustrated in the example in Fig. 1. API mentions refer to tokens in the paragraph that represent public modules, classes, methods or functions of a particular library. To preserve the integrity of code-like tokens, an API mention is defined as a single token rather than a span of tokens when the given text is tokenized properly.
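A minimal example of this formulation, with an illustrative sentence in the spirit of Fig. 1 (the sentence and labels are our own), looks as follows:

```python
# Each properly tokenized token receives an API or non-API ('O') label.
tokens = ["if", "you", "run", "apply", "on", "a", "series", "it", "gets", "called"]
labels = ["O",  "O",   "O",   "API",   "O",  "O", "API",    "O",  "O",    "O"]
assert len(tokens) == len(labels)
```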

As our input is informal software text, APIs may not be consistently mentioned by their formal full names. Instead, APIs may be mentioned by abbreviations or synonyms, such as pandas's dataframe for pandas.DataFrame, df.apply for


Fig. 2: Word Clouds of the Top 50 Frequent Words in the Discussions of the Six Libraries

pandas.DataFrame.apply(). APIs may also be mentioned by their simple names, such as apply, series, dataframe, which can also be common English words or computing terms in the text. Furthermore, we do not assume that API mentions will be consistently annotated with special tags. Therefore, our approach takes plain text as input.

A related task to our work is API linking. API extraction methods classify whether a token in text is an API mention or not, while API linking methods link API mentions in text to API entities in a knowledge base [15]. That is, API extraction is the prerequisite for API linking. This work deals with only API extraction.

4 OUR NEURAL ARCHITECTURE

We formulate the task of API extraction in informal software texts as a sequence labeling problem and present a neural architecture that labels each token in an input text as API or non-API. As shown in Fig. 3, our neural architecture is composed of four main components: 1) a character-level Convolutional Neural Network (CNN) for extracting character-level features of a token (Section 4.1); 2) an unsupervised word embedding model for learning word semantics of a token (Section 4.2); 3) a Bidirectional Long Short-Term Memory network (Bi-LSTM) for extracting sentence-context features (Section 4.3); and 4) a softmax classifier for predicting the API or non-API label of a token (Section 4.4). Our neural model can be trained end-to-end with pairs of input texts and their corresponding API/non-API label sequences. A model trained with one library's text can be transferred to another library's text by fine-tuning

Fig. 3: Our Neural Architecture for API Extraction

the source-library-trained components with the target library's text (Section 4.5).
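The sketch below shows one way the four components could fit together in PyTorch. All dimensions, layer choices and names are illustrative assumptions for exposition, not the authors' exact implementation (the actual configuration is given in Section 5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APIExtractionModel(nn.Module):
    """Illustrative sketch of the four components in Fig. 3: char-level
    CNN, word embeddings, sentence-context Bi-LSTM, and a per-token
    classifier (softmax is applied by the cross-entropy loss)."""
    def __init__(self, n_chars=92, n_words=10000, d_char=30, n_filters=30,
                 kernel=3, d_word=100, d_lstm=100, n_labels=2):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char)                       # 1) char CNN
        self.char_cnn = nn.Conv1d(d_char, n_filters, kernel, padding=kernel // 2)
        self.word_emb = nn.Embedding(n_words, d_word)                       # 2) word embeddings
        self.bilstm = nn.LSTM(n_filters + d_word, d_lstm,                   # 3) Bi-LSTM
                              bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * d_lstm, n_labels)                          # 4) classifier

    def forward(self, char_ids, word_ids):
        # char_ids: (batch, seq_len, max_chars); word_ids: (batch, seq_len)
        b, s, m = char_ids.shape
        c = self.char_emb(char_ids.view(b * s, m)).transpose(1, 2)  # (b*s, d_char, m)
        c = F.relu(self.char_cnn(c)).max(dim=2).values              # max-pool over chars
        token_feats = torch.cat([c.view(b, s, -1), self.word_emb(word_ids)], dim=-1)
        ctx, _ = self.bilstm(token_feats)                           # sentence context
        return self.out(ctx)                                        # per-token logits

model = APIExtractionModel()
logits = model(torch.randint(0, 92, (1, 13, 20)), torch.randint(0, 10000, (1, 13)))
print(logits.shape)  # torch.Size([1, 13, 2])
```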

4.1 Extracting Char-Level Features by CNN

API mentions often have morphological features that distinguish them from normal English words. Such morphological features may appear at the beginning of a token (e.g., the first-letter capitalization in System), in the middle (e.g., the hyphen in read_csv, the dot in pandas.series, the left


Fig. 4: Our Character Vocabulary

Fig. 5: Character-Level CNN

parenthesis and comma in print(a,32)), or at the end (e.g., the right parenthesis in apply()). Morphological features may also appear in combination, such as the camelcase writing in AbstractAction, or a pair of parentheses as in plot(). The long length of some tokens, like create_dataset, is one important morphological feature as well. Due to the lack of a universal naming convention across libraries and the wide presence of informal writing forms, the informative morphological features of API mentions often vary from one library's text to another.

Robust methods to extract morphological features from tokens must take into account all characters of the token and determine which features are more important for a particular library's APIs [16]. To that end, we use a character-level CNN [17], which extracts local features in N-gram characters of the token using a convolution operation, and then combines them using a max-pooling operation to create a fixed-sized character-level embedding of the token [18], [19].

Let $V^{char}$ be the vocabulary of characters for the software texts from which we want to extract API mentions. In this work, $V^{char}$ for all of our models consists of 92 characters, including 26 English letters (both upper and lower case), 10 digits, and 30 other characters (e.g., quotes, brackets and punctuation marks), as listed in Fig. 4. Note that $V^{char}$ can be easily extended for different datasets. Let $E^{char} \in \mathbb{R}^{d^{char} \times |V^{char}|}$ be the character embedding matrix, where $d^{char}$ is the dimension of character embeddings and $|V^{char}|$ is the vocabulary size (92 in this work). As illustrated in Fig. 3, $E^{char}$ can be regarded as a dictionary of character embeddings, in which each column, a $d^{char}$-dimensional vector, corresponds to a particular character. The character embeddings are initialized as one-hot vectors and then learned during the training of the character-level CNN. Given a character $c \in V^{char}$, its embedding $e^c$

can be retrieved by the matrix-vector product $e^c = E^{char} v^c$,

where $v^c$ is a one-hot vector of size $|V^{char}|$, which has value 1 at index $c$ and zero in all other positions.

Fig. 5 presents the architecture of our character-level CNN. Given a token $w$ in the input text, let's assume $w$ is composed of $M$ characters $c_1, c_2, \ldots, c_M$. We first obtain a sequence of character embeddings $e^{c_1}, e^{c_2}, \ldots, e^{c_M}$ by

looking up the character embedding matrix $E^{char}$. This sequence of character embeddings (with zero-padding at the beginning and the end of the sequence) is the input to our character-level CNN. In our application of the CNN, because each character is represented as a $d^{char}$-dimensional vector, we use convolution filters with widths equal to the dimensionality of the character embeddings (i.e., $d^{char}$). Then we can vary the height $h$ (or window size) of the filter, i.e., the number of adjacent characters considered jointly in the convolution operation.

Let z_m be the concatenation of the character embeddings of c_m (1 ≤ m ≤ M), the (h−1)/2 left neighbors of c_m, and the (h−1)/2 right neighbors of c_m. A convolution operation involves a filter W ∈ R^{h·d^char} (a matrix of h × d^char dimensions) and a bias term b ∈ R, which is applied repeatedly to each character window of size h in the input sequence c_1, c_2, ..., c_M:

o_m = ReLU(W^T · z_m + b)

where ReLU(x) = max(0, x) is the non-linear activation function. The convolution operations produce an M-dimensional feature map for a particular filter. A 1D max-pooling operation is then applied to the feature map to extract a scalar (i.e., a feature vector of length 1) with the maximum value in the M dimensions of the feature map.

The convolution operation extracts local features within each character window of the given token, and using the max over all character windows of the token, we extract a global feature for the token. We can apply N filters to extract different features from the same character window. The output of the character-level CNN is an N-dimensional feature vector representing the character-level embedding of the given token. We denote this embedding as e^char_w for the token w.

In our character-level CNN, the matrices E^char and W and the vector b are parameters to be learned. The dimensionality of the character embedding d^char, the number of filters N, and the window size of the filters h are hyper-parameters to be chosen by the user (see Section 5.2 for model configuration).
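To make this concrete, here is a minimal PyTorch sketch of such a character-level CNN, a reconstruction under the settings in Section 5.2 (d^char = 92, N = 40, h = 3), not the authors' released code:

import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level CNN: convolution over character n-grams + 1D max-pooling."""
    def __init__(self, vocab_size=92, d_char=92, n_filters=40, h=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_char)
        self.embed.weight.data.copy_(torch.eye(vocab_size, d_char))  # one-hot initialization
        # Each filter spans the full embedding dimension and h adjacent characters
        self.conv = nn.Conv1d(d_char, n_filters, kernel_size=h, padding=h // 2)

    def forward(self, char_ids):                  # char_ids: (batch, M) character indices
        x = self.embed(char_ids).transpose(1, 2)  # (batch, d_char, M)
        feature_map = torch.relu(self.conv(x))    # (batch, n_filters, M)
        e_char, _ = feature_map.max(dim=2)        # max over all character windows
        return e_char                             # (batch, n_filters): char-level embedding

token = torch.randint(0, 92, (1, 10))             # a 10-character token (random indices)
print(CharCNN()(token).shape)                     # torch.Size([1, 40])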

4.2 Learning Word Semantics by Word Embedding

In informal software texts, the same API is often mentioned in many non-standard abbreviations and synonyms [5]. For example, the Pandas library is often written as pd, and its module DataFrame is often written as df. Furthermore, there is a lack of consistent use of verb, noun and preposition in the discussions [2]. For example, in the sentences "I have decided to use apply ...", "if you run apply on a series ...", and "I tested with apply ...", users refer to Pandas's method apply(), but their descriptions vary greatly.

Such variations result in an out-of-vocabulary (OOV) issue for a machine learning model [20], i.e., variations that have not been seen in the training data. For the easy deployment of a machine learning model for API extraction, it is impractical to address the OOV issue by manually labeling a huge amount of data and developing a comprehensive gazetteer of API names and their common variations [2]. However, without the knowledge about variations of semantically similar words, the trained model will be restricted to the examples that it sees in the training data.

To address this dilemma, we propose to exploit unsupervised word-embedding models [21], [22], [23], [24] to learn distributed word representations from a large amount of unlabeled text discussing a particular library. Many studies [25], [26], [27] have shown that distributed word representations can capture rich semantics of words, such that semantically similar words will have similar word embeddings.

Let V^word be the vocabulary of words for the corpus of software texts to be processed. As illustrated in Fig. 3, word-level embeddings are encoded by column vectors in a word embedding matrix E^word ∈ R^{d^word × |V^word|}, where d^word is the dimension of word embeddings and |V^word| is the vocabulary size. Each column E^word_i ∈ R^{d^word} corresponds to the word-level embedding of the i-th word in the vocabulary V^word. We can obtain a token w's word-level embedding e^word_w by looking up E^word with the word w, i.e., e^word_w = E^word · v_w, where v_w is a one-hot vector of size |V^word| which has value 1 at index w and zero in all other positions. The matrix E^word is to be learned using unsupervised word embedding models (e.g., GloVe [28]), and d^word is a hyper-parameter to be chosen by the user (see Section 5.2 for model configuration).

In this work, we adopt the Global Vectors for Word Representation (GloVe) method [28] to learn the matrix E^word. GloVe is an unsupervised algorithm for learning word representations based on the statistics of word co-occurrences in an unlabeled text corpus. It calculates the word embeddings based on a word co-occurrence matrix X. Each row in the word co-occurrence matrix corresponds to a word and each column corresponds to a context. X_ij is the frequency of word i co-occurring with word j, and X_i = Σ_k X_ik (1 ≤ k ≤ |V^word|) is the total number of occurrences of word i in the corpus. The probability of word j occurring in the context of word i is P_ij = P(j|i) = X_ij / X_i. We have log(P_ij) = log(X_ij) − log(X_i).

GloVe defines log(P_ij) = e^T_{w_i} · e_{w_j}, where e_{w_i} and e_{w_j} are the word embeddings to be learned for the words w_i and w_j. This gives the constraint for each word pair as log(X_ij) = e^T_{w_i} · e_{w_j} + b_i + b_j, where b is the bias term for e_w. The cost function for minimizing the loss of word embeddings is defined as:

J = Σ_{i,j=1}^{|V^word|} f(X_ij) (e^T_{w_i} · e_{w_j} + b_i + b_j − log(X_ij))^2

where f(X_ij) is a weighting function. That is, GloVe learns word embeddings by a weighted least squares regression model.
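To illustrate the objective, it can be written directly in NumPy as below. This follows the paper's simplified single-embedding formulation rather than the reference GloVe implementation, and the weighting function's cutoff x_max = 100 and exponent α = 0.75 are assumptions taken from the commonly used GloVe defaults:

import numpy as np

def glove_loss(X, E, b, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective.
    X: (V, V) co-occurrence counts; E: (V, d) word embeddings; b: (V,) bias terms."""
    i, j = np.nonzero(X)                               # only co-occurring pairs contribute
    f = np.minimum(X[i, j] / x_max, 1.0) ** alpha      # weighting function f(X_ij)
    residual = (E[i] * E[j]).sum(axis=1) + b[i] + b[j] - np.log(X[i, j])
    return np.sum(f * residual ** 2)

# Toy usage with hypothetical sizes
rng = np.random.default_rng(0)
X = rng.integers(0, 5, (6, 6)).astype(float)
print(glove_loss(X, rng.normal(size=(6, 4)), np.zeros(6)))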

4.3 Extracting Sentence-Context Features by Bi-LSTM

In informal software texts, many API mentions cannot be reliably recognized by simply examining a token's character-level features and word semantics. This is because many APIs are named using common English words (e.g., series, apply, plot) or common computing terms (e.g., dataframe, sigmoid, histogram, argmax, zip, list). When such APIs are mentioned by their simple name, this results in a common-word polysemy issue for API extraction [2]. In such situations, we have to disambiguate the API sense of a common word from the normal sense of the word.

To that end, we have to look into the sentence context in which an API is mentioned. For example, by looking into the sentence context of the two sentences in Fig. 1, we can determine that the "apply" in the first sentence is an API mention, while the "apply" in the second sentence is not an API mention. Note that both the preceding and succeeding context of the token "apply" are useful for disambiguating the API or the normal sense of the token.

We use a Recurrent Neural Network (RNN) to extract sentence-context features for disambiguating the API or the normal sense of the word [29]. RNN is a class of neural networks where connections between units form directed cycles, and it is widely used in the software engineering domain [30], [31], [32], [33]. Due to this nature, it is especially useful for tasks involving sequential inputs [29], like sentences. In our task, we adopt a Bidirectional RNN (Bi-RNN) architecture [34], [35], which is composed of two LSTMs: one takes input from the beginning of the text forward till a particular token, while the other takes input from the end of the text backward till that token.

The input to an RNN is a sequence of vectors. In our task, we obtain the input vector of a token in the input text by concatenating the character-level embedding and the word-level embedding of the token w, i.e., e^char_w ⊕ e^word_w. An RNN recursively maps an input vector x_t and a hidden state h_{t−1} to a new hidden state h_t: h_t = f(h_{t−1}, x_t), where f is a non-linear activation function (e.g., an LSTM unit used in this work). A hidden state is a vector e^sent ∈ R^{d^sent} summarizing the sentence-level features till the input x_t, where d^sent is the dimension of the hidden state vector to be chosen by the user. We denote e^sent_f and e^sent_b as the hidden states computed by the forward LSTM and the backward LSTM after reading the end of the preceding and the succeeding sentence context of a token, respectively. e^sent_f and e^sent_b are concatenated into one vector e^sent_w as the Bi-LSTM output for the corresponding token w in the input text.

As an input text (e.g., a Stack Overflow post) can be a long text, modeling long-range dependencies in the text is crucial for our task. For example, a mention of a library name at the beginning of a post could be important for detecting a mention of a method of this library later in the post. Therefore, we adopt the LSTM unit [36], [37] in our RNN. The LSTM is designed to cope with the gradient vanishing problem in RNN. An LSTM unit consists of a memory cell and three gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, some contexts can be cleared by the forget gate. The memory cell and the three gates have weights and bias terms to be learned during model training. Bi-LSTM can extract the context feature. For instance, in the Pandas library sentence "This can be accomplished quite simply with the DataFrame method apply", based on the context information from the word method, the Bi-LSTM can help our model classify the word apply as an API mention, and the learning can be transferred across different languages and libraries with transfer learning.
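A minimal PyTorch sketch of this sentence-context extractor follows, with 50 hidden units per direction as in Section 5.2; the 240-dimensional input (the 40-dimensional char-level embedding ⊕ the 200-dimensional word embedding) is inferred from Section 5.2 and is our assumption:

import torch
import torch.nn as nn

class SentenceBiLSTM(nn.Module):
    """Bi-LSTM over a sentence; each token vector is e_char concatenated with e_word."""
    def __init__(self, d_in=240, d_sent=50):
        super().__init__()
        self.bilstm = nn.LSTM(d_in, d_sent, bidirectional=True, batch_first=True)

    def forward(self, x):             # x: (batch, seq_len, d_in)
        out, _ = self.bilstm(x)       # forward and backward hidden states, concatenated
        return out                    # (batch, seq_len, 2*d_sent): e_sent for every token

sentence = torch.randn(1, 12, 240)       # a 12-token sentence (random stand-in features)
print(SentenceBiLSTM()(sentence).shape)  # torch.Size([1, 12, 100])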

4.4 API Labeling by Softmax Classification

Given the Bi-LSTM's output vector e^sent_w for a token w in the input text, we train a binary softmax classifier to predict whether the token is an API mention or a normal English word. That is, we have two classes in this work, i.e., API or non-API. Softmax predicts the probability for the j-th class given a token's vector e^sent_w by:

P(j | e^sent_w) = exp(e^sent_w · W^T_j) / Σ^2_{k=1} exp(e^sent_w · W^T_k)

where the vectors W_k (k = 1 or 2) are parameters to be learned.
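In code, this classifier is a single linear layer (without bias, matching the formula above) followed by a softmax; a sketch with the 100-dimensional Bi-LSTM output of Section 5.2:

import torch
import torch.nn as nn

classifier = nn.Linear(100, 2, bias=False)   # rows are W_1 (API) and W_2 (non-API)
e_sent_w = torch.randn(1, 100)               # Bi-LSTM output vector for one token
probs = torch.softmax(classifier(e_sent_w), dim=-1)
print(probs)                                 # [P(API | token), P(non-API | token)]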

4.5 Library Adaptation by Transfer Learning

Transfer learning [12], [38] is a machine learning method that stores knowledge gained while solving one problem and applies it to a different but related problem. It helps to reduce the amount of training data required for the target problem. When transfer learning is used in deep learning, it has been shown to be beneficial for narrowing down the scope of possible models on a task by using the weights of a model trained on a different but related task [14], and with shared parameters, transfer learning can help the model adapt the shared knowledge in similar contexts [39].

API extraction tasks for different libraries can be regarded as a set of different but related tasks. Therefore, we use transfer learning to adapt a neural model trained with one library's text (source-library-trained model) for another library's text (target library). Our neural architecture is composed of four main components: character-level CNN, word embeddings, sentence Bi-LSTM and softmax classifier. We can transfer all or some source-library-trained model components to a target-library model. Without transfer learning, the parameters of a target-library model component will be randomly initialized and then learned using the target-library training data, i.e., trained from scratch. With transfer learning, we use the parameters of source-library-trained model components to initialize the parameters of the corresponding target-library model components. After transferring the model parameters, we can either freeze the transferred parameters or fine-tune them using the target-library training data.
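In PyTorch terms, the component-wise transfer might look as follows (a sketch; build_model and the component names char_cnn, word_embed, bilstm, classifier are hypothetical):

import torch

target_model = build_model()                      # hypothetical constructor of the four components
source_state = torch.load("pandas_model.pt")      # parameters of a source-library-trained model
target_model.load_state_dict(source_state)        # initialize target model with source parameters

# Freeze a transferred component (e.g., the character-level CNN) ...
for p in target_model.char_cnn.parameters():
    p.requires_grad = False

# ... or fine-tune the remaining parameters on the target-library training data
optimizer = torch.optim.Adam(p for p in target_model.parameters() if p.requires_grad)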

5 SYSTEM IMPLEMENTATION

This section describes the current implementation² of our neural architecture for API extraction.

5.1 Preprocessing and Tokenizing Input Text

Our current implementation takes as input the content of a Stack Overflow post. As we want to recognize API mentions within natural language sentences, we remove stand-alone code snippets in <pre><code> tags. We then remove all HTML tags in the post content to obtain the plain-text input (see Section 3 for the justification of taking plain text as input). We develop a sentence parser to split the pre-processed post text into sentences by punctuation. Following [2], we develop a software-specific tokenizer to tokenize the sentences. This tokenizer preserves the integrity of code-like tokens. For example, it treats matplotlib.pyplot.imshow() as a single token, instead of the sequence of 7 tokens "matplotlib", ".", "pyplot", ".", "imshow", "(", ")" produced by general English tokenizers.

2. https://github.com/JOJO201/API_Extraction
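A simple regex-based sketch of such a software-specific tokenizer (the pattern is our illustration, not the authors' exact implementation):

import re

# Keep code-like tokens (dotted paths, underscores, call parentheses) intact;
# otherwise fall back to plain words and single punctuation marks.
TOKEN = re.compile(r"[A-Za-z_][\w.]*\(\)?|[A-Za-z_][\w.]*\w|\w+|[^\w\s]")

print(TOKEN.findall("You can use matplotlib.pyplot.imshow() to show an image."))
# ['You', 'can', 'use', 'matplotlib.pyplot.imshow()', 'to', 'show', 'an', 'image', '.']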

5.2 Model Configuration

We now describe the hyper-parameter settings used in our current implementation. These hyper-parameters are tuned using the validation data during model training (see Section 6.1.2 for the description of our dataset). We find that the neural model has very similar performance across the six libraries' datasets with the same hyper-parameters. Therefore, we keep the same hyper-parameters for all six libraries' datasets, which also avoids the difficulty of tuning different hyper-parameters per library.

5.2.1 Character-level CNN

We set the filter window size h = 3. That is, the convolution operation extracts local features from 3 adjacent characters at a time. The size of our current character vocabulary V^char is 92. Thus, we set the dimensionality of the character embedding d^char at 92. We initialize the character embeddings with one-hot vectors (1 at one character index and zero in all other dimensions). The character embeddings will be updated through back-propagation during the training of the character-level CNN. We experiment with 5 different values of N (the number of filters): 20, 40, 60, 80, 100. With N = 40, the CNN has an acceptable performance on the validation data. With N = 60 and above, the CNN has almost the same performance as N = 40, but it takes more training epochs to reach the acceptable performance. Therefore, we use N = 40. That is, the character-level embedding of a token has the dimension 40.

5.2.2 Pre-trained word embeddings

Our experiments involve six libraries: Pandas, Matplotlib, NumPy, OpenGL, React and JDBC. We collect all questions tagged with these six libraries and all answers to these questions in the Stack Overflow data dump released on March 18, 2018. We obtain a text corpus of 380,971 posts. We use the same preprocessing and tokenization steps as described in Section 5.1 to preprocess and tokenize the content of these posts. Then we use GloVe [28] to learn word embeddings from this text corpus. We set the word vocabulary size |V^word| at 40,000. The training epoch is set at 100 to ensure sufficient training of the word embeddings. We experiment with four different dimensions of word embeddings d^word: 50, 100, 200, 400. We use d^word = 200 in our current implementation, as it produces a good balance between the training efficiency and the quality of word embeddings for API extraction on the validation data. We also experimented with word embeddings pre-trained on all Stack Overflow posts, which does not significantly affect the API extraction performance but requires much longer time for text preprocessing and word embedding learning.


5.2.3 Sentence-context Bi-LSTM

For the RNN, we use 50 hidden LSTM units to store the hidden states. The dimension of the hidden state vector is 50. Therefore, the dimension of the output vector e^sent_w is 100 (concatenating forward and backward hidden states). In order to mitigate overfitting [40], a dropout layer is added on the output of the Bi-LSTM, with the dropout rate 0.5.

5.3 Model Training

To train our neural model, we use input texts and their corresponding sequences of API/non-API labels (see Section 6.1.2 for our data labeling process). The optimizer used is Adam [41], which performs well for training RNNs including Bi-LSTM [35]. The training epoch is set at 40. The best-performing model on the validation set is saved for testing. This model's parameters are also saved for the transfer learning experiments.
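A condensed sketch of this training regime (40 epochs, Adam, keep the best validation model; the data loaders and the evaluate_f1 helper are stand-ins):

import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=40):
    """Token-level API/non-API training with Adam; saves the best validation model."""
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    best_f1, best_state = 0.0, None
    for _ in range(epochs):
        model.train()
        for tokens, labels in train_loader:        # labels: 0 = non-API, 1 = API
            optimizer.zero_grad()
            loss = loss_fn(model(tokens), labels)  # model outputs (n_tokens, 2) logits
            loss.backward()
            optimizer.step()
        f1 = evaluate_f1(model, val_loader)        # hypothetical helper (see Section 6.1.3)
        if f1 > best_f1:
            best_f1, best_state = f1, copy.deepcopy(model.state_dict())
    return best_state                              # reused later for transfer learning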

6 EXPERIMENTS

We conduct a series of experiments to answer the following four research questions:

• RQ1: How well can our neural architecture learn multi-level features from the input texts? Can the learned features support high-quality API extraction compared with existing machine learning based API extraction methods?

• RQ2: What is the impact of the three feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) on the API extraction performance?

• RQ3: How effective is transfer learning for adapting API extraction models across libraries of the same language with different amounts of target-library training data?

• RQ4: How effective is transfer learning for adapting API extraction models across libraries of different languages with different amounts of target-library training data?

6.1 Experiment Setup

This section describes the libraries used in our experiments, how we prepare training and testing data, and the evaluation metrics of API extraction performance.

6.1.1 Studied libraries

Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As reported in Section 2, these six libraries support very diverse functionalities for computer programming and have distinct API-naming and API-mention characteristics. Using these libraries, we can evaluate the effectiveness of our neural architecture for API extraction in very diverse data settings. We can also gain insights into the effectiveness of transfer learning for API extraction and the transferability of our neural architecture in different settings.

6.1.2 Dataset

We collect Stack Overflow posts (questions and their answers) tagged with pandas, numpy, matplotlib, opengl, react or jdbc as our experimental data. We use the Stack Overflow data dump released on March 18, 2018. In this data dump, 380,971 posts are tagged with one of the six studied libraries. Our data collection process follows four criteria. First, the number of posts selected and the number of API mentions in these posts should be at the same order of magnitude. Second, API mentions should exhibit the variations of API writing forms commonly seen on Stack Overflow. Third, the selection should avoid repeatedly selecting frequently mentioned APIs. Fourth, the same post should not appear in the dataset for different libraries (one post may be tagged with two or more studied-library tags). The first criterion ensures the fairness of comparison across libraries, the second and third criteria ensure the representativeness of API mentions in the dataset, and the fourth criterion ensures that there is no repeated data in different datasets.

TABLE 3: Basic Statistics of Our Dataset

Library    | Posts | Sentences | API mentions | Tokens
Matplotlib |   600 |      4920 |         1481 |  47317
NumPy      |   600 |      2786 |         1552 |  39321
Pandas     |   600 |      3522 |         1576 |  42267
OpenGL     |   600 |      3486 |         1674 |  70757
JDBC       |   600 |      4205 |         1184 |  50861
React      |   600 |      3110 |         1262 |  42282
Total      |  3600 |     22029 |         8729 | 292805

We finally include 3,600 posts (600 for each library) in our dataset, and each post has at least one API mention.³ Table 3 summarizes the basic statistics of our dataset. These posts have in total 22,029 sentences, 292,805 token occurrences and 41,486 unique tokens after data preprocessing and tokenization. We manually label the API mentions in the selected posts. 6,421 sentences (34.07%) contain at least one API mention. Our dataset has in total 8,729 API mentions. The selected Matplotlib, NumPy, Pandas, OpenGL, JDBC and React posts have 1,481, 1,552, 1,576, 1,674, 1,184 and 1,262 API mentions, which refer to 553, 694, 293, 201, 282 and 83 unique APIs of the six libraries, respectively. In addition, we randomly sample 100 labelled Stack Overflow posts for each library dataset. Then we examine each API mention to determine whether it is a simple name or not, and we find that among the 600 posts there are 1,634 API mentions, of which 694 (42.47%) are simple API names. Our dataset not only contains a large number of API mentions in diverse writing forms, but also contains rich discussion context around API mentions. This makes it suitable for the training of our neural architecture, the testing of its performance and the study of model transferability.

We randomly split the 600 posts of each library into three subsets by a 6:2:2 ratio: 60% as training data, 20% as validation data for tuning model hyper-parameters, and 20% as testing data to answer our research questions. Since cross-validation would cost a great deal of time and our training dataset is unbiased (shown in Section 6.4), cross-validation is not used. Our experiment results also show that the training dataset is enough to train a high-performance neural network. Besides, the performance of the model on a target library can be improved by adding more training data from other libraries. And if more training data is added, there is no need to label a new validation dataset for the target library to re-tune the hyper-parameters, because we find that the neural model has very similar performance with the same hyper-parameters under different dataset settings.

3. https://drive.google.com/drive/folders/1f7ejNVUsew9l9uPCj4Xv5gMzNbqttpoa?usp=sharing

TABLE 4: Comparison of CRF Baselines [2] and Our Neural Model for API Extraction

           |      Basic CRF       |       Full CRF       |      Our method
Library    | Prec  Recall  F1     | Prec  Recall  F1     | Prec  Recall  F1
Matplotlib | 67.35  47.84  56.62  | 89.62  61.31  72.82  | 81.5   83.92  82.7
NumPy      | 75.83  39.91  52.29  | 89.21  49.31  63.51  | 78.24  62.91  69.74
Pandas     | 62.06  70.49  66.01  | 97.11  71.21  82.16  | 82.73  85.3   82.80
OpenGL     | 43.91  91.87  59.42  | 94.57  70.62  80.85  | 85.83  83.74  84.77
JDBC       | 15.83  82.61  26.58  | 87.32  51.40  64.71  | 84.69  55.31  66.92
React      | 16.74  88.58  28.16  | 97.42  70.11  81.53  | 87.95  78.17  82.77

6.1.3 Evaluation metrics

We use precision, recall and F1-score to evaluate the performance of an API extraction method. Precision measures what percentage of the API mentions recognized by a method are correct. Recall measures what percentage of the API mentions in the testing dataset are recognized correctly by a method. F1-score is the harmonic mean of precision and recall, i.e., 2 × (precision × recall) / (precision + recall).
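Over token-level predictions, these metrics reduce to a few lines (a sketch):

def precision_recall_f1(predicted, gold):
    """predicted, gold: sets of token positions labeled as API mentions."""
    tp = len(predicted & gold)                             # correctly recognized mentions
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    return precision, recall, (2 * precision * recall / denom) if denom else 0.0

print(precision_recall_f1({(0, 1), (0, 4)}, {(0, 1), (0, 7)}))  # (0.5, 0.5, 0.5)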

6.2 Performance of Our Neural Model (RQ1)

Motivation: Recently, linear-chain CRF has been used to solve the API extraction problem in informal texts [2]. It has been shown that machine-learning based API extraction outperforms several commonly-used rule-based methods [2], [42], [43]. The approach in [2] relies on human-defined orthographic features, two different unsupervised language models (class-based Brown clustering and neural-network based word embedding) and API gazetteer features (API inventory). In contrast, our neural architecture uses neural networks to automatically learn character-, word- and sentence-level features from the input texts. The first RQ is to confirm the effectiveness of these neural-network feature extractors for API extraction tasks by comparing the overall performance of our neural architecture with the performance of the linear-chain CRF with human-defined features for API extraction.

Approach: We use the implementation of the CRF model in [2] to build two CRF baselines. The basic CRF baseline uses only the orthographic features developed in [2]. The full CRF baseline uses all orthographic, word-cluster and API gazetteer features developed in [2]. The basic CRF baseline is easy to deploy because it uses only the orthographic features in the input texts, but not any advanced word-cluster and API gazetteer features; the self-training process is not used in these two baselines since we have sufficient training data. In contrast, the full CRF baseline uses advanced features, so it is not as easy to deploy as the basic CRF. But the full CRF has much better performance than the basic CRF, as reported in [2]. Comparing our model with these two baselines, we can understand whether our model can achieve a good tradeoff between ease of deployment and high performance. Note that as Ye et al. [2] already demonstrate the superior performance of machine-learning based API extraction methods over rule-based methods, we do not consider rule-based baselines for API extraction in this study.

Results: From Table 4, we can see that:

• Although the linear CRF with full features has close performance to our model, the linear CRF with only orthographic features performs poorly in distinguishing API tokens from non-API tokens. The best F1-score of the basic CRF is only 0.66, for Pandas. The F1-score of the basic CRF for Matplotlib, NumPy and OpenGL is 0.52-0.59. For JDBC and React, the F1-score of the basic CRF is below 0.3. Our results are consistent with the results in [2] when advanced word-cluster and API-gazetteer features are ablated. Compared with the basic CRF, the full CRF performs much better. For Pandas, JDBC and React, the full CRF has very close performance to our model. But for Matplotlib, NumPy and OpenGL, our model still outperforms the full CRF by at least four points in F1-score.

• Multi-level feature embedding by our neural architecture is effective in distinguishing API tokens from non-API tokens in the resulting embedding space. All evaluation metrics of our neural architecture are significantly higher than those of the basic CRF. Although our neural architecture performs relatively worse on NumPy and JDBC, its F1-score is still much higher than the F1-score of the basic CRF on NumPy and JDBC. We examine false positives (non-API tokens recognized as API) and false negatives (API tokens recognized as non-API) by our neural architecture on NumPy and JDBC. We find that many false positives and false negatives involve API mentions composed of complex strings, for example, array expressions for NumPy or SQL statements for JDBC. It seems that neural networks learn some features from such complex strings that may confuse the model and cause the failure to tell apart API tokens from non-API ones. Furthermore, by analysing the classification results for different API mentions, we find that our model has good generalizability on different API functions.

With linear CRF for API extraction, we cannot achieve a good tradeoff between ease of deployment and high performance. In contrast, our neural architecture can extract effective character-, word- and sentence-level features from the input texts, and the extracted features alone can support high-quality API extraction without using any hand-crafted advanced features such as word clusters and API gazetteers.

6.3 Impact of Feature Extractors (RQ2)

Motivation: Although our neural architecture achieves very good performance as a whole, we would like to further investigate how much the different feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) contribute to this good performance, and how different features affect the precision and recall of API extraction.


TABLE 5: The Results of Feature Ablation Experiments

           | Ablating char-CNN    | Ablating Word Embeddings | Ablating Bi-LSTM     | All features
Library    | Prec  Recall  F1     | Prec  Recall  F1         | Prec  Recall  F1     | Prec  Recall  F1
Matplotlib | 75.70  72.32  73.98  | 81.75  63.99  71.79      | 82.59  69.34  75.44  | 81.50  83.92  82.70
NumPy      | 81.00  60.40  69.21  | 79.31  58.47  67.31      | 80.62  56.67  66.44  | 78.24  62.91  69.74
Pandas     | 80.50  83.28  81.82  | 81.77  68.01  74.25      | 83.22  75.22  79.65  | 82.73  85.30  82.80
OpenGL     | 77.58  85.37  81.29  | 83.07  68.03  75.04      | 98.52  72.08  83.05  | 85.83  83.74  84.77
JDBC       | 68.65  61.25  64.47  | 64.22  65.62  64.91      | 99.28  43.12  66.13  | 84.69  55.31  66.92
React      | 69.90  85.71  77.01  | 84.37  75.00  79.41      | 98.79  65.08  78.47  | 87.95  78.17  82.77

Approach: We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without the character-level CNN, one without word embeddings, and one without the sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors.

Results: In Table 5, we highlight in bold the largest drop in each metric for a library when ablating a feature, compared with that metric with all features. We underline the increase in each metric for a library when ablating a feature, compared with that metric with all features. We can see that:

• The performance of our neural architecture is contributed by the combined action of all its features. Ablating any of the features degrades the F1-score. Ablating word embeddings or the Bi-LSTM causes the largest drop in F1-score for four libraries, while ablating the char-CNN causes the largest drop in F1-score for two libraries. Feature ablation has a higher impact on some libraries than others. For example, ablating the char-CNN, word embeddings and Bi-LSTM all cause a significant drop in F1-score for Matplotlib, and ablating word embeddings causes a significant drop in F1-score for Pandas and React. In contrast, ablating a feature causes a relatively minor drop in F1-score for NumPy, JDBC and React. This indicates that the different levels of features learned by our neural architecture can be more distinct for some libraries but more complementary for others. When different features are more distinct from one another, achieving good API extraction performance relies more on all the features.

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating the char-CNN causes a drop in precision for five libraries (all except NumPy). The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating the char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating the char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating the char-CNN, ablating word embeddings and the Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for NumPy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating the Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating the Bi-LSTM, the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. The char-CNN is especially useful for filtering out non-API tokens, while word embeddings and the Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation: As described in Section 6.1.1, we intentionally choose three different-functionality Python libraries: Pandas, Matplotlib and NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach: We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring a Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfers (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs involving Matplotlib are relatively more-distant-libraries transfers.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL)

     |       NP→MPL         |       PD→MPL         |         MPL
     | Prec  Recall  F1     | Prec  Recall  F1     | Prec  Recall  F1
1/1  | 82.64  89.29  85.84  | 81.02  80.06  80.58  | 87.67  78.27  82.70
1/2  | 81.84  84.52  83.16  | 71.61  83.33  76.96  | 81.38  70.24  75.40
1/4  | 71.83  75.89  73.81  | 67.88  77.98  72.65  | 81.22  55.36  65.84
1/8  | 70.56  75.60  72.98  | 69.66  73.81  71.71  | 75.00  53.57  62.50
1/16 | 73.56  72.02  72.78  | 66.48  72.02  69.16  | 80.70  27.38  40.89
1/32 | 72.56  78.83  73.69  | 71.47  69.35  70.47  | 97.50  11.60  20.74
DU   | 72.54  66.07  69.16  | 76.99  54.76  64.00  |

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP)

     |       MPL→NP         |       PD→NP          |         NP
     | Prec  Recall  F1     | Prec  Recall  F1     | Prec  Recall  F1
1/1  | 86.85  77.08  81.68  | 77.51  67.50  72.16  | 78.24  62.92  69.74
1/2  | 78.8   82.08  80.41  | 70.13  67.51  68.79  | 75.13  60.42  66.97
1/4  | 93.64  76.67  80.00  | 65.88  70.00  67.81  | 77.14  45.00  56.84
1/8  | 73.73  78.33  75.96  | 65.84  66.67  66.25  | 71.03  42.92  53.51
1/16 | 76.19  66.67  71.11  | 58.33  64.17  61.14  | 57.07  48.75  52.58
1/32 | 75.54  57.92  65.56  | 60.27  56.25  58.23  | 72.72  23.33  35.33
DU   | 64.16  65.25  64.71  | 62.96  59.32  61.08  |

We use gradually-reduced target-library training data (1/1 for all data, 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison. We also use the source-library-trained model directly without any fine-tuning (i.e., 0 target-library training data) as a baseline. We also randomly select 1/16 and 1/32 training data for NumPy → Pandas and Matplotlib → Pandas 10 times and train the model each time. Then we calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score.

TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

     |       MPL→PD         |       NP→PD          |         PD
     | Prec  Recall  F1     | Prec  Recall  F1     | Prec  Recall  F1
1/1  | 84.97  89.62  87.23  | 86.86  87.61  87.23  | 80.43  85.30  82.80
1/2  | 86.18  84.43  85.30  | 83.97  83.00  83.48  | 88.01  74.06  80.43
1/4  | 81.07  87.61  84.21  | 81.07  82.70  81.88  | 76.50  76.95  76.72
1/8  | 87.54  80.98  84.13  | 85.76  76.37  80.79  | 69.30  83.29  75.65
1/16 | 82.04  78.96  80.47  | 82.45  75.79  78.98  | 84.21  50.72  53.31
1/32 | 81.31  75.22  78.14  | 81.43  72.05  76.45  | 84.25  30.84  45.14
DU   | 71.65  40.06  51.39  | 75.00  35.45  48.14  |

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

     |       MPL→PD         |       NP→PD
     | Prec  Recall  F1     | Prec  Recall  F1
1/16 | 83.33  77.81  80.43  | 83.39  75.22  79.09
1/32 | 79.50  73.78  76.53  | 86.15  71.15  78.30

Results: Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. Acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas) and DU (Direct Use). The last column is the result of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performing target-library model than training the target-library model from scratch with the same proportion of training data. This is evident as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has F1-score 69.74. However, the NumPy model transferred from the Matplotlib model has F1-score 81.68 (17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 NumPy training data for fine-tuning, the F1-score (71.11) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for a difficult dataset. Such performance boost by transfer learning has also been observed in many studies [14], [44], [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model, rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has a better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of F1-score (e.g., 3% of the F1-score for the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has F1-score 80.47.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 training data, i.e., about 10 posts, for transfer learning, the F1-score of the obtained target-library model is still improved by a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (DU row). Furthermore, the averaged results of NP → PD and MPL → PD for 10 randomly selected 1/16 and 1/32 training data are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows our training data is unbiased. Although the target-library model trained from scratch still has a reasonable F1-score with 1/2 to 1/8 training data, it becomes completely useless with 1/16 or 1/32 training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-scores in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16), compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce a much more balanced precision and recall (thus a much better F1-score), even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with F1-score > 80 using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences in API-naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning can be. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 target-library training data, compared with the target-library model trained from scratch using all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 training data, compared with the Matplotlib model trained from scratch using all training data.

• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the fewest polysemous APIs (16), while Pandas and NumPy have many more polysemous APIs. Both MPL → PD and MPL → NP result in high-quality target models even with 1/8 target-library training data. The transfer learning seems to be able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 training data to obtain a high-quality target model (F1-score 83.16), and for PD → MPL, the Matplotlib model (F1-score 80.58) obtained by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.7). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs" learned from NumPy and Pandas to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation: In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach: For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts from each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model. For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational databases). For JavaScript, we choose React (a library for web graphical user interfaces). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.

Results: Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

TABLE 10: Python → Java

     |     Python→Java      |        Java
     | Prec  Recall  F1     | Prec  Recall  F1
1/1  | 77.45  66.56  71.60  | 84.69  55.31  66.92
1/2  | 72.20  62.50  67.06  | 77.38  53.44  63.22
1/4  | 69.26  50.00  58.08  | 71.22  47.19  56.77
1/8  | 50.00  64.06  56.16  | 75.15  39.69  51.94
1/16 | 55.71  48.25  52.00  | 75.69  34.06  46.98
1/32 | 56.99  40.00  44.83  | 77.89  23.12  35.66
DU   | 44.44  28.75  34.91  |

TABLE 11: Python → JavaScript

     |  Python→JavaScript   |      JavaScript
     | Prec  Recall  F1     | Prec  Recall  F1
1/1  | 77.45  81.75  86.19  | 87.95  78.17  82.77
1/2  | 86.93  76.59  81.43  | 87.56  59.84  77.70
1/4  | 86.84  68.65  74.24  | 83.08  66.26  73.72
1/8  | 81.48  61.11  69.84  | 85.98  55.95  67.88
1/16 | 71.11  63.49  68.08  | 87.38  38.48  53.44
1/32 | 66.67  52.38  58.67  | 65.21  35.71  46.15
DU   | 51.63  25.00  33.69  |

• Directly reusing the source-language-trained model on the target-language text produces unacceptable performance. For the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16), PD → MPL (64.0), MPL → NP (64.71). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91), Python → JavaScript (33.69), Python → C (35.95). These results suggest that there is still a certain level of commonality between libraries with similar functionalities and of the same programming language, so that the knowledge learned in the source-library model may be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or we may exploit transfer learning to adapt a source-language-trained model to the target-language text.

TABLE 12: Python → C

     |      Python→C        |          C
     | Prec  Recall  F1     | Prec  Recall  F1
1/1  | 89.87  85.09  87.47  | 85.83  83.74  84.77
1/2  | 85.35  83.73  84.54  | 86.28  76.69  81.21
1/4  | 83.83  75.88  79.65  | 82.19  71.27  76.34
1/8  | 74.32  73.71  74.01  | 78.68  68.02  72.97
1/16 | 75.81  69.45  72.60  | 88.52  57.10  69.42
1/32 | 69.62  65.85  67.79  | 87.89  45.25  59.75
DU   | 56.49  27.91  35.95  |

• Across-language transfer learning holds the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 Java training data boosts the F1-score by 50%, compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-language training data. For the within-language transfer learning, achieving this result may require as little as 1/8 target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is API labeling errors in the dataset. In order to reduce errors, the first two authors first independently label the same data and then resolve disagreements through discussion. Furthermore, in our experiments, we have to examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API-related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs such as camelcase, special characters (e.g., . or ()) and API annotations.

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF-based approach for extracting mentions of software entities such as programming languages, libraries, computing concepts and APIs from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features have superior performance over human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture on many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C Parnin C Treude L Grammel and M-A Storey ldquoCrowddocumentation Exploring the coverage and the dynamics of apidiscussions on stack overflowrdquo

[2] D Ye Z Xing C Y Foo J Li and N Kapre ldquoLearning to extractapi mentions from informal natural language discussionsrdquo in Soft-ware Maintenance and Evolution (ICSME) 2016 IEEE InternationalConference on IEEE 2016 pp 389ndash399

[3] B Dagenais and M P Robillard ldquoRecovering traceability linksbetween an api and its learning resourcesrdquo in Software Engineering(ICSE) 2012 34th International Conference on IEEE 2012 pp 47ndash57

[4] S Subramanian L Inozemtseva and R Holmes ldquoLive api doc-umentationrdquo in Proceedings of the 36th International Conference onSoftware Engineering ACM 2014 pp 643ndash652

[5] C Chen Z Xing and X Wang ldquoUnsupervised software-specificmorphological forms inference from informal discussionsrdquo in Pro-ceedings of the 39th International Conference on Software EngineeringIEEE Press 2017 pp 450ndash461

[6] X Chen C Chen D Zhang and Z Xing ldquoSethesaurus Wordnetin software engineeringrdquo IEEE Transactions on Software Engineer-ing 2019

[7] H Li S Li J Sun Z Xing X Peng M Liu and X Zhao ldquoIm-proving api caveats accessibility by mining api caveats knowledgegraphrdquo in Software Maintenance and Evolution (ICSME) 2018 IEEEInternational Conference on IEEE 2018

[8] Z X D L X W Qiao Huang Xin Xia ldquoApi method recom-mendation without worrying about the task-api knowledge gaprdquoin Automated Software Engineering (ASE) 2018 33th IEEEACMInternational Conference on IEEE 2018

[9] B Xu Z Xing X Xia and D Lo ldquoAnswerbot automated gen-eration of answer summary to developersz technical questionsrdquoin Proceedings of the 32nd IEEEACM International Conference onAutomated Software Engineering IEEE Press 2017 pp 706ndash716

[10] D Ye Z Xing C Y Foo Z Q Ang J Li and N Kapre ldquoSoftware-specific named entity recognition in software engineering socialcontentrdquo in 2016 IEEE 23rd International Conference on SoftwareAnalysis Evolution and Reengineering (SANER) vol 1 IEEE 2016pp 90ndash101

[11] E F Tjong Kim Sang and F De Meulder ldquoIntroduction to the conll-2003 shared task Language-independent named entity recogni-tionrdquo in Proceedings of the seventh conference on Natural languagelearning at HLT-NAACL 2003-Volume 4 Association for Compu-tational Linguistics 2003 pp 142ndash147

[12] R Caruana ldquoLearning many related tasks at the same time withbackpropagationrdquo in Advances in neural information processing sys-tems 1995 pp 657ndash664

[13] A Arnold R Nallapati and W W Cohen ldquoExploiting featurehierarchy for transfer learning in named entity recognitionrdquo Pro-ceedings of ACL-08 HLT pp 245ndash253 2008

[14] J Yosinski J Clune Y Bengio and H Lipson ldquoHow transferableare features in deep neural networksrdquo in Advances in neuralinformation processing systems 2014 pp 3320ndash3328

[15] Z X Deheng Ye Lingfeng Bao and S-W Lin ldquoApireal An apirecognition and linking approach for online developer forumsrdquoEmpirical Software Engineering 2018 2018

[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78.

[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.

[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744–755.

[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.

[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.

[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions – incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338–348.

[22] C. Chen and Z. Xing, "SimilarTech: Automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834–839.

[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155–1194, 2019.

[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: Distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214–224.

[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177–2185.

[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for Twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555–1565.

[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.

[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: A deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.

[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414–421.

[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1–32, 2019.

[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.

[39] S. J. Pan, Q. Yang et al., "A survey on transfer learning."

[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering – Volume 1. ACM, 2010, pp. 375–384.

[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE'09. 16th Working Conference on. IEEE, 2009, pp. 205–214.

[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.

[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717–1724.

[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.

[47] A. Marcus, J. Maletic et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125–135.

[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59–68.

[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480–483.

[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307–318.

[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392–403.

[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832–841.

[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13–22.

[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476–479.

[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741–2749.

[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211–1220.

[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.

[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.

[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.

[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.

[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Faculty of Information Technology at Monash University. He has research interests in the areas of software engineering, deep learning and human-computer interaction. He is currently focusing on improving the usability and accessibility of mobile applications. He received the BS and MS degrees from Beijing Technology and Business University and the Australian National University in 2016 and 2018, respectively. He will be a PhD student at Monash University in 2020, under the supervision of Chunyang Chen.

Zhenchang Xing is a Senior Lecturer in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012 to 2016. Before joining NTU, Dr. Xing was a Lee Kuan Yew Research Fellow in the School of Computing, National University of Singapore, from 2009 to 2012. Dr. Xing's current research area is in the interdisciplinary areas of software engineering, human-computer interaction and applied data analytics. Dr. Xing has over 100 publications in peer-refereed journals and conference proceedings and has received several distinguished paper awards from top software engineering conferences. Dr. Xing regularly serves on the organization and program committees of the top software engineering conferences, and he will be the program committee co-chair for ICSME 2020.

Chunyang Chen is a lecturer (Assistant Professor) in the Faculty of Information Technology, Monash University, Australia. His research focuses on software engineering, deep learning and human-computer interaction. He has published over 16 papers in refereed journals or conferences, including Empirical Software Engineering, ICSE, ASE, CSCW, ICSME and SANER. He is a member of IEEE and ACM. He received the ACM SIGSOFT distinguished paper award in ASE 2018, the best paper award in SANER 2016, and the best tool demo in ASE 2016. https://chunyang-chen.github.io


Cheng Chen received his Bachelor degree in Software Engineering from Northwest University (China) and his Master degree of Computing (major in Artificial Intelligence) from the Australian National University. He did some internship projects about Natural Language Processing at CSIRO from 2016 to 2017. He worked as a research assistant supervised by Dr. Zhenchang Xing at ANU in 2018. Cheng currently works in the PricewaterhouseCoopers (PwC) firm as a senior algorithm engineer of Natural Language Processing and Data Analyzing. Cheng is interested in Named Entity Extraction, Relation Extraction, Text Summarization and parallel computing. He is working on knowledge engineering and transaction for NLP tasks.

Lizhen Qu is a research fellow in the Dialogue Research lab in the Faculty of Information Technology at Monash University. He has extensive research experience in the areas of natural language processing, multimodal learning, deep learning and cybersecurity. He is currently focusing on information extraction, semantic parsing and multimodal dialogue systems. Prior to joining Monash University, he worked as a research scientist at Data61/CSIRO, where he led and participated in several research and industrial projects, including Deep Learning for Text and Deep Learning for Cyber.

Guoqiang Li is now an associate professor in the School of Software, Shanghai Jiao Tong University, and a guest associate professor in Kyushu University. He received the BS, MS and PhD degrees from Taiyuan University of Technology, Shanghai Jiao Tong University, and Japan Advanced Institute of Science and Technology in 2001, 2005 and 2008, respectively. He worked as a postdoctoral research fellow in the graduate school of information science, Nagoya University, Japan, during 2008-2009, as an assistant professor in the School of Software, Shanghai Jiao Tong University, during 2009-2013, and as an academic visitor in the Department of Computer Science, University of Oxford, during 2015-2016. His research interests include formal verification, programming language theory, and data analytics and intelligence. He has published more than 40 research papers in international journals and mainstream conferences, including TDSC, SPE, TECS, IJFCS, SCIS, CSCW, ICSE, FORMATS and ATVA.


TABLE 1: Statistics of Polysemous APIs

Library     #APIs   #Polysemous APIs   Percentage
Matplotlib  3877    622                16.04%
Pandas      774     426                55.04%
Numpy       2217    917                41.36%
Opengl      850     52                 6.12%
React       238     157                65.97%
JDBC        1468    633                43.12%

The remainder of the paper is organized as follows. Section 2 reports our empirical studies of the characteristics of API names, API mentions and discussion contexts. Section 3 defines the problem of API extraction. Section 4 and Section 5 describe our neural architecture for API extraction and the system implementation, respectively. Section 6 reports our experiment results and findings. Section 7 reviews the related work. Section 8 concludes our work and discusses the future work.

2 EMPIRICAL STUDIES OF API-NAMING AND API-MENTION CHARACTERISTICS

In this work, we aim to develop a machine learning based API extraction method that is not only effective but also easy-to-deploy across programming languages and libraries. To understand the challenges in achieving this objective and the potential solution space, we conduct empirical studies of the characteristics of API names, of API mentions in informal texts, and of the discussion contexts in which APIs are mentioned.

We study six libraries: three Python libraries, Matplotlib (data visualization), Pandas (data analysis) and Numpy (numeric computation); one C library, OpenGL (computer graphics); one JavaScript library, React (graphical user interface); and one Java library, JDBC (database access). These libraries come from four popular programming languages, and they support very diverse functionalities for computer programming.

First, we crawl the API declarations of these libraries from their official websites. When different APIs have the same simple name but different arguments in the same library, we treat such APIs as the same. We examine each API name to determine if the simple name of an API is a common word (e.g., apply, series, draw) that can be found in a general English dictionary. We find that different libraries have different percentages of APIs with common-word simple names: OpenGL (6%), Matplotlib (16%), Numpy (41%), JDBC (43%), Pandas (55%), React (66%). When these APIs are mentioned by their common-word simple names, neither character- nor word-level features can help to distinguish such polysemous API mentions from common words. We must resort to the discussion contexts of API mentions.

Second, by checking post tags, we randomly sample 200 Stack Overflow posts for each of the six libraries. We manually label API mentions in these posts. We examine three characteristics of API mentions: whether API mentions contain explicit orthographic features (package or module names, parentheses and/or dot), whether API mentions are long tokens (> 10 characters), and whether the context windows (preceding and succeeding 5 words) around the API mentions contain common verbs and nouns (use, call, function, method).

TABLE 2: Statistics of API-Mention Characteristics (% of sampled API mentions)

Library     Orthographic   Long tokens   Common context words
Matplotlib  62.38          21.56         20.64
Pandas      67.11          32.22         34.23
Numpy       65.63          26.87         23.53
Opengl      33.73          39.36         20.80
React       75.56          20.00         7.93
JDBC        26.36          61.82         8.11
Average     55.13          33.64         19.21

Table 2 shows our analysis results. On average, 55.13% of API mentions contain explicit orthographic features and 33.64% of API mentions are long tokens. Character-level features would be useful for recognizing these API mentions. However, for the significant amount of API mentions that do not have such explicit character-level features, we need to resort to word- and/or sentence-context features, for example, the words (e.g., use, call, function, method) that often appear in the context window of an API mention, to recognize API mentions.

Furthermore, we can observe that API-mention characteristics are not tightly coupled with a particular programming language or library. Instead, all six libraries exhibit a certain degree of the three API-mention characteristics, but the specific degrees vary across libraries. Fig. 2 visualizes the top 50 frequently-used words in the discussions of the six libraries. We can observe that discussions of different libraries share common words, but at the same time use library-specific words (e.g., dataframe for Pandas, matrix for Numpy, figure for Matplotlib, query for JDBC, render for OpenGL, event for React). The commonalities of API-mention characteristics across libraries indicate the feasibility of transfer learning. For example, orthographic, word and/or sentence-context features learned from a source library could be applicable to a target library. However, due to the variations of API-name, API-mention and discussion-context characteristics across libraries, directly applying the source-library-trained model to the target library may suffer from performance degradation, unless the source and target libraries have very similar characteristics. Therefore, fine-tuning the source-library-trained model with a small amount of target-library text would be necessary.

3 PROBLEM DEFINITION

In this work, we take as input informal software text (e.g., Stack Overflow posts) that discusses the usage and issues of a particular library. We assume that the software text of multiple programming languages and libraries needs to be processed. Given a paragraph of informal software text, our task is to recognize all API mentions (if any) in the paragraph, as illustrated in the example in Fig. 1. API mentions refer to tokens in the paragraph that represent public modules, classes, methods or functions of a particular library. To preserve the integrity of code-like tokens, an API mention is defined as a single token, rather than a span of tokens, when the given text is tokenized properly.

As our input is informal software text, APIs may not be consistently mentioned in their formal full names. Instead, APIs may be mentioned in abbreviations or synonyms, such as pandas's dataframe for pandas.DataFrame, or df.apply for pandas.DataFrame.apply().


Fig. 2: Word Clouds of the Top 50 Frequent Words in the Discussions of the Six Libraries

APIs may also be mentioned by their simple names, such as apply, series, dataframe, that can also be common English words or computing terms in the text. Furthermore, we do not assume that API mentions will be consistently annotated with special tags. Therefore, our approach takes plain text as input.

A related task to our work is API linking. API extraction methods classify whether a token in text is an API mention or not, while API linking methods link API mentions in text to API entities in a knowledge base [15]. That is, API extraction is the prerequisite for API linking. This work deals with only API extraction.

4 OUR NEURAL ARCHITECTURE

We formulate the task of API extraction in informal software texts as a sequence labeling problem, and present a neural architecture that labels each token in an input text as API or non-API. As shown in Fig. 3, our neural architecture is composed of four main components: 1) a character-level Convolutional Neural Network (CNN) for extracting character-level features of a token (Section 4.1); 2) an unsupervised word embedding model for learning word semantics of a token (Section 4.2); 3) a Bidirectional Long Short-Term Memory network (Bi-LSTM) for extracting sentence-context features (Section 4.3); and 4) a softmax classifier for predicting the API or non-API label of a token (Section 4.4). Our neural model can be trained end-to-end with pairs of input texts and their corresponding API/non-API label sequences. A model trained with one library's text can be transferred to another library's text by fine-tuning the source-library-trained components with the target library's text (Section 4.5).

Fig. 3: Our Neural Architecture for API Extraction

4.1 Extracting Char-Level Features by CNN

API mentions often have morphological features that distinguish them from normal English words. Such morphological features may appear in the beginning of a token (e.g., the first-letter capitalization in System), in the middle (e.g., the underscore in read_csv, the dot in pandas.series, the left parenthesis and comma in print(a,32)), or at the end (e.g., the right parenthesis in apply()). Morphological features may also appear in combination, such as the camelcase writing like AbstractAction, or a pair of parentheses like plot(). The long length of some tokens, like create_dataset, is one important morphological feature as well. Due to the lack of universal naming conventions across libraries and the wide presence of informal writing forms, informative morphological features of API mentions often vary from one library's text to another.

Fig. 4: Our Character Vocabulary

Fig. 5: Character-Level CNN

Robust methods to extract morphological features from tokens must take into account all characters of the token and determine which features are more important for a particular library's APIs [16]. To that end, we use a character-level CNN [17], which extracts local features in N-gram characters of the token using a convolution operation and then combines them using a max-pooling operation to create a fixed-sized character-level embedding of the token [18], [19].

Let $V^{char}$ be the vocabulary of characters for the software texts from which we want to extract API mentions. In this work, $V^{char}$ for all of our models consists of 92 characters, including 26 English letters (both upper and lower case), 10 digits, and 30 other characters (e.g., '['), as listed in Fig. 4. Note that $V^{char}$ can be easily extended for different datasets. Let $E^{char} \in \mathbb{R}^{d^{char} \times |V^{char}|}$ be the character embedding matrix, where $d^{char}$ is the dimension of character embeddings and $|V^{char}|$ is the vocabulary size (92 in this work). As illustrated in Fig. 3, $E^{char}$ can be regarded as a dictionary of character embeddings, in which a $d^{char}$-dimensional column vector corresponds to a particular character. The character embeddings are initialized as one-hot vectors and then learned during the training of the character-level CNN. Given a character $c \in V^{char}$, its embedding $e_c$ can be retrieved by the matrix-vector product $e_c = E^{char} v_c$, where $v_c$ is a one-hot vector of size $|V^{char}|$ which has value 1 at index $c$ and zero in all other positions.

Fig. 5 presents the architecture of our character-level CNN. Given a token $w$ in the input text, let's assume $w$ is composed of $M$ characters $c_1, c_2, \ldots, c_M$. We first obtain a sequence of character embeddings $e_{c_1}, e_{c_2}, \ldots, e_{c_M}$ by looking up the character embedding matrix $E^{char}$. This sequence of character embeddings (zero-padded at the beginning and the end of the sequence) is the input to our character-level CNN. In our application of CNN, because each character is represented as a $d^{char}$-dimensional vector, we use convolution filters with widths equal to the dimensionality of the character embeddings (i.e., $d^{char}$). Then we can vary the height $h$ (or window size) of the filter, i.e., the number of adjacent characters considered jointly in the convolution operation.

Let $z_m$ be the concatenation of the character embeddings of $c_m$ ($1 \le m \le M$), the $(h-1)/2$ left neighbors of $c_m$, and the $(h-1)/2$ right neighbors of $c_m$. A convolution operation involves a filter $W \in \mathbb{R}^{h \cdot d^{char}}$ (a matrix of $h \times d^{char}$ dimensions) and a bias term $b$, which is applied repeatedly to each character window of size $h$ in the input sequence $c_1, c_2, \ldots, c_M$:

$o_m = \mathrm{ReLU}(W^T \cdot z_m + b)$

where $\mathrm{ReLU}(x) = \max(0, x)$ is the non-linear activation function. The convolution operations produce an $M$-dimensional feature map for a particular filter. A 1D max-pooling operation is then applied to the feature map to extract a scalar (i.e., a feature vector of length 1) with the maximum value in the $M$ dimensions of the feature map.

The convolution operation extracts local features within each character window of the given token, and using the max over all character windows of the token, we extract a global feature for the token. We can apply $N$ filters to extract different features from the same character window. The output of the character-level CNN is an $N$-dimensional feature vector representing the character-level embedding of the given token. We denote this embedding as $e^{char}_w$ for the token $w$.

In our character-level CNN, the matrix $E^{char}$, the filters $W$ and the bias terms $b$ are parameters to be learned. The dimensionality of the character embedding $d^{char}$, the number of filters $N$ and the window size of the filters $h$ are hyper-parameters to be chosen by the user (see Section 5.2 for model configuration).
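To make this concrete, the following is a minimal sketch of such a character-level CNN in PyTorch. It is our illustration rather than the paper's released implementation, and the class and variable names are ours; the hyper-parameter values follow Section 5.2 (a 92-character vocabulary, d_char = 92, N = 40 filters, window size h = 3).

import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size=92, d_char=92, n_filters=40, h=3):
        super().__init__()
        # Character embeddings, initialized as one-hot vectors and then learned
        # (the one-hot initialization works here because d_char == vocab_size).
        self.embed = nn.Embedding(vocab_size, d_char)
        self.embed.weight.data.copy_(torch.eye(vocab_size))
        # Convolution over h adjacent characters; the filter width equals d_char.
        self.conv = nn.Conv1d(d_char, n_filters, kernel_size=h,
                              padding=(h - 1) // 2)

    def forward(self, char_ids):          # char_ids: (batch, M) character indices
        x = self.embed(char_ids)          # (batch, M, d_char)
        x = x.transpose(1, 2)             # (batch, d_char, M) as Conv1d expects
        o = torch.relu(self.conv(x))      # (batch, N, M): one feature map per filter
        e_char, _ = o.max(dim=2)          # 1D max-pooling over all M windows
        return e_char                     # (batch, N) character-level embedding

The zero-padding at the beginning and the end of the character sequence corresponds to the padding argument of the convolution.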

4.2 Learning Word Semantics by Word Embedding

In informal software texts, the same API is often mentioned in many non-standard abbreviations and synonyms [5]. For example, the Pandas library is often written as pd, and its module DataFrame is often written as df. Furthermore, there is a lack of consistent use of verb, noun and preposition in the discussions [2]. For example, in the sentences "I have decided to use apply ...", "if you run apply on a series ..." and "I tested with apply ...", users refer to a Pandas's method apply(), but their descriptions vary greatly.

Such variations result in the out-of-vocabulary (OOV) issue for a machine learning model [20], i.e., variations that have not been seen in the training data. For the easy deployment of a machine learning model for API extraction, it is impractical to address the OOV issue by manually labeling a huge amount of data and developing a comprehensive gazetteer of API names and their common variations [2]. However, without the knowledge about variations of semantically similar words, the trained model will be restricted to the examples that it sees in the training data.

To address this dilemma, we propose to exploit unsupervised word-embedding models [21], [22], [23], [24] to learn distributed word representations from a large amount of unlabeled text discussing a particular library. Many studies [25], [26], [27] have shown that distributed word representations can capture rich semantics of words, such that semantically similar words will have similar word embeddings.

Let $V^{word}$ be the vocabulary of words for the corpus of software texts to be processed. As illustrated in Fig. 3, word-level embeddings are encoded by column vectors in a word embedding matrix $E^{word} \in \mathbb{R}^{d^{word} \times |V^{word}|}$, where $d^{word}$ is the dimension of word embeddings and $|V^{word}|$ is the vocabulary size. Each column $E^{word}_i \in \mathbb{R}^{d^{word}}$ corresponds to the word-level embedding of the $i$-th word in the vocabulary $V^{word}$. We can obtain a token $w$'s word-level embedding $e^{word}_w$ by looking up $E^{word}$ with the word $w$, i.e., $e^{word}_w = E^{word} v_w$, where $v_w$ is a one-hot vector of size $|V^{word}|$ which has value 1 at index $w$ and zero in all other positions. The matrix $E^{word}$ is to be learned using unsupervised word embedding models (e.g., GloVe [28]), and $d^{word}$ is a hyper-parameter to be chosen by the user (see Section 5.2 for model configuration).
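The one-hot lookup above is simply column selection in $E^{word}$, as the following toy numpy illustration shows (a hypothetical 3-word vocabulary with $d^{word} = 4$):

import numpy as np

E_word = np.random.rand(4, 3)          # word embedding matrix (d_word x |V_word|)
w = 2                                  # index of a word in the vocabulary
v_w = np.zeros(3)
v_w[w] = 1.0                           # one-hot vector for the word
e_w = E_word @ v_w                     # matrix-vector product ...
assert np.allclose(e_w, E_word[:, w])  # ... equals selecting column w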

In this work, we adopt the Global Vectors for Word Representation (GloVe) method [28] to learn the matrix $E^{word}$. GloVe is an unsupervised algorithm for learning word representations based on the statistics of word co-occurrences in an unlabeled text corpus. It calculates the word embeddings based on a word co-occurrence matrix $X$. Each row in the word co-occurrence matrix corresponds to a word and each column corresponds to a context. $X_{ij}$ is the frequency of word $i$ co-occurring with word $j$, and $X_i = \sum_k X_{ik}$ ($1 \le k \le |V^{word}|$) is the total number of occurrences of word $i$ in the corpus. The probability of word $j$ occurring in the context of word $i$ is $P_{ij} = P(j|i) = X_{ij}/X_i$. We have $\log(P_{ij}) = \log(X_{ij}) - \log(X_i)$.

GloVe defines $\log(P_{ij}) = e^T_{w_i} e_{w_j}$, where $e_{w_i}$ and $e_{w_j}$ are the word embeddings to be learned for the words $w_i$ and $w_j$. This gives the constraint for each word pair as $\log(X_{ij}) = e^T_{w_i} e_{w_j} + b_i + b_j$, where $b$ is the bias term for $e_w$. The cost function for minimizing the loss of word embeddings is defined as:

$\sum_{i,j=1}^{|V^{word}|} f(X_{ij}) \left( e^T_{w_i} e_{w_j} + b_i + b_j - \log(X_{ij}) \right)^2$

where $f(X_{ij})$ is a weighting function. That is, GloVe learns word embeddings by a weighted least squares regression model.
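For illustration, this objective can be transcribed directly into numpy as below. This is our sketch, not part of the paper: the weighting function f follows the form given in the GloVe paper (with x_max = 100 and alpha = 0.75), which the text above does not spell out.

import numpy as np

def glove_loss(E, b, X, x_max=100.0, alpha=0.75):
    # E: (|V|, d) word embeddings; b: (|V|,) biases; X: (|V|, |V|) co-occurrence counts.
    V = X.shape[0]
    loss = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:                          # only observed pairs contribute
                continue
            f = min((X[i, j] / x_max) ** alpha, 1.0)  # weighting function f(X_ij)
            err = E[i] @ E[j] + b[i] + b[j] - np.log(X[i, j])
            loss += f * err ** 2                      # weighted least squares
    return loss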

4.3 Extracting Sentence-Context Features by Bi-LSTM

In informal software texts, many API mentions cannot be reliably recognized by simply examining a token's character-level features and word semantics. This is because many APIs are named using common English words (e.g., series, apply, plot) or common computing terms (e.g., dataframe, sigmoid, histogram, argmax, zip, list). When such APIs are mentioned by their simple names, this results in a common-word polysemy issue for API extraction [2]. In such situations, we have to disambiguate the API sense of a common word from the normal sense of the word.

To that end, we have to look into the sentence context in which an API is mentioned. For example, by looking into the sentence contexts of the two sentences in Fig. 1, we can determine that the "apply" in the first sentence is an API mention, while the "apply" in the second sentence is not an API mention. Note that both the preceding and the succeeding context of the token "apply" are useful for disambiguating the API or the normal sense of the token.

We use a Recurrent Neural Network (RNN) to extract sentence-context features for disambiguating the API or the normal sense of the word [29]. RNN is a class of neural networks where connections between units form directed cycles, and it is widely used in the software engineering domain [30], [31], [32], [33]. Due to this nature, it is especially useful for tasks involving sequential inputs [29] like sentences. In our task, we adopt a Bidirectional RNN (Bi-RNN) architecture [34], [35], which is composed of two LSTMs: one takes input from the beginning of the text forward till a particular token, while the other takes input from the end of the text backward till that token.

The input to an RNN is a sequence of vectors. In our task, we obtain the input vector of a token in the input text by concatenating the character-level embedding of the token $w$ and the word-level embedding of the token $w$, i.e., $e^{char}_w \oplus e^{word}_w$. An RNN recursively maps an input vector $x_t$ and a hidden state $h_{t-1}$ to a new hidden state $h_t$: $h_t = f(h_{t-1}, x_t)$, where $f$ is a non-linear activation function (e.g., an LSTM unit used in this work). A hidden state is a vector $e^{sent} \in \mathbb{R}^{d^{sent}}$ summarizing the sentence-level features till the input $x_t$, where $d^{sent}$ is the dimension of the hidden state vector to be chosen by the user. We denote $e^{sent}_f$ and $e^{sent}_b$ as the hidden states computed by the forward LSTM and the backward LSTM after reading the end of the preceding and the succeeding sentence context of a token, respectively. $e^{sent}_f$ and $e^{sent}_b$ are concatenated into one vector $e^{sent}_w$ as the Bi-LSTM output for the corresponding token $w$ in the input text.

As an input text (e.g., a Stack Overflow post) can be a long text, modeling long-range dependencies in the text is crucial for our task. For example, a mention of a library name at the beginning of a post could be important for detecting a mention of a method of this library later in the post. Therefore, we adopt the LSTM unit [36], [37] in our RNN. The LSTM is designed to cope with the gradient vanishing problem in RNN. An LSTM unit consists of a memory cell and three gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, some contexts can be cleared by the forget gate. The memory cell and the three gates have weights and bias terms to be learned during model training. The Bi-LSTM can extract the context feature. For instance, in the Pandas library sentence "This can be accomplished quite simply with the DataFrame method apply", based on the context information from the word method, the Bi-LSTM can help our model classify the word apply as an API mention, and the learning can be transferred across different languages and libraries with transfer learning.

4.4 API Labeling by Softmax Classification

Given the Bi-LSTM's output vector $e^{sent}_w$ for a token $w$ in the input text, we train a binary softmax classifier to predict whether the token is an API mention or a normal English word. That is, we have two classes in this work, i.e., API or non-API. Softmax predicts the probability for the $j$-th class given a token's vector $e^{sent}_w$ by:

$P(j|e^{sent}_w) = \frac{\exp(e^{sent}_w W^T_j)}{\sum_{k=1}^{2} \exp(e^{sent}_w W^T_k)}$

where the vectors $W_k$ ($k = 1$ or $2$) are parameters to be learned.
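Assembling Sections 4.1 to 4.4, the end-to-end model could be sketched as follows. This is again our PyTorch illustration (reusing the CharCNN sketch from Section 4.1) with the Section 5.2 hyper-parameters: 200-dimensional word embeddings, 40 char-CNN filters, 50 hidden LSTM units per direction, and dropout 0.5.

import torch
import torch.nn as nn

class APIExtractor(nn.Module):
    def __init__(self, char_cnn, word_vocab=40000, d_word=200,
                 d_sent=50, n_classes=2):
        super().__init__()
        self.char_cnn = char_cnn                            # CharCNN from Section 4.1
        self.word_embed = nn.Embedding(word_vocab, d_word)  # initialized from GloVe
        self.bilstm = nn.LSTM(d_word + 40, d_sent,
                              bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.5)                      # mitigate overfitting [40]
        self.classifier = nn.Linear(2 * d_sent, n_classes)  # softmax head

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, T) tokens; char_ids: (batch, T, M) characters per token.
        B, T, M = char_ids.shape
        e_char = self.char_cnn(char_ids.view(B * T, M)).view(B, T, -1)
        e_word = self.word_embed(word_ids)                  # (B, T, d_word)
        x = torch.cat([e_char, e_word], dim=-1)             # e_char (+) e_word per token
        h, _ = self.bilstm(x)                               # (B, T, 2 * d_sent)
        return self.classifier(self.dropout(h))             # per-token API/non-API logits

The softmax itself is applied inside the cross-entropy loss during training (see Section 5.3).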

4.5 Library Adaptation by Transfer Learning

Transfer learning [12], [38] is a machine learning method that stores knowledge gained while solving one problem and applies it to a different but related problem. It helps to reduce the amount of training data required for the target problem. When transfer learning is used in deep learning, it has been shown to be beneficial for narrowing down the scope of possible models on a task, by using the weights of a trained model on a different but related task [14]; with shared parameters, transfer learning can help the model adapt the shared knowledge in similar contexts [39].

API extraction tasks for different libraries can be regarded as a set of different but related tasks. Therefore, we use transfer learning to adapt a neural model trained with one library's text (source-library-trained model) for another library's text (target library). Our neural architecture is composed of four main components: character-level CNN, word embeddings, sentence Bi-LSTM and softmax classifier. We can transfer all or some source-library-trained model components to a target-library model. Without transfer learning, the parameters of a target-library model component will be randomly initialized and then learned using the target-library training data, i.e., trained from scratch. With transfer learning, we use the parameters of source-library-trained model components to initialize the parameters of the corresponding target-library model components. After transferring the model parameters, we can either freeze the transferred parameters or fine-tune the transferred parameters using the target-library training data.
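In code, the parameter transfer plus the freeze-or-fine-tune choice might look like the following sketch (the file name is hypothetical, and we assume the source and target models share the same architecture and vocabularies):

import torch

# Initialize the target-library model with source-library-trained weights.
target_model = APIExtractor(char_cnn=CharCNN())
source_state = torch.load("pandas_model.pt")   # hypothetical source model file
target_model.load_state_dict(source_state)

# Option 1: freeze some transferred components, e.g., char-CNN and word embeddings.
for p in target_model.char_cnn.parameters():
    p.requires_grad = False
target_model.word_embed.weight.requires_grad = False

# Option 2 (fine-tuning): leave requires_grad = True for all components and
# continue training on the (small) target-library training data.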

5 SYSTEM IMPLEMENTATION

This section describes the current implementation² of our neural architecture for API extraction.

5.1 Preprocessing and Tokenizing Input Text

Our current implementation takes as input the content of a Stack Overflow post. As we want to recognize API mentions within natural language sentences, we remove stand-alone code snippets in <pre><code> tags. We then remove all HTML tags in the post content to obtain the plain-text input (see Section 3 for the justification of taking plain text as input). We develop a sentence parser to split the preprocessed post text into sentences by punctuation. Following [2], we develop a software-specific tokenizer to tokenize the sentences. This tokenizer preserves the integrity of code-like tokens. For example, it treats matplotlib.pyplot.imshow() as a single token, instead of the sequence of 7 tokens "matplotlib", ".", "pyplot", ".", "imshow", "(", ")" produced by general English tokenizers.

2. https://github.com/JOJO201/API_Extraction
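The paper does not list the tokenizer's rules, but a minimal regex-based sketch of such a software-specific tokenizer (our illustration, not the authors' implementation) is:

import re

# Try code-like tokens (dotted names, calls with parentheses) before
# falling back to ordinary words and punctuation.
TOKEN_RE = re.compile(
    r"[A-Za-z_][\w.]*\(.*?\)"    # calls like matplotlib.pyplot.imshow()
    r"|[A-Za-z_][\w.]*\w"        # dotted/underscored names like pandas.series
    r"|\w+|[^\w\s]"              # plain words and punctuation
)

def tokenize(sentence):
    return TOKEN_RE.findall(sentence)

print(tokenize("Try matplotlib.pyplot.imshow() to show the image."))
# ['Try', 'matplotlib.pyplot.imshow()', 'to', 'show', 'the', 'image', '.']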

5.2 Model Configuration

We now describe the hyper-parameter settings used in our current implementation. These hyper-parameters are tuned using the validation data during model training (see Section 6.1.2 for the description of our dataset). We find that the neural model has very similar performance across the six libraries' datasets with the same hyper-parameters. Therefore, we keep the same hyper-parameters for the six libraries' datasets, which also avoids the difficulty of scaling different hyper-parameters.

5.2.1 Character-level CNN

We set the filter window size $h = 3$. That is, the convolution operation extracts local features from 3 adjacent characters at a time. The size of our current character vocabulary $V^{char}$ is 92. Thus we set the dimensionality of the character embedding $d^{char}$ at 92. We initialize the character embeddings with one-hot vectors (1 at one character index and zero in all other dimensions). The character embeddings will be updated through back propagation during the training of the character-level CNN. We experiment with 5 different values of $N$ (the number of filters): 20, 40, 60, 80, 100. With $N = 40$, the CNN has an acceptable performance on the validation data. With $N = 60$ and above, the CNN has almost the same performance as $N = 40$, but it takes more training epochs to reach the acceptable performance. Therefore, we use $N = 40$. That is, the character-level embedding of a token has the dimension 40.

5.2.2 Pre-trained word embeddings

Our experiments involve six libraries: Pandas, Matplotlib, NumPy, OpenGL, React and JDBC. We collect all questions tagged with these six libraries and all answers to these questions in the Stack Overflow data dump released on March 18, 2018. We obtain a text corpus of 380,971 posts. We use the same preprocessing and tokenization steps as described in Section 5.1 to preprocess and tokenize the content of these posts. Then we use GloVe [28] to learn word embeddings from this text corpus. We set the word vocabulary size $|V^{word}|$ at 40,000. The training epoch is set at 100 to ensure the sufficient training of word embeddings. We experiment with four different dimensions of word embeddings $d^{word}$: 50, 100, 200, 400. We use $d^{word} = 200$ in our current implementation, as it produces a good balance between the training efficiency and the quality of word embeddings for API extraction on the validation data. We also experiment with pre-trained word embeddings learned from all Stack Overflow posts, which does not significantly affect the API extraction performance but requires a much longer time for text preprocessing and word embeddings learning.


5.2.3 Sentence-context Bi-LSTM

For the RNN, we use 50 hidden LSTM units to store the hidden states. The dimension of the hidden state vector is 50. Therefore, the dimension of the output vector $e^{sent}_w$ is 100 (concatenating forward and backward hidden states). In order to mitigate overfitting [40], a dropout layer is added on the output of the Bi-LSTM, with the dropout rate 0.5.

5.3 Model Training

To train our neural model, we use input text and its corresponding sequence of API/non-API labels (see Section 6.1.2 for our data labeling process). The optimizer used is Adam [41], which performs well for training RNNs including Bi-LSTM [35]. The training epoch is set at 40 times. The best-performance model on the validation set is saved for testing. This model's parameters are also saved for the transfer learning experiments.
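A minimal training-loop sketch consistent with these settings (Adam, 40 epochs, best model on validation saved) is shown below; train_loader, valid_loader and the f1_on validation helper are assumptions, and APIExtractor/CharCNN come from the earlier sketches.

import torch
import torch.nn as nn

model = APIExtractor(char_cnn=CharCNN())
optimizer = torch.optim.Adam(model.parameters())     # Adam optimizer [41]
criterion = nn.CrossEntropyLoss()                    # softmax + NLL over the 2 classes

best_f1 = 0.0
for epoch in range(40):                              # training epoch set at 40
    model.train()
    for word_ids, char_ids, labels in train_loader:  # assumed DataLoader
        optimizer.zero_grad()
        logits = model(word_ids, char_ids)           # (B, T, 2)
        loss = criterion(logits.view(-1, 2), labels.view(-1))
        loss.backward()
        optimizer.step()
    f1 = f1_on(model, valid_loader)                  # assumed validation helper
    if f1 > best_f1:                                 # keep the best model on validation
        best_f1 = f1
        torch.save(model.state_dict(), "best_model.pt")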

6 EXPERIMENTS

We conduct a series of experiments to answer the following four research questions:

• RQ1: How well can our neural architecture learn multi-level features from the input texts? Can the learned features support high-quality API extraction compared with existing machine learning based API extraction methods?

• RQ2: What is the impact of the three feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) on the API extraction performance?

• RQ3: How effective is transfer learning for adapting API extraction models across libraries of the same language with different amounts of target-library training data?

• RQ4: How effective is transfer learning for adapting API extraction models across libraries of different languages with different amounts of target-library training data?

6.1 Experiments Setup

This section describes the libraries used in our experiments, how we prepare the training and testing data, and the evaluation metrics of API extraction performance.

6.1.1 Studied libraries

Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As reported in Section 2, these six libraries support very diverse functionalities for computer programming and have distinct API-naming and API-mention characteristics. Using these libraries, we can evaluate the effectiveness of our neural architecture for API extraction in very diverse data settings. We can also gain insights into the effectiveness of transfer learning for API extraction and the transferability of our neural architecture in different settings.

6.1.2 Dataset

We collect Stack Overflow posts (questions and their answers) tagged with pandas, numpy, matplotlib, opengl, react or jdbc as our experimental data. We use the Stack Overflow data dump released on March 18, 2018. In this data dump, 380,971 posts are tagged with one of the six studied libraries.

TABLE 3: Basic Statistics of Our Dataset

Library     Posts   Sentences   API mentions   Tokens
Matplotlib  600     4920        1481           47317
Numpy       600     2786        1552           39321
Pandas      600     3522        1576           42267
Opengl      600     3486        1674           70757
JDBC        600     4205        1184           50861
React       600     3110        1262           42282
Total       3600    22029       8729           292805

Our data collection process follows four criteria. First, the number of posts selected and the number of API mentions in these posts should be at the same order of magnitude. Second, API mentions should exhibit the variations of API writing forms commonly seen on Stack Overflow. Third, the selection should avoid repeatedly selecting frequently mentioned APIs. Fourth, the same post should not appear in the dataset for different libraries (one post may be tagged with two or more studied-library tags). The first criterion ensures the fairness of comparison across libraries, the second and third criteria ensure the representativeness of API mentions in the dataset, and the fourth criterion ensures that there is no repeated data in different datasets.

We finally include 3,600 posts (600 for each library) in our dataset, and each post has at least one API mention³. Table 3 summarizes the basic statistics of our dataset. These posts have in total 22,029 sentences, 292,805 token occurrences and 41,486 unique tokens after data preprocessing and tokenization. We manually label the API mentions in the selected posts; 6,421 sentences (34.07%) contain at least one API mention. Our dataset has in total 8,729 API mentions. The selected Matplotlib, NumPy, Pandas, OpenGL, JDBC and React posts have 1481, 1552, 1576, 1674, 1184 and 1262 API mentions, which refer to 553, 694, 293, 201, 282 and 83 unique APIs of the six libraries, respectively. In addition, we randomly sample 100 labelled Stack Overflow posts for each library dataset. Then we examine each API mention to determine whether it is a simple name or not, and we find that among the 600 posts there are 1,634 API mentions, of which 694 (42.47%) are simple API names. Our dataset not only contains a large number of API mentions in diverse writing forms, but also contains rich discussion context around API mentions. This makes it suitable for the training of our neural architecture, the testing of its performance, and the study of model transferability.

We randomly split the 600 posts of each library into three subsets by a 6:2:2 ratio: 60% as training data, 20% as validation data for tuning model hyper-parameters, and 20% as testing data to answer our research questions. Since cross-validation would cost a great deal of time and our training dataset is unbiased (proved in Section 6.4), cross-validation is not used. Our experiment results also show that the training dataset is enough to train a high-performance neural network. Besides, the performance of the model on a target library can be improved by adding more training data from other libraries. And if more training data is added, there is no need to label a new validation dataset for the target library to re-tune the hyper-parameters, because we find that the neural model has very similar performance with the same hyper-parameters under different dataset settings.

3. https://drive.google.com/drive/folders/1f7ejNVUsew9l9uPCj4Xv5gMzNbqttpo?usp=sharing


TABLE 4: Comparison of CRF Baselines [2] and Our Neural Model for API Extraction

            Basic CRF               Full CRF                Our method
Library     Prec   Recall  F1       Prec   Recall  F1       Prec   Recall  F1
Matplotlib  67.35  47.84   56.62    89.62  61.31   72.82    81.50  83.92   82.70
NumPy       75.83  39.91   52.29    89.21  49.31   63.51    78.24  62.91   69.74
Pandas      62.06  70.49   66.01    97.11  71.21   82.16    82.73  85.30   82.80
Opengl      43.91  91.87   59.42    94.57  70.62   80.85    85.83  83.74   84.77
JDBC        15.83  82.61   26.58    87.32  51.40   64.71    84.69  55.31   66.92
React       16.74  88.58   28.16    97.42  70.11   81.53    87.95  78.17   82.77


6.1.3 Evaluation metrics

We use precision, recall and F1-score to evaluate the performance of an API extraction method. Precision measures what percentage of the API mentions recognized by a method are correct. Recall measures what percentage of the API mentions in the testing dataset are recognized correctly by a method. F1-score is the harmonic mean of precision and recall, i.e., $2 \times (precision \times recall)/(precision + recall)$.
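For token-level API extraction, these metrics reduce to simple counts over predicted and gold labels, e.g.:

def precision_recall_f1(pred_labels, gold_labels):
    # Token-level metrics; labels are 1 for API and 0 for non-API.
    tp = sum(p == 1 and g == 1 for p, g in zip(pred_labels, gold_labels))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred_labels, gold_labels))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred_labels, gold_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1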

6.2 Performance of Our Neural Model (RQ1)

Motivation. Recently, linear-chain CRF has been used to solve the API extraction problem in informal text [2]. That work shows that machine-learning based API extraction outperforms several commonly-used rule-based methods [2], [42], [43]. The approach in [2] relies on human-defined orthographic features, two different unsupervised language models (class-based Brown clustering and neural-network based word embedding) and API gazetteer features (API inventory). In contrast, our neural architecture uses neural networks to automatically learn character-, word- and sentence-level features from the input texts. The first RQ is to confirm the effectiveness of these neural-network feature extractors for API extraction tasks, by comparing the overall performance of our neural architecture with the performance of the linear-chain CRF with human-defined features for API extraction.

Approach. We use the implementation of the CRF model in [2] to build the two CRF baselines. The basic CRF baseline uses only the orthographic features developed in [2]. The full CRF baseline uses all orthographic, word-cluster and API gazetteer features developed in [2]. The basic CRF baseline is easy to deploy, because it uses only the orthographic features in the input texts, but not any advanced word-cluster and API gazetteer features. The self-training process is not used in these two baselines, since we have sufficient training data. In contrast, the full CRF baseline uses advanced features, so that it is not as easy-to-deploy as the basic CRF; but the full CRF has much better performance than the basic CRF, as reported in [2]. Comparing our model with these two baselines, we can understand whether our model can achieve a good tradeoff between easy-to-deploy and high performance. Note that as Ye et al. [2] already demonstrate the superior performance of machine-learning based API extraction methods over rule-based methods, we do not consider rule-based baselines for API extraction in this study.

Results. From Table 4, we can see that:

• Although the linear CRF with full features has close performance to our model, the linear CRF with only orthographic features performs poorly in distinguishing API tokens from non-API tokens. The best F1-score of the basic CRF is only 0.66, for Pandas. The F1-score of the basic CRF for Matplotlib, Numpy and OpenGL is 0.52 to 0.59. For JDBC and React, the F1-score of the basic CRF is below 0.3. Our results are consistent with the results in [2] when advanced word-cluster and API-gazetteer features are ablated. Compared with the basic CRF, the full CRF performs much better. For Pandas, JDBC and React, the full CRF has very close performance to our model. But for Matplotlib, Numpy and OpenGL, our model still outperforms the full CRF by at least four points in F1-score.

• Multi-level feature embedding by our neural architecture is effective in distinguishing API tokens from non-API tokens in the resulting embedding space. All evaluation metrics of our neural architecture are significantly higher than those of the basic CRF. Although our neural architecture performs relatively worse on NumPy and JDBC, its F1-score is still much higher than the F1-score of the basic CRF on NumPy and JDBC. We examine false positives (non-API tokens recognized as API) and false negatives (API tokens recognized as non-API) by our neural architecture on NumPy and JDBC. We find that many false positives and false negatives involve API mentions composed of complex strings, for example, array expressions for Numpy or SQL statements for JDBC. It seems that neural networks learn some features from such complex strings that may confuse the model and cause the failure to tell apart API tokens from non-API ones. Furthermore, by analysing the classification results for different API mentions, we find that our model has good generalizability on different API functions.

With linear CRF for API extraction, we cannot achieve a good tradeoff between easy-to-deploy and high performance. In contrast, our neural architecture can extract effective character-, word- and sentence-level features from the input texts, and the extracted features alone can support high-quality API extraction without using any hand-crafted advanced features such as word clusters and API gazetteers.

6.3 Impact of Feature Extractors (RQ2)

Motivation. Although our neural architecture achieves very good performance as a whole, we would like to further investigate how much the different feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) contribute to this good performance, and how different features affect the precision and recall of API extraction.


TABLE 5: The Results of Feature Ablation Experiments

            Ablating char CNN       Ablating Word Embeddings   Ablating Bi-LSTM        All features
Library     Prec   Recall  F1       Prec   Recall  F1          Prec   Recall  F1       Prec   Recall  F1
Matplotlib  75.70  72.32   73.98    81.75  63.99   71.79       82.59  69.34   75.44    81.50  83.92   82.70
Numpy       81.00  60.40   69.21    79.31  58.47   67.31       80.62  56.67   66.44    78.24  62.91   69.74
Pandas      80.50  83.28   81.82    81.77  68.01   74.25       83.22  75.22   79.65    82.73  85.30   82.80
Opengl      77.58  85.37   81.29    83.07  68.03   75.04       98.52  72.08   83.05    85.83  83.74   84.77
JDBC        68.65  61.25   64.47    64.22  65.62   64.91       99.28  43.12   66.13    84.69  55.31   66.92
React       69.90  85.71   77.01    84.37  75.00   79.41       98.79  65.08   78.47    87.95  78.17   82.77

Approach. We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without the character-level CNN, one without the word embeddings, and one without the sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors.

Results. In Table 5, we highlight in bold the largest drop in each metric for a library when ablating a feature, compared with that metric with all features. We underline the increase in each metric for a library when ablating a feature, compared with that metric with all features. We can see that:

• The performance of our neural architecture is contributed by the combined action of all its features. Ablating any of the features, the F1-score degrades. Ablating word embeddings or Bi-LSTM causes the largest drop in F1-score for four libraries, while ablating char-CNN causes the largest drop in F1-score for two libraries. Feature ablation has a higher impact on some libraries than others. For example, ablating char-CNN, word embeddings and Bi-LSTM all cause significant drops in F1-score for Matplotlib, and ablating word embeddings causes a significant drop in F1-score for Pandas and React. In contrast, ablating a feature causes a relatively minor drop in F1-score for Numpy, JDBC and React. This indicates that the different levels of features learned by our neural architecture can be more distinct for some libraries, but more complementary for others. When different features are more distinct from one another, achieving good API extraction performance relies more on all the features.

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating char-CNN causes a drop in precision for the five libraries except Numpy. The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating char-CNN, ablating word embeddings and Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for Numpy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating Bi-LSTM, the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. Char-CNN is especially useful for filtering out non-API tokens, while word embeddings and Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation. As described in Section 6.1.1, we intentionally choose three different-functionality Python libraries: Pandas, Matplotlib and NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach. We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring a Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfer (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs, involving Matplotlib, are relatively more-distant-libraries transfer.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL)

        NP→MPL                  PD→MPL                  MPL
Data    Prec   Recall  F1       Prec   Recall  F1       Prec   Recall  F1
1/1     82.64  89.29   85.84    81.02  80.06   80.58    87.67  78.27   82.70
1/2     81.84  84.52   83.16    71.61  83.33   76.96    81.38  70.24   75.40
1/4     71.83  75.89   73.81    67.88  77.98   72.65    81.22  55.36   65.84
1/8     70.56  75.60   72.98    69.66  73.81   71.71    75.00  53.57   62.50
1/16    73.56  72.02   72.78    66.48  72.02   69.16    80.70  27.38   40.89
1/32    72.56  78.83   73.69    71.47  69.35   70.47    97.50  11.60   20.74
DU      72.54  66.07   69.16    76.99  54.76   64.00    -      -       -

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP)

        MPL→NP                  PD→NP                   NP
Data    Prec   Recall  F1       Prec   Recall  F1       Prec   Recall  F1
1/1     86.85  77.08   81.68    77.51  67.50   72.16    78.24  62.92   69.74
1/2     78.80  82.08   80.41    70.13  67.51   68.79    75.13  60.42   66.97
1/4     93.64  76.67   80.00    65.88  70.00   67.81    77.14  45.00   56.84
1/8     73.73  78.33   75.96    65.84  66.67   66.25    71.03  42.92   53.51
1/16    76.19  66.67   71.11    58.33  64.17   61.14    57.07  48.75   52.58
1/32    75.54  57.92   65.56    60.27  56.25   58.23    72.72  23.33   35.33
DU      64.16  65.25   64.71    62.96  59.32   61.08    -      -       -

We use gradually-reduced target-library training data (1/1 for all data, 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison.


TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL→PD                  NP→PD                   PD
Data    Prec   Recall  F1       Prec   Recall  F1       Prec   Recall  F1
1/1     84.97  89.62   87.23    86.86  87.61   87.23    80.43  85.30   82.80
1/2     86.18  84.43   85.30    83.97  83.00   83.48    88.01  74.06   80.43
1/4     81.07  87.61   84.21    81.07  82.70   81.88    76.50  76.95   76.72
1/8     87.54  80.98   84.13    85.76  76.37   80.79    69.30  83.29   75.65
1/16    82.04  78.96   80.47    82.45  75.79   78.98    84.21  50.72   53.31
1/32    81.31  75.22   78.14    81.43  72.05   76.45    84.25  30.84   45.14
DU      71.65  40.06   51.39    75.00  35.45   48.14    -      -       -

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL→PD                  NP→PD
Data    Prec   Recall  F1       Prec   Recall  F1
1/16    83.33  77.81   80.43    83.39  75.22   79.09
1/32    79.50  73.78   76.53    86.15  71.15   78.30

We also use the source-library-trained model directly without any fine-tuning (i.e., with no target-library training data) as a baseline. We also randomly select 1/16 and 1/32 training data for NumPy → Pandas and Matplotlib → Pandas 10 times, and train the model each time. Then we calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score.

Results. Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. The acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas), DU (Direct Use). The last column group is the results of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performance target-library model than training the target-library model from scratch with the same proportion of training data. This is evident as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has F1-score 69.74. However, the NumPy model transferred from the Matplotlib model has F1-score 81.68 (17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 NumPy training data for fine-tuning, the F1-score (71.11) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for a difficult dataset. Such a performance boost by transfer learning has also been observed in many studies [14], [44], [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model, rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has a better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of F1-score (e.g., 3% of the F1-score for the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has F1-score 80.47%.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 training data (i.e., about 10 posts) for transfer learning, the F1-score of the obtained target-library model is still improved a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (DU row). Furthermore, the averaged results of NP → PD and MPL → PD for 10 randomly selected 1/16 and 1/32 training data are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows our training data is unbiased. Although the target-library model trained from scratch still has a reasonable F1-score with 1/2 to 1/8 training data, it becomes completely useless with 1/16 or 1/32 training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-scores in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16), compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce a much more balanced precision and recall (thus a much better F1-score) even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with F1-score > 80% using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences of API naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning is. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 target-library training data, compared with the target-library model trained from scratch using all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 training data, compared with the Matplotlib model trained from scratch using all training data.

• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the least polysemous APIs (16), and Pandas and NumPy have many more polysemous APIs. Both MPL → PD and MPL → NP have high-quality target models even with 1/8 target-library training data. The transfer learning seems to be able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 training data to have a high-quality target model (F1-score 83.16%), and for PD → MPL, the Matplotlib model (F1-score 80.58%) obtained by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.70%). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs" learned from NumPy and Pandas to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation: In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach: For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts in each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model. For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational database). For JavaScript, we choose React (a library for web graphical user interface). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.

Results: Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

• Directly reusing the source-language-trained model on the target-language text produces unacceptable performance. For

TABLE 10 Python → Java

       Python→Java             Java
       Prec  Recall F1         Prec  Recall F1
1/1    77.45 66.56  71.60      84.69 55.31  66.92
1/2    72.20 62.50  67.06      77.38 53.44  63.22
1/4    69.26 50.00  58.08      71.22 47.19  56.77
1/8    50.00 64.06  56.16      75.15 39.69  51.94
1/16   55.71 48.25  52.00      75.69 34.06  46.98
1/32   56.99 40.00  44.83      77.89 23.12  35.66
DU     44.44 28.75  34.91      -     -      -

TABLE 11 Python → JavaScript

       Python→JavaScript       JavaScript
       Prec  Recall F1         Prec  Recall F1
1/1    77.45 81.75  86.19      87.95 78.17  82.77
1/2    86.93 76.59  81.43      87.56 59.84  77.70
1/4    86.84 68.65  74.24      83.08 66.26  73.72
1/8    81.48 61.11  69.84      85.98 55.95  67.88
1/16   71.11 63.49  68.08      87.38 38.48  53.44
1/32   66.67 52.38  58.67      65.21 35.71  46.15
DU     51.63 25.00  33.69      -     -      -

the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16%), PD → MPL (64.00%), MPL → NP (64.71%). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91%), Python → JavaScript (33.69%), Python → C (35.95%). These results suggest that there is still a certain level of commonalities between libraries with similar functionalities and of the same programming language, so that the knowledge learned in the source-library model may be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or we may exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning holds the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 Java training data boosts the F1-score by 50% compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-


TABLE 12 Python → C

       Python→C                C
       Prec  Recall F1         Prec  Recall F1
1/1    89.87 85.09  87.47      85.83 83.74  84.77
1/2    85.35 83.73  84.54      86.28 76.69  81.21
1/4    83.83 75.88  79.65      82.19 71.27  76.34
1/8    74.32 73.71  74.01      78.68 68.02  72.97
1/16   75.81 69.45  72.60      88.52 57.10  69.42
1/32   69.62 65.85  67.79      87.89 45.25  59.75
DU     56.49 27.91  35.95      -     -      -

language training data. For the within-language transfer learning, achieving this result may require as little as 1/8 target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is the API labeling errors in the dataset. In order to decrease errors, the first two authors first independently label the same data and then resolve the disagreements by discussion. Furthermore, in our experiments we have to examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to some API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs such as camelcase, special characters (e.g., . or ()) and API annotations.

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF based approach for extracting mentions of software entities, such as programming languages, libraries, computing concepts and APIs, from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of the API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features perform better than human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the texts to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture in many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, "Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow."
[2] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, "Learning to extract API mentions from informal natural language discussions," in Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 2016, pp. 389–399.
[3] B. Dagenais and M. P. Robillard, "Recovering traceability links between an API and its learning resources," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 47–57.
[4] S. Subramanian, L. Inozemtseva, and R. Holmes, "Live API documentation," in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 643–652.
[5] C. Chen, Z. Xing, and X. Wang, "Unsupervised software-specific morphological forms inference from informal discussions," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 450–461.
[6] X. Chen, C. Chen, D. Zhang, and Z. Xing, "SEthesaurus: WordNet in software engineering," IEEE Transactions on Software Engineering, 2019.
[7] H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, "Improving API caveats accessibility by mining API caveats knowledge graph," in Software Maintenance and Evolution (ICSME), 2018 IEEE International Conference on. IEEE, 2018.
[8] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, "API method recommendation without worrying about the task-API knowledge gap," in Automated Software Engineering (ASE), 2018 33rd IEEE/ACM International Conference on. IEEE, 2018.
[9] B. Xu, Z. Xing, X. Xia, and D. Lo, "AnswerBot: Automated generation of answer summary to developers' technical questions," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 706–716.
[10] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre, "Software-specific named entity recognition in software engineering social content," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 90–101.
[11] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 – Volume 4. Association for Computational Linguistics, 2003, pp. 142–147.
[12] R. Caruana, "Learning many related tasks at the same time with backpropagation," in Advances in Neural Information Processing Systems, 1995, pp. 657–664.
[13] A. Arnold, R. Nallapati, and W. W. Cohen, "Exploiting feature hierarchy for transfer learning in named entity recognition," Proceedings of ACL-08: HLT, pp. 245–253, 2008.
[14] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[15] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, "APIReal: An API recognition and linking approach for online developer forums," Empirical Software Engineering, 2018.
[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78.
[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.
[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744–755.
[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.
[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.
[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions – incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338–348.
[22] C. Chen and Z. Xing, "SimilarTech: Automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834–839.
[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155–1194, 2019.
[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: Distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214–224.
[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177–2185.
[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for Twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555–1565.
[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.
[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: A deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.

[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414–421.

[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1–32, 2019.
[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.
[39] S. J. Pan, Q. Yang, et al., "A survey on transfer learning."
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering – Volume 1. ACM, 2010, pp. 375–384.
[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE'09. 16th Working Conference on. IEEE, 2009, pp. 205–214.
[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717–1724.
[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.
[47] A. Marcus, J. Maletic, et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125–135.
[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59–68.
[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480–483.
[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307–318.
[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392–403.
[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832–841.
[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13–22.
[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476–479.
[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741–2749.
[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211–1220.
[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.
[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.
[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.
[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Faculty of Information Technology at Monash University. He has research interests in the areas of software engineering, deep learning and human-computer interaction. He is currently focusing on improving the usability and accessibility of mobile applications. He received the BS and MS degrees from Beijing Technology and Business University and the Australian National University in 2016 and 2018, respectively. He will be a PhD student at Monash University in 2020 under the supervision of Chunyang Chen.

Zhenchang Xing is a Senior Lecturer in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012-2016. Before joining NTU, Dr. Xing was a Lee Kuan Yew Research Fellow in the School of Computing, National University of Singapore, from 2009-2012. Dr. Xing's current research area is in the interdisciplinary areas of software engineering, human-computer interaction and applied data analytics. Dr. Xing has over 100 publications in peer-refereed journals and conference proceedings and has received several distinguished paper awards from top software engineering conferences. Dr. Xing regularly serves on the organization and program committees of the top software engineering conferences, and he will be the program committee co-chair for ICSME 2020.

Chunyang Chen is a lecturer (Assistant Professor) in the Faculty of Information Technology, Monash University, Australia. His research focuses on software engineering, deep learning and human-computer interaction. He has published over 16 papers in referred journals or conferences, including Empirical Software Engineering, ICSE, ASE, CSCW, ICSME and SANER. He is a member of IEEE and ACM. He received an ACM SIGSOFT distinguished paper award in ASE 2018, a best paper award in SANER 2016, and a best tool demo award in ASE 2016. https://chunyang-chen.github.io


Cheng Chen received his Bachelor degree in Software Engineering from Northwest University (China) and received his Master degree of Computing (major in Artificial Intelligence) from the Australian National University. He did some internship projects about Natural Language Processing at CSIRO from 2016 to 2017. He worked as a research assistant supervised by Dr. Zhenchang Xing at ANU in 2018. Cheng currently works at PricewaterhouseCoopers (PwC) as a senior algorithm engineer of Natural Language Processing and Data Analyzing. Cheng is interested in Named Entity Extraction, Relation Extraction, Text Summarization and parallel computing. He is working on knowledge engineering and transaction for NLP tasks.

Lizhen Qu is a research fellow in the Dialogue Research lab in the Faculty of Information Technology at Monash University. He has extensive research experience in the areas of natural language processing, multimodal learning, deep learning and cybersecurity. He is currently focusing on information extraction, semantic parsing and multimodal dialogue systems. Prior to joining Monash University, he worked as a research scientist at Data61/CSIRO, where he led and participated in several research and industrial projects, including Deep Learning for Text and Deep Learning for Cyber.

Guoqiang Li is now an associate professor in the School of Software, Shanghai Jiao Tong University, and a guest associate professor in Kyushu University. He received the BS, MS and PhD degrees from Taiyuan University of Technology, Shanghai Jiao Tong University, and Japan Advanced Institute of Science and Technology in 2001, 2005 and 2008, respectively. He worked as a postdoctoral research fellow in the graduate school of information science, Nagoya University, Japan, during 2008-2009, as an assistant professor in the School of Software, Shanghai Jiao Tong University, during 2009-2013, and as an academic visitor in the Department of Computer Science, University of Oxford, during 2015-2016. His research interests include formal verification, programming language theory, and data analytics and intelligence. He has published more than 40 research papers in international journals and mainstream conferences, including TDSC, SPE, TECS, IJFCS, SCIS, CSCW, ICSE, FORMATS and ATVA.


Fig. 2. Word Clouds of the Top 50 Frequent Words in the Discussions of the Six Libraries

pandas.DataFrame.apply(). APIs may also be mentioned by their simple names, such as apply, series, dataframe, which can also be common English words or computing terms in the text. Furthermore, we do not assume that API mentions will be consistently annotated with special tags. Therefore, our approach takes plain text as input.

A related task to our work is API linking. API extraction methods classify whether a token in text is an API mention or not, while API linking methods link API mentions in text to API entities in a knowledge base [15]. That is, API extraction is the prerequisite for API linking. This work deals with only API extraction.

4 OUR NEURAL ARCHITECTURE

We formulate the task of API extraction in informal software texts as a sequence labeling problem, and present a neural architecture that labels each token in an input text as API or non-API. As shown in Fig. 3, our neural architecture is composed of four main components: 1) a character-level Convolutional Neural Network (CNN) for extracting character-level features of a token (Section 4.1), 2) an unsupervised word embedding model for learning word semantics of a token (Section 4.2), 3) a Bidirectional Long Short-Term Memory network (Bi-LSTM) for extracting sentence-context features (Section 4.3), and 4) a softmax classifier for predicting the API or non-API label of a token (Section 4.4). Our neural model can be trained end-to-end with pairs of input texts and their corresponding API/non-API label sequences. A model trained with one library's text can be transferred to another library's text by fine-tuning

Fig. 3. Our Neural Architecture for API Extraction

source-library-trained components with the target library's text (Section 4.5).

4.1 Extracting Char-Level Features by CNN

API mentions often have morphological features that distinguish them from normal English words. Such morphological features may appear at the beginning of a token (e.g., the first-letter capitalization in System), in the middle (e.g., the underscore in read_csv, the dot in pandas.series, the left

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 5

Fig. 4. Our Character Vocabulary

Fig. 5. Character-Level CNN

parenthesis and comma in print(a32)), or at the end (e.g., the right parenthesis in apply()). Morphological features may also appear in combination, such as the camelcase writing like AbstractAction, or a pair of parentheses like plot(). The long length of some tokens like createdataset is one important morphological feature as well. Due to the lack of a universal naming convention across libraries and the wide presence of informal writing forms, informative morphological features of API mentions often vary from one library's text to another.

Robust methods to extract morphological features from tokens must take into account all characters of the token and determine which features are more important for a particular library's APIs [16]. To that end, we use a character-level CNN [17], which extracts local features in N-gram characters of the token using a convolution operation, and then combines them using a max-pooling operation to create a fixed-sized character-level embedding of the token [18], [19].

Let V^char be the vocabulary of characters for the software texts from which we want to extract API mentions. In this work, V^char for all of our models consists of 92 characters, including 26 English letters (both upper and lower case), 10 digits, and 30 other characters, as listed in Fig. 4. Note that V^char can be easily extended for different datasets. Let E^char ∈ R^(d^char × |V^char|) be the character embedding matrix, where d^char is the dimension of character embeddings and |V^char| is the vocabulary size (92 in this work). As illustrated in Fig. 3, E^char can be regarded as a dictionary of character embeddings, in which a d^char-dimensional column vector corresponds to a particular character. The character embeddings are initialized as one-hot vectors and then learned during the training of the character-level CNN. Given a character c ∈ V^char, its embedding e_c can be retrieved by the matrix-vector product e_c = E^char v_c, where v_c is a one-hot vector of size |V^char| which has value 1 at index c and zero in all other positions.

Fig. 5 presents the architecture of our character-level CNN. Given a token w in the input text, let's assume w is composed of M characters c_1, c_2, ..., c_M. We first obtain a sequence of character embeddings e_c1, e_c2, ..., e_cM by looking up the character embedding matrix E^char. This sequence of character embeddings (zero-padded at the beginning and the end of the sequence) is the input to our character-level CNN. In our application of CNN, because each character is represented as a d^char-dimensional vector, we use convolution filters with widths equal to the dimensionality of the character embeddings (i.e., d^char). Then we can vary the height h (or window size) of the filter, i.e., the number of adjacent characters considered jointly in the convolution operation.

Let z_m be the concatenation of the character embeddings of c_m (1 ≤ m ≤ M), the (h−1)/2 left neighbors of c_m, and the (h−1)/2 right neighbors of c_m. A convolution operation involves a filter W ∈ R^(h·d^char) (a matrix of h × d^char dimensions) and a bias term b ∈ R^h, which is applied repeatedly to each character window of size h in the input sequence c_1, c_2, ..., c_M:

o_m = ReLU(W^T · z_m + b)

where ReLU(x) = max(0, x) is the non-linear activation function. The convolution operations produce an M-dimensional feature map for a particular filter. A 1D max-pooling operation is then applied to the feature map to extract a scalar (i.e., a feature vector of length 1) with the maximum value in the M dimensions of the feature map.

The convolution operation extracts local features within each character window of the given token, and by taking the max over all character windows of the token, we extract a global feature for the token. We can apply N filters to extract different features from the same character window. The output of the character-level CNN is an N-dimensional feature vector representing the character-level embedding of the given token. We denote this embedding as e^char_w for the token w.

In our character-level CNN, the matrices E^char and W and the vector b are parameters to be learned. The dimensionality of the character embedding d^char, the number of filters N, and the window size of the filters h are hyper-parameters to be chosen by the user (see Section 5.2 for model configuration).
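To make the computation concrete, the following minimal NumPy sketch implements the window convolution and max-pooling described above. The function and variable names are ours, and for simplicity it uses one bias per filter rather than the b ∈ R^h term above; it is an illustration, not the paper's implementation.

```python
import numpy as np

def char_cnn_features(E_char, char_ids, W, b, h=3):
    # E_char:   d_char x |V_char| character embedding matrix
    # char_ids: indices c_1..c_M of the token's characters
    # W:        filter matrix of shape (h * d_char, N) for N filters
    # b:        one bias per filter (length N)
    d_char = E_char.shape[0]
    embs = [E_char[:, c] for c in char_ids]      # look up e_c = E_char . v_c
    pad = [np.zeros(d_char)] * ((h - 1) // 2)    # zero-padding at both ends
    embs = pad + embs + pad
    feature_map = []
    for m in range(len(char_ids)):
        z_m = np.concatenate(embs[m:m + h])      # window of h char embeddings
        feature_map.append(np.maximum(0.0, W.T @ z_m + b))  # ReLU(W^T z_m + b)
    # 1D max-pooling over all M windows: one global feature per filter
    return np.max(np.stack(feature_map), axis=0)

# Example with one-hot initialization: E = np.eye(92), 40 filters of window 3:
# W = np.random.randn(3 * 92, 40); b = np.zeros(40)
# vec = char_cnn_features(np.eye(92), [4, 15, 15, 11], W, b)  # 40-dim vector
```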

4.2 Learning Word Semantics by Word Embedding

In informal software texts, the same API is often mentioned in many non-standard abbreviations and synonyms [5]. For example, the Pandas library is often written as pd, and its module DataFrame is often written as df. Furthermore, there is a lack of consistent use of verb, noun and preposition in the discussions [2]. For example, in the sentences "I have decided to use apply ...", "if you run apply on a series ..." and "I tested with apply ...", users refer to Pandas's method apply(), but their descriptions vary greatly.

Such variations result in an out-of-vocabulary (OOV) issue for a machine learning model [20], i.e., variations that have not been seen in the training data. For the easy deployment of a machine learning model for API extraction, it is impractical to address the OOV issue by manually labeling a huge amount of data and developing a comprehensive gazetteer of API names and their common variations [2]. However, without the knowledge about variations of semantically

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 6

similar words, the trained model will be restricted to the examples that it sees in the training data.

To address this dilemma, we propose to exploit unsupervised word-embedding models [21], [22], [23], [24] to learn distributed word representations from a large amount of unlabeled text discussing a particular library. Many studies [25], [26], [27] have shown that distributed word representations can capture rich semantics of words, such that semantically similar words will have similar word embeddings.

Let V^word be the vocabulary of words for the corpus of software texts to be processed. As illustrated in Fig. 3, word-level embeddings are encoded by column vectors in a word embedding matrix E^word ∈ R^(d^word × |V^word|), where d^word is the dimension of word embeddings and |V^word| is the vocabulary size. Each column E^word_i ∈ R^(d^word) corresponds to the word-level embedding of the i-th word in the vocabulary V^word. We can obtain a token w's word-level embedding e^word_w by looking up E^word with the word w, i.e., e^word_w = E^word v_w, where v_w is a one-hot vector of size |V^word| which has value 1 at index w and zero in all other positions. The matrix E^word is to be learned using unsupervised word embedding models (e.g., GloVe [28]), and d^word is a hyper-parameter to be chosen by the user (see Section 5.2 for model configuration).

In this work, we adopt the Global Vectors for Word Representation (GloVe) method [28] to learn the matrix E^word. GloVe is an unsupervised algorithm for learning word representations based on the statistics of word co-occurrences in an unlabeled text corpus. It calculates the word embeddings based on a word co-occurrence matrix X. Each row in the word co-occurrence matrix corresponds to a word, and each column corresponds to a context. X_ij is the frequency of word i co-occurring with word j, and X_i = Σ_k X_ik (1 ≤ k ≤ |V^word|) is the total number of occurrences of word i in the corpus. The probability of word j occurring in the context of word i is P_ij = P(j|i) = X_ij / X_i. We have log(P_ij) = log(X_ij) − log(X_i).

GloVe defines log(P_ij) = e^T_wi e_wj, where e_wi and e_wj are the word embeddings to be learned for the words w_i and w_j. This gives the constraint for each word pair as log(X_ij) = e^T_wi e_wj + b_i + b_j, where b is the bias term for e_w. The cost function for minimizing the loss of word embeddings is defined as

J = Σ_{i,j=1}^{|V^word|} f(X_ij) (e^T_wi e_wj + b_i + b_j − log(X_ij))^2

where f(X_ij) is a weighting function. That is, GloVe learns word embeddings by a weighted least squares regression model.
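For illustration, the sketch below evaluates this weighted least-squares objective directly. It collapses GloVe's separate word and context vectors into a single embedding matrix for brevity, and assumes the weighting function f from the GloVe paper with its default x_max = 100 and α = 0.75 (values this paper does not state).

```python
import numpy as np

def glove_loss(X, E, b, x_max=100.0, alpha=0.75):
    # X: |V| x |V| word co-occurrence matrix
    # E: |V| x d embedding matrix (row i is e_wi); b: |V| bias vector
    # x_max, alpha: parameters of the weighting function f (GloVe defaults)
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):          # only observed co-occurrences
        f = min((X[i, j] / x_max) ** alpha, 1.0)
        err = E[i] @ E[j] + b[i] + b[j] - np.log(X[i, j])
        loss += f * err ** 2                  # f(X_ij) * (squared residual)
    return loss
```

In practice GloVe minimizes this objective with stochastic gradient updates over the non-zero entries of X rather than computing the full sum as done here.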

4.3 Extracting Sentence-Context Features by Bi-LSTM

In informal software texts, many API mentions cannot be reliably recognized by simply examining a token's character-level features and word semantics. This is because many APIs are named using common English words (e.g., series, apply, plot) or common computing terms (e.g., dataframe, sigmoid, histogram, argmax, zip, list). When such APIs are mentioned by their simple name, this results in a common-word polysemy issue for API extraction [2]. In such situations, we have to disambiguate the API sense of a common word from the normal sense of the word.

To that end, we have to look into the sentence context in which an API is mentioned. For example, by looking into the sentence context of the two sentences in Fig. 1, we can determine that the "apply" in the first sentence is an API mention, while the "apply" in the second sentence is not an API mention. Note that both the preceding and succeeding context of the token "apply" are useful for disambiguating the API or the normal sense of the token.

We use a Recurrent Neural Network (RNN) to extract sentence-context features for disambiguating the API or the normal sense of the word [29]. RNN is a class of neural networks where connections between units form directed cycles, and it is widely used in the software engineering domain [30], [31], [32], [33]. Due to this nature, it is especially useful for tasks involving sequential inputs [29] like sentences. In our task, we adopt a Bidirectional RNN (Bi-RNN) architecture [34], [35], which is composed of two LSTMs: one takes input from the beginning of the text forward till a particular token, while the other takes input from the end of the text backward till that token.

The input to an RNN is a sequence of vectors. In our task, we obtain the input vector of a token in the input text by concatenating the character-level embedding and the word-level embedding of the token w, i.e., e^char_w ⊕ e^word_w. An RNN recursively maps an input vector x_t and a hidden state h_{t−1} to a new hidden state h_t: h_t = f(h_{t−1}, x_t), where f is a non-linear activation function (e.g., an LSTM unit used in this work). A hidden state is a vector e^sent ∈ R^(d^sent) summarizing the sentence-level features till the input x_t, where d^sent is the dimension of the hidden state vector, to be chosen by the user. We denote e^f_sent and e^b_sent as the hidden states computed by the forward LSTM and the backward LSTM after reading the end of the preceding and the succeeding sentence context of a token, respectively. e^f_sent and e^b_sent are concatenated into one vector e^sent_w as the Bi-LSTM output for the corresponding token w in the input text.

As an input text (e.g., a Stack Overflow post) can be a long text, modeling long-range dependencies in the text is crucial for our task. For example, a mention of a library name at the beginning of a post could be important for detecting a mention of a method of this library later in the post. Therefore, we adopt the LSTM unit [36], [37] in our RNN. The LSTM is designed to cope with the gradient vanishing problem in RNN. An LSTM unit consists of a memory cell and three gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, some contexts can be cleared by the forget gate. The memory cell and the three gates have weights and bias terms to be learned during model training. Bi-LSTM can extract such context features. For instance, in the Pandas library sentence "This can be accomplished quite simply with the DataFrame method apply", based on the context information from the word method, the Bi-LSTM can help our model classify the word apply as an API mention, and the learning can be transferred across different languages and libraries with transfer learning.

4.4 API Labeling by Softmax Classification

Given the Bi-LSTM's output vector e^sent_w for a token w in the input text, we train a binary softmax classifier to predict whether the token is an API mention or a normal English word. That is, we have two classes in this work, i.e., API or non-API. Softmax predicts the probability for the j-th class given a token's vector e^sent_w by

P(j | e^sent_w) = exp(e^sent_w W^T_j) / Σ_{k=1}^{2} exp(e^sent_w W^T_k)

where the vectors W_k (k = 1 or 2) are parameters to be learned.

4.5 Library Adaptation by Transfer Learning

Transfer learning [12], [38] is a machine learning method that stores knowledge gained while solving one problem and applies it to a different but related problem. It helps to reduce the amount of training data required for the target problem. When transfer learning is used in deep learning, it has been shown to be beneficial for narrowing down the scope of possible models on a task by using the weights of a trained model on a different but related task [14], and by sharing parameters, transfer learning can help the model adapt the shared knowledge in similar contexts [39].

API extraction tasks for different libraries can be regarded as a set of different but related tasks. Therefore, we use transfer learning to adapt a neural model trained with one library's text (the source-library-trained model) for another library's text (the target library). Our neural architecture is composed of four main components: character-level CNN, word embeddings, sentence Bi-LSTM, and softmax classifier. We can transfer all or some source-library-trained model components to a target-library model. Without transfer learning, the parameters of a target-library model component will be randomly initialized and then learned using the target-library training data, i.e., trained from scratch. With transfer learning, we use the parameters of source-library-trained model components to initialize the parameters of the corresponding target-library model components. After transferring the model parameters, we can either freeze the transferred parameters or fine-tune the transferred parameters using the target-library training data.
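As an illustration, the Keras-style sketch below shows the parameter transfer and the freeze-or-fine-tune choice. The model file name, the build_api_extraction_model() helper, the layer-name prefix, and the target_inputs/target_labels tensors are all hypothetical, not artifacts of the paper's implementation.

```python
from tensorflow import keras

# Hypothetical source model trained on Matplotlib text (architecture of Fig. 3)
source_model = keras.models.load_model("matplotlib_api_extractor.h5")
target_model = build_api_extraction_model()      # same architecture, random init

# Initialize the target-library model with source-library-trained parameters
target_model.set_weights(source_model.get_weights())

# Either freeze some transferred components (here, the character-level CNN) ...
for layer in target_model.layers:
    if layer.name.startswith("char_cnn"):
        layer.trainable = False

# ... or fine-tune the transferred parameters with target-library training data
target_model.compile(optimizer="adam", loss="categorical_crossentropy")
target_model.fit(target_inputs, target_labels, epochs=40)  # placeholder data
```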

5 SYSTEM IMPLEMENTATION

This section describes the current implementation² of our neural architecture for API extraction.

5.1 Preprocessing and Tokenizing Input Text

Our current implementation takes as input the content of a Stack Overflow post. As we want to recognize API mentions within natural language sentences, we remove stand-alone code snippets in <pre><code> tags. We then remove all HTML tags in the post content to obtain the plain text

² https://github.com/JOJO201/API_Extraction

input (see Section 3 for the justification of taking plain text as input). We develop a sentence parser to split the preprocessed post text into sentences by punctuation. Following [2], we develop a software-specific tokenizer to tokenize the sentences. This tokenizer preserves the integrity of code-like tokens. For example, it treats matplotlib.pyplot.imshow() as a single token, instead of the sequence of 7 tokens "matplotlib", ".", "pyplot", ".", "imshow", "(", ")" produced by general English tokenizers.
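A minimal sketch of this preprocessing and tokenization pipeline is shown below. The regular expression is our illustrative approximation of such a software-specific tokenizer, not the exact rules used in [2] or in our implementation.

```python
import re
from bs4 import BeautifulSoup

# Keep code-like tokens (e.g., matplotlib.pyplot.imshow(), read_csv) intact;
# anything else falls back to single words or punctuation marks.
CODE_TOKEN = re.compile(r"[A-Za-z_]\w*(?:\.\w+)*(?:\(\))?|\S")

def preprocess_post(html):
    soup = BeautifulSoup(html, "html.parser")
    for snippet in soup.find_all("pre"):       # remove stand-alone code snippets
        snippet.decompose()
    return soup.get_text()                     # strip all remaining HTML tags

def tokenize(sentence):
    return CODE_TOKEN.findall(sentence)

# tokenize("You can use matplotlib.pyplot.imshow() here.")
#   -> ['You', 'can', 'use', 'matplotlib.pyplot.imshow()', 'here', '.']
```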

5.2 Model Configuration

We now describe the hyper-parameter settings used in our current implementation. These hyper-parameters are tuned using the validation data during model training (see Section 6.1.2 for the description of our dataset). We find that the neural model has very similar performance across the six libraries' datasets with the same hyper-parameters. Therefore, we keep the same hyper-parameters for the six libraries' datasets, which also avoids the difficulty of scaling different hyper-parameters.

5.2.1 Character-level CNN

We set the filter window size h = 3. That is, the convolution operation extracts local features from 3 adjacent characters at a time. The size of our current character vocabulary V^char is 92. Thus, we set the dimensionality of the character embedding d^char at 92. We initialize the character embeddings with one-hot vectors (1 at one character index and zero in all other dimensions). The character embeddings will be updated through back propagation during the training of the character-level CNN. We experiment with 5 different values of N (the number of filters): 20, 40, 60, 80, 100. With N = 40, the CNN has an acceptable performance on the validation data. With N = 60 and above, the CNN has almost the same performance as N = 40, but it takes more training epochs to reach the acceptable performance. Therefore, we use N = 40. That is, the character-level embedding of a token has the dimension 40.

5.2.2 Pre-trained word embeddings

Our experiments involve six libraries: Pandas, Matplotlib, NumPy, OpenGL, React and JDBC. We collect all questions tagged with these six libraries and all answers to these questions in the Stack Overflow Data Dump released on March 18, 2018. We obtain a text corpus of 380,971 posts. We use the same preprocessing and tokenization steps as described in Section 5.1 to preprocess and tokenize the content of these posts. Then we use GloVe [28] to learn word embeddings from this text corpus. We set the word vocabulary size |V^word| at 40,000. The number of training epochs is set at 100 to ensure sufficient training of the word embeddings. We experiment with four different dimensions of word embeddings d^word: 50, 100, 200, 400. We use d^word = 200 in our current implementation, as it produces a good balance of the training efficiency and the quality of word embeddings for API extraction on the validation data. We also experiment with word embeddings pre-trained on all Stack Overflow posts, which does not significantly affect the API extraction performance but requires much longer time for text preprocessing and word embeddings learning.


5.2.3 Sentence-context Bi-LSTM

For the RNN, we use 50 hidden LSTM units to store the hidden states. The dimension of the hidden state vector is 50. Therefore, the dimension of the output vector e^sent_w is 100 (concatenating the forward and backward hidden states). In order to mitigate overfitting [40], a dropout layer is added on the output of the Bi-LSTM, with the dropout rate 0.5.
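Putting the configured pieces together, a Keras sketch of the full architecture might look as follows. The sentence and word padding lengths and the pre-built glove_matrix are our assumptions for illustration, not values from the paper.

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_SENT_LEN, MAX_WORD_LEN = 50, 30            # assumed padding lengths
V_CHAR, D_CHAR, N_FILTERS, H = 92, 92, 40, 3   # Section 5.2.1
V_WORD, D_WORD = 40000, 200                    # Section 5.2.2
glove_matrix = np.zeros((V_WORD, D_WORD))      # placeholder; GloVe vectors in practice

# Character-level CNN applied to each token's character sequence
char_in = layers.Input(shape=(MAX_SENT_LEN, MAX_WORD_LEN))
c = layers.TimeDistributed(layers.Embedding(V_CHAR, D_CHAR))(char_in)
c = layers.TimeDistributed(
    layers.Conv1D(N_FILTERS, H, padding="same", activation="relu"))(c)
c = layers.TimeDistributed(layers.GlobalMaxPooling1D())(c)

# Pre-trained GloVe word embeddings
word_in = layers.Input(shape=(MAX_SENT_LEN,))
w = layers.Embedding(V_WORD, D_WORD, weights=[glove_matrix])(word_in)

# Sentence-context Bi-LSTM (50 hidden units) with dropout 0.5
x = layers.Concatenate()([c, w])
x = layers.Bidirectional(layers.LSTM(50, return_sequences=True))(x)
x = layers.Dropout(0.5)(x)

# Per-token binary softmax: API vs. non-API (Section 4.4)
out = layers.TimeDistributed(layers.Dense(2, activation="softmax"))(x)
model = models.Model([char_in, word_in], out)
```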

5.3 Model Training

To train our neural model, we use input texts and their corresponding sequences of API/non-API labels (see Section 6.1.2 for our data labeling process). The optimizer used is Adam [41], which performs well for training RNNs including Bi-LSTM [35]. The number of training epochs is set at 40. The best-performance model on the validation set is saved for testing. This model's parameters are also saved for the transfer learning experiments.
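Continuing the sketch above, training with Adam while keeping the best model on the validation set could be written as follows; the tensor names are placeholders for the prepared training and validation data.

```python
from tensorflow import keras

model.compile(optimizer=keras.optimizers.Adam(),
              loss="categorical_crossentropy")
best = keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_loss", save_best_only=True)
model.fit([X_char, X_word], y,                        # placeholder tensors
          validation_data=([Xv_char, Xv_word], yv),
          epochs=40, callbacks=[best])
```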

6 EXPERIMENTS

We conduct a series of experiments to answer the following four research questions:

• RQ1: How well can our neural architecture learn multi-level features from the input texts? Can the learned features support high-quality API extraction compared with existing machine learning based API extraction methods?

• RQ2: What is the impact of the three feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) on the API extraction performance?

• RQ3: How effective is transfer learning for adapting API extraction models across libraries of the same language with different amounts of target-library training data?

• RQ4: How effective is transfer learning for adapting API extraction models across libraries of different languages with different amounts of target-library training data?

6.1 Experiments Setup

This section describes the libraries used in our experiments, how we prepare training and testing data, and the evaluation metrics of API extraction performance.

6.1.1 Studied libraries

Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As reported in Section 2, these six libraries support very diverse functionalities for computer programming and have distinct API-naming and API-mention characteristics. Using these libraries, we can evaluate the effectiveness of our neural architecture for API extraction in very diverse data settings. We can also gain insights into the effectiveness of transfer learning for API extraction and the transferability of our neural architecture in different settings.

6.1.2 Dataset

We collect Stack Overflow posts (questions and their answers) tagged with pandas, numpy, matplotlib, opengl, react or jdbc as our experimental data. We use the Stack Overflow Data Dump released on March 18, 2018. In this data dump,

TABLE 3 Basic Statistics of Our Dataset

Library     Posts  Sentences  API mentions  Tokens
Matplotlib  600    4920       1481          47317
NumPy       600    2786       1552          39321
Pandas      600    3522       1576          42267
OpenGL      600    3486       1674          70757
JDBC        600    4205       1184          50861
React       600    3110       1262          42282
Total       3600   22029      8729          292805

380,971 posts are tagged with one of the six studied libraries. Our data collection process follows four criteria. First, the number of posts selected and the number of API mentions in these posts should be at the same order of magnitude. Second, API mentions should exhibit the variations of API writing forms commonly seen on Stack Overflow. Third, the selection should avoid repeatedly selecting frequently mentioned APIs. Fourth, the same post should not appear in the dataset for different libraries (one post may be tagged with two or more studied-library tags). The first criterion ensures the fairness of comparison across libraries, the second and third criteria ensure the representativeness of API mentions in the dataset, and the fourth criterion ensures that there is no repeated data in different datasets.

We finally include 3,600 posts (600 for each library) in our dataset, and each post has at least one API mention3. Table 3 summarizes the basic statistics of our dataset. These posts have in total 22,029 sentences, 292,805 token occurrences and 41,486 unique tokens after data preprocessing and tokenization. We manually label the API mentions in the selected posts; 6,421 sentences (34.07%) contain at least one API mention. Our dataset has in total 8,729 API mentions. The selected Matplotlib, NumPy, Pandas, OpenGL, JDBC and React posts have 1481, 1552, 1576, 1674, 1184 and 1262 API mentions, which refer to 553, 694, 293, 201, 282 and 83 unique APIs of the six libraries, respectively. In addition, we randomly sample 100 labelled Stack Overflow posts for each library dataset. Then we examine each API mention to determine whether it is a simple name or not, and we find that among the 600 sampled posts there are 1,634 API mentions, of which 694 (42.47%) are simple API names. Our dataset not only contains a large number of API mentions in diverse writing forms, but also contains rich discussion context around API mentions. This makes it suitable for the training of our neural architecture, the testing of its performance, and the study of model transferability.

We randomly split the 600 posts of each library into three subsets by a 6:2:2 ratio: 60% as training data, 20% as validation data for tuning model hyper-parameters, and 20% as testing data to answer our research questions. Since cross-validation would cost a great deal of time and our training dataset is unbiased (as shown in Section 6.4), cross-validation is not used. Our experiment results also show that the training dataset is enough to train a high-performance neural network. Besides, the performance of the model on a target library can be improved by adding more training data from other libraries. And if more training data is added, there is no need to label a new validation dataset for the target library to re-tune the hyper-parameters, because we find that the neural model has very similar performance with the same hyper-parameters under different dataset settings.
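A minimal sketch of this 6:2:2 split; `posts` stands in for the 600 labeled posts of one library (the placeholder data and the seed are illustrative assumptions):

```python
import random

posts = [f"post_{i}" for i in range(600)]  # placeholder for labeled posts
random.seed(42)                            # seed is an illustrative assumption
random.shuffle(posts)

n = len(posts)
train = posts[:int(0.6 * n)]               # 60% training data
val = posts[int(0.6 * n):int(0.8 * n)]     # 20% validation (hyper-parameter tuning)
test = posts[int(0.8 * n):]                # 20% testing data
```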

3. https://drive.google.com/drive/folders/1f7ejNVUsew9l9uPCj4Xv5gMzNbqttpo?usp=sharing


TABLE 4: Comparison of CRF Baselines [2] and Our Neural Model for API Extraction

             Basic CRF                Full CRF                 Our method
Library      Prec    Recall  F1       Prec    Recall  F1       Prec    Recall  F1
Matplotlib   67.35   47.84   56.62    89.62   61.31   72.82    81.50   83.92   82.70
NumPy        75.83   39.91   52.29    89.21   49.31   63.51    78.24   62.91   69.74
Pandas       62.06   70.49   66.01    97.11   71.21   82.16    82.73   85.30   82.80
OpenGL       43.91   91.87   59.42    94.57   70.62   80.85    85.83   83.74   84.77
JDBC         15.83   82.61   26.58    87.32   51.40   64.71    84.69   55.31   66.92
React        16.74   88.58   28.16    97.42   70.11   81.53    87.95   78.17   82.77


6.1.3 Evaluation metrics

We use precision, recall and F1-score to evaluate the performance of an API extraction method. Precision measures what percentage of the API mentions recognized by a method are correct. Recall measures what percentage of the API mentions in the testing dataset are recognized correctly by a method. F1-score is the harmonic mean of precision and recall, i.e., 2 * ((precision * recall) / (precision + recall)).
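These metrics can be computed from token-level counts of true positives (tp), false positives (fp) and false negatives (fn) for the API class, as in the following sketch:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and their harmonic mean (F1) for API extraction."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g., precision_recall_f1(150, 30, 50) -> (0.833..., 0.75, 0.789...)
```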

6.2 Performance of Our Neural Model (RQ1)

Motivation: Recently, linear-chain CRF has been used to solve the API extraction problem in informal text [2]. They show that machine-learning based API extraction outperforms several commonly-used rule-based methods [2], [42], [43]. The approach in [2] relies on human-defined orthographic features, two different unsupervised language models (class-based Brown clustering and neural-network based word embedding) and API gazetteer features (API inventory). In contrast, our neural architecture uses neural networks to automatically learn character-, word- and sentence-level features from the input texts. The first RQ is to confirm the effectiveness of these neural-network feature extractors for API extraction tasks by comparing the overall performance of our neural architecture with the performance of the linear-chain CRF with human-defined features for API extraction.

Approach: We use the implementation of the CRF model in [2] to build the two CRF baselines. The basic CRF baseline uses only the orthographic features developed in [2]. The full CRF baseline uses all orthographic, word-cluster and API gazetteer features developed in [2]. The basic CRF baseline is easy to deploy because it uses only the orthographic features in the input texts, but not any advanced word-cluster and API gazetteer features. And the self-training process is not used in these two baselines, since we have sufficient training data. In contrast, the full CRF baseline uses advanced features, so it is not as easy-to-deploy as the basic CRF. But the full CRF has much better performance than the basic CRF, as reported in [2]. Comparing our model with these two baselines, we can understand whether our model can achieve a good tradeoff between easy-to-deploy and high performance. Note that, as Ye et al. [2] already demonstrate the superior performance of machine-learning based API extraction methods over rule-based methods, we do not consider rule-based baselines for API extraction in this study.

Results: From Table 4, we can see that:

• Although linear CRF with full features has close performance to our model, linear CRF with only orthographic features performs poorly in distinguishing API tokens from non-API tokens. The best F1-score of the basic CRF is only 0.66, for Pandas. The F1-score of the basic CRF for Matplotlib, NumPy and OpenGL is 0.52-0.59. For JDBC and React, the F1-score of the basic CRF is below 0.3. Our results are consistent with the results in [2] when advanced word-cluster and API-gazetteer features are ablated. Compared with the basic CRF, the full CRF performs much better. For Pandas, JDBC and React, the full CRF has very close performance to our model. But for Matplotlib, NumPy and OpenGL, our model still outperforms the full CRF by at least four points in F1-score.

• Multi-level feature embedding by our neural architecture is effective in distinguishing API tokens from non-API tokens in the resulting embedding space. All evaluation metrics of our neural architecture are significantly higher than those of the basic CRF. Although our neural architecture performs relatively worse on NumPy and JDBC, its F1-score is still much higher than the F1-score of the basic CRF on NumPy and JDBC. We examine false positives (non-API tokens recognized as API) and false negatives (API tokens recognized as non-API) by our neural architecture on NumPy and JDBC. We find that many false positives and false negatives involve API mentions composed of complex strings, for example, array expressions for NumPy or SQL statements for JDBC. It seems that neural networks learn some features from such complex strings that may confuse the model and cause the failure to tell apart API tokens from non-API ones. Furthermore, by analysing the classification results for different API mentions, we find that our model has good generalizability on different API functions.

With linear CRF for API extraction, we cannot achieve a good tradeoff between easy-to-deploy and high performance. In contrast, our neural architecture can extract effective character-, word- and sentence-level features from the input texts, and the extracted features alone can support high-quality API extraction without using any hand-crafted advanced features, such as word clusters and API gazetteers.

6.3 Impact of Feature Extractors (RQ2)

Motivation: Although our neural architecture achieves very good performance as a whole, we would like to further investigate how much the different feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) contribute to this good performance, and how different features affect the precision and recall of API extraction.


TABLE 5: The Results of Feature Ablation Experiments

             Ablating char-CNN        Ablating Word Embeddings   Ablating Bi-LSTM         All features
Library      Prec    Recall  F1       Prec    Recall  F1         Prec    Recall  F1       Prec    Recall  F1
Matplotlib   75.70   72.32   73.98    81.75   63.99   71.79      82.59   69.34   75.44    81.50   83.92   82.70
NumPy        81.00   60.40   69.21    79.31   58.47   67.31      80.62   56.67   66.44    78.24   62.91   69.74
Pandas       80.50   83.28   81.82    81.77   68.01   74.25      83.22   75.22   79.65    82.73   85.30   82.80
OpenGL       77.58   85.37   81.29    83.07   68.03   75.04      98.52   72.08   83.05    85.83   83.74   84.77
JDBC         68.65   61.25   64.47    64.22   65.62   64.91      99.28   43.12   66.13    84.69   55.31   66.92
React        69.90   85.71   77.01    84.37   75.00   79.41      98.79   65.08   78.47    87.95   78.17   82.77

Approach: We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without the character-level CNN, one without word embeddings, and one without the sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors.

Results: In Table 5, we highlight in bold the largest drop in each metric for a library when ablating a feature, compared with that metric with all features. We underline the increase in each metric for a library when ablating a feature, compared with that metric with all features. We can see that:

• The performance of our neural architecture is contributed by the combined action of all its features. Ablating any of the features, the F1-score degrades. Ablating word embeddings or Bi-LSTM causes the largest drop in F1-score for four libraries, while ablating char-CNN causes the largest drop in F1-score for two libraries. Feature ablation has a higher impact on some libraries than others. For example, ablating char-CNN, word embeddings and Bi-LSTM all cause a significant drop in F1-score for Matplotlib, and ablating word embeddings causes a significant drop in F1-score for Pandas and React. In contrast, ablating a feature causes a relatively minor drop in F1-score for NumPy, JDBC and React. This indicates that the different levels of features learned by our neural architecture can be more distinct for some libraries but more complementary for others. When different features are more distinct from one another, achieving good API extraction performance relies more on all the features.

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating char-CNN causes a drop in precision for five libraries (all except NumPy). The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating char-CNN, ablating word embeddings and Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for NumPy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating Bi-LSTM, the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. Char-CNN is especially useful for filtering out non-API tokens, while word embeddings and Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation: As described in Section 6.1.1, we intentionally choose three different-functionality Python libraries: Pandas, Matplotlib and NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach: We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring a Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfer (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs, involving Matplotlib, are relatively more-distant-libraries transfer.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL)

        NP→MPL                   PD→MPL                   MPL
        Prec    Recall  F1       Prec    Recall  F1       Prec    Recall  F1
1/1     82.64   89.29   85.84    81.02   80.06   80.58    87.67   78.27   82.70
1/2     81.84   84.52   83.16    71.61   83.33   76.96    81.38   70.24   75.40
1/4     71.83   75.89   73.81    67.88   77.98   72.65    81.22   55.36   65.84
1/8     70.56   75.60   72.98    69.66   73.81   71.71    75.00   53.57   62.50
1/16    73.56   72.02   72.78    66.48   72.02   69.16    80.70   27.38   40.89
1/32    72.56   78.83   73.69    71.47   69.35   70.47    97.50   11.60   20.74
DU      72.54   66.07   69.16    76.99   54.76   64.00    -       -       -

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP)

        MPL→NP                   PD→NP                    NP
        Prec    Recall  F1       Prec    Recall  F1       Prec    Recall  F1
1/1     86.85   77.08   81.68    77.51   67.50   72.16    78.24   62.92   69.74
1/2     78.80   82.08   80.41    70.13   67.51   68.79    75.13   60.42   66.97
1/4     93.64   76.67   80.00    65.88   70.00   67.81    77.14   45.00   56.84
1/8     73.73   78.33   75.96    65.84   66.67   66.25    71.03   42.92   53.51
1/16    76.19   66.67   71.11    58.33   64.17   61.14    57.07   48.75   52.58
1/32    75.54   57.92   65.56    60.27   56.25   58.23    72.72   23.33   35.33
DU      64.16   65.25   64.71    62.96   59.32   61.08    -       -       -

We use gradually-reduced target-library training data (1/1 for all data, 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison. We also use the source-library-trained model directly, without any fine-tuning (i.e., 0/1 target-library data), as a baseline. We also randomly select 1/16 and 1/32 training data for NumPy → Pandas and Matplotlib → Pandas 10 times and train the model each time. Then we calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score.
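A minimal sketch of this fine-tuning procedure is shown below; `build_model`, `train` and `subsample` are hypothetical helpers standing in for the model construction and training code of Section 5.

```python
import torch

def transfer(source_ckpt, target_train_data, fraction):
    """Fine-tune a source-library-trained model on a fraction of target data."""
    model = build_model()
    model.load_state_dict(torch.load(source_ckpt))   # reuse source knowledge
    subset = subsample(target_train_data, fraction)  # e.g., fraction = 1/16
    train(model, subset)                             # same setup as Section 5.3
    return model

# Baselines for comparison:
#   - "from scratch": build_model() with random initialization, trained
#     on the same subset;
#   - "DU" (direct use): the source model evaluated on target text
#     without any fine-tuning.
```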


TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL→PD                   NP→PD                    PD
        Prec    Recall  F1       Prec    Recall  F1       Prec    Recall  F1
1/1     84.97   89.62   87.23    86.86   87.61   87.23    80.43   85.30   82.80
1/2     86.18   84.43   85.30    83.97   83.00   83.48    88.01   74.06   80.43
1/4     81.07   87.61   84.21    81.07   82.70   81.88    76.50   76.95   76.72
1/8     87.54   80.98   84.13    85.76   76.37   80.79    69.30   83.29   75.65
1/16    82.04   78.96   80.47    82.45   75.79   78.98    84.21   50.72   53.31
1/32    81.31   75.22   78.14    81.43   72.05   76.45    84.25   30.84   45.14
DU      71.65   40.06   51.39    75.00   35.45   48.14    -       -       -

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL→PD                   NP→PD
        Prec    Recall  F1       Prec    Recall  F1
1/16    83.33   77.81   80.43    83.39   75.22   79.09
1/32    79.50   73.78   76.53    86.15   71.15   78.30

Results: Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. The acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas) and DU (Direct Use). The last column of each table reports the results of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performance target-library model than training the target-library model from scratch with the same proportion of training data. This is evident, as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has F1-score 69.74%. However, the NumPy model transferred from the Matplotlib model has F1-score 81.68% (a 17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 NumPy training data for fine-tuning, the F1-score (71.11%) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for a difficult dataset. Such a performance boost by transfer learning has also been observed in many studies [14], [44], [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model, rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has a better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of F1-score (e.g., 3% of the F1-score of the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has F1-score 80.47%.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 training data, i.e., about 10 posts, for transfer learning, the F1-score of the obtained target-library model is still improved a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (DU row). Furthermore, the averaged results of NP → PD and MPL → PD for 10 randomly selected 1/16 and 1/32 training data are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows our training data is unbiased. Although the target-library model trained from scratch still has a reasonable F1-score with 1/2 to 1/8 training data, it becomes completely useless with 1/16 or 1/32 training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-scores in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16), compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce much more balanced precision and recall (thus a much better F1-score), even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with F1-score > 80% using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences of API-naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning can be. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 target-library training data, compared with the target-library model trained from scratch using all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 training data, compared with the Matplotlib model trained from scratch using all training data.

• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the least polysemous APIs (16), and Pandas and NumPy have much more polysemous APIs. Both MPL → PD and MPL → NP have high-quality target models even with 1/8 target-library training data. The transfer learning seems to be able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 training data to have a high-quality target model (F1-score 83.16%), and for PD → MPL, the Matplotlib model (F1-score 80.58%) obtained by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.7%). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs" learned from NumPy and Pandas to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation: In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach: For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts in each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model. For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational databases). For JavaScript, we choose React (a library for web graphical user interfaces). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.

Results: Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

TABLE 10: Python → Java

        Python→Java              Java
        Prec    Recall  F1       Prec    Recall  F1
1/1     77.45   66.56   71.60    84.69   55.31   66.92
1/2     72.20   62.50   67.06    77.38   53.44   63.22
1/4     69.26   50.00   58.08    71.22   47.19   56.77
1/8     50.00   64.06   56.16    75.15   39.69   51.94
1/16    55.71   48.25   52.00    75.69   34.06   46.98
1/32    56.99   40.00   44.83    77.89   23.12   35.66
DU      44.44   28.75   34.91    -       -       -

TABLE 11: Python → JavaScript

        Python→JavaScript        JavaScript
        Prec    Recall  F1       Prec    Recall  F1
1/1     77.45   81.75   86.19    87.95   78.17   82.77
1/2     86.93   76.59   81.43    87.56   59.84   77.70
1/4     86.84   68.65   74.24    83.08   66.26   73.72
1/8     81.48   61.11   69.84    85.98   55.95   67.88
1/16    71.11   63.49   68.08    87.38   38.48   53.44
1/32    66.67   52.38   58.67    65.21   35.71   46.15
DU      51.63   25.00   33.69    -       -       -

• Directly reusing the source-language-trained model on the target-language text produces unacceptable performance. For the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16%), PD → MPL (64.00%), MPL → NP (64.71%). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91%), Python → JavaScript (33.69%), Python → C (35.95%). These results suggest that there are still certain levels of commonalities between libraries with similar functionalities and of the same programming language, so that the knowledge learned in the source-library model may be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or we may exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning holds the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 Java training data boosts the F1-score by 50%, compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-language training data.


TABLE 12: Python → C

        Python→C                 C
        Prec    Recall  F1       Prec    Recall  F1
1/1     89.87   85.09   87.47    85.83   83.74   84.77
1/2     85.35   83.73   84.54    86.28   76.69   81.21
1/4     83.83   75.88   79.65    82.19   71.27   76.34
1/8     74.32   73.71   74.01    78.68   68.02   72.97
1/16    75.81   69.45   72.60    88.52   57.10   69.42
1/32    69.62   65.85   67.79    87.89   45.25   59.75
DU      56.49   27.91   35.95    -       -       -

For the within-language transfer learning, achieving this result may require as little as 1/8 target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is the API labeling errors in the dataset. In order to decrease errors, the first two authors first independently labeled the same data and then resolved the disagreements by discussion. Furthermore, in our experiments, we have to examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare the datasets, conduct the experiments and analyze the results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and to informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs, such as camelcase, special characters (e.g., '.' or '()') and API annotations.

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF based approach for extracting mentions of software entities such as programming languages, libraries, computing concepts and APIs from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features achieve superior performance over human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture in many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, "Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow."
[2] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, "Learning to extract API mentions from informal natural language discussions," in Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 2016, pp. 389–399.
[3] B. Dagenais and M. P. Robillard, "Recovering traceability links between an API and its learning resources," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 47–57.
[4] S. Subramanian, L. Inozemtseva, and R. Holmes, "Live API documentation," in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 643–652.
[5] C. Chen, Z. Xing, and X. Wang, "Unsupervised software-specific morphological forms inference from informal discussions," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 450–461.
[6] X. Chen, C. Chen, D. Zhang, and Z. Xing, "SEthesaurus: WordNet in software engineering," IEEE Transactions on Software Engineering, 2019.
[7] H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, "Improving API caveats accessibility by mining API caveats knowledge graph," in Software Maintenance and Evolution (ICSME), 2018 IEEE International Conference on. IEEE, 2018.
[8] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, "API method recommendation without worrying about the task-API knowledge gap," in Automated Software Engineering (ASE), 2018 33rd IEEE/ACM International Conference on. IEEE, 2018.
[9] B. Xu, Z. Xing, X. Xia, and D. Lo, "AnswerBot: Automated generation of answer summary to developers' technical questions," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 706–716.
[10] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre, "Software-specific named entity recognition in software engineering social content," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 90–101.
[11] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 – Volume 4. Association for Computational Linguistics, 2003, pp. 142–147.
[12] R. Caruana, "Learning many related tasks at the same time with backpropagation," in Advances in Neural Information Processing Systems, 1995, pp. 657–664.
[13] A. Arnold, R. Nallapati, and W. W. Cohen, "Exploiting feature hierarchy for transfer learning in named entity recognition," Proceedings of ACL-08: HLT, pp. 245–253, 2008.
[14] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[15] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, "APIReal: An API recognition and linking approach for online developer forums," Empirical Software Engineering, 2018.
[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78.
[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.
[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744–755.
[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.
[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.
[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions – incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338–348.
[22] C. Chen and Z. Xing, "SimilarTech: Automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834–839.
[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155–1194, 2019.
[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: Distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214–224.
[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177–2185.
[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for Twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555–1565.
[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.
[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: A deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.
[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414–421.
[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1–32, 2019.
[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.
[39] S. J. Pan, Q. Yang et al., "A survey on transfer learning."
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering – Volume 1. ACM, 2010, pp. 375–384.
[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE'09. 16th Working Conference on. IEEE, 2009, pp. 205–214.
[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717–1724.
[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.
[47] A. Marcus, J. Maletic et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125–135.
[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59–68.
[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480–483.
[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307–318.
[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392–403.
[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832–841.
[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13–22.
[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476–479.
[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741–2749.
[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211–1220.
[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.
[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.
[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.
[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Faculty of Information Technology at Monash University. He has research interests in the areas of software engineering, deep learning and human-computer interaction. He is currently focusing on improving the usability and accessibility of mobile applications. He received the BS and MS degrees from Beijing Technology and Business University and the Australian National University in 2016 and 2018, respectively. He will be a PhD student at Monash University in 2020, under the supervision of Chunyang Chen.

Zhenchang Xing is a Senior Lecturer in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012-2016. Before joining NTU, Dr. Xing was a Lee Kuan Yew Research Fellow in the School of Computing, National University of Singapore, from 2009-2012. Dr. Xing's current research area is in the interdisciplinary areas of software engineering, human-computer interaction and applied data analytics. Dr. Xing has over 100 publications in peer-refereed journals and conference proceedings and has received several distinguished paper awards from top software engineering conferences. Dr. Xing regularly serves on the organization and program committees of the top software engineering conferences, and he will be the program committee co-chair for ICSME 2020.

Chunyang Chen is a lecturer (Assistant Professor) in the Faculty of Information Technology, Monash University, Australia. His research focuses on software engineering, deep learning and human-computer interaction. He has published over 16 papers in refereed journals or conferences, including Empirical Software Engineering, ICSE, ASE, CSCW, ICSME and SANER. He is a member of IEEE and ACM. He received the ACM SIGSOFT distinguished paper award in ASE 2018, the best paper award in SANER 2016, and the best tool demo in ASE 2016. https://chunyang-chen.github.io


Cheng Chen received his Bachelor degree in Software Engineering from Northwest University (China) and his Master degree of Computing (major in Artificial Intelligence) from the Australian National University. He did internship projects on Natural Language Processing at CSIRO from 2016 to 2017. He worked as a research assistant supervised by Dr. Zhenchang Xing at ANU in 2018. Cheng currently works at PricewaterhouseCoopers (PwC) as a senior algorithm engineer in Natural Language Processing and Data Analysis. Cheng is interested in Named Entity Extraction, Relation Extraction, Text Summarization and parallel computing. He is working on knowledge engineering and transaction for NLP tasks.

Lizhen Qu is a research fellow in the Dialogue Research Lab in the Faculty of Information Technology at Monash University. He has extensive research experience in the areas of natural language processing, multimodal learning, deep learning and cybersecurity. He is currently focusing on information extraction, semantic parsing and multimodal dialogue systems. Prior to joining Monash University, he worked as a research scientist at Data61/CSIRO, where he led and participated in several research and industrial projects, including Deep Learning for Text and Deep Learning for Cyber.

Guoqiang Li is an associate professor in the School of Software, Shanghai Jiao Tong University, and a guest associate professor at Kyushu University. He received the BS, MS and PhD degrees from Taiyuan University of Technology, Shanghai Jiao Tong University and the Japan Advanced Institute of Science and Technology in 2001, 2005 and 2008, respectively. He worked as a postdoctoral research fellow in the Graduate School of Information Science, Nagoya University, Japan, during 2008-2009, as an assistant professor in the School of Software, Shanghai Jiao Tong University, during 2009-2013, and as an academic visitor in the Department of Computer Science, University of Oxford, during 2015-2016. His research interests include formal verification, programming language theory, and data analytics and intelligence. He has published more than 40 research papers in international journals and mainstream conferences, including TDSC, SPE, TECS, IJFCS, SCIS, CSCW, ICSE, FORMATS and ATVA.



Fig. 4: Our Character Vocabulary

Fig. 5: Character-Level CNN

parenthesis and comma in print(a32)) or at the end (e.g., the right parenthesis in apply()). Morphological features may also appear in combination, such as camelcase writing like AbstractAction, or a pair of parentheses like plot(). The long length of some tokens, like createdataset, is one important morphological feature as well. Due to the lack of a universal naming convention across libraries and the wide presence of informal writing forms, informative morphological features of API mentions often vary from one library's text to another.

Robust methods to extract morphological features from tokens must take into account all characters of the token and determine which features are more important for a particular library's APIs [16]. To that end, we use a character-level CNN [17], which extracts local features in N-gram characters of the token using a convolution operation, and then combines them using a max-pooling operation to create a fixed-sized character-level embedding of the token [18], [19].

Let V^char be the vocabulary of characters for the software texts from which we want to extract API mentions. In this work, V^char for all of our models consists of 92 characters, including 26 English letters (both upper and lower case), 10 digits, and 30 other characters (e.g., '['), as listed in Fig. 4. Note that V^char can be easily extended for different datasets. Let E^char ∈ R^{d^char × |V^char|} be the character embedding matrix, where d^char is the dimension of character embeddings and |V^char| is the vocabulary size (92 in this work). As illustrated in Fig. 3, E^char can be regarded as a dictionary of character embeddings, in which a column (a d^char-dimensional vector) corresponds to a particular character. The character embeddings are initialized as one-hot vectors and then learned during the training of the character-level CNN. Given a character c ∈ V^char, its embedding e_c can be retrieved by the matrix-vector product e_c = E^char · v_c, where v_c is a one-hot vector of size |V^char| which has value 1 at index c and zero in all other positions.

Fig. 5 presents the architecture of our character-level CNN. Given a token w in the input text, let's assume w is composed of M characters c_1, c_2, ..., c_M. We first obtain a sequence of character embeddings e_{c_1}, e_{c_2}, ..., e_{c_M} by looking up the character embedding matrix E^char. This sequence of character embeddings (zero-padded at the beginning and the end of the sequence) is the input to our character-level CNN. In our application of CNN, because each character is represented as a d^char-dimensional vector, we use convolution filters with widths equal to the dimensionality of the character embeddings (i.e., d^char). Then we can vary the height h (or window size) of the filter, i.e., the number of adjacent characters considered jointly in the convolution operation.

Let z_m be the concatenation of the character embeddings of c_m (1 ≤ m ≤ M), of the (h−1)/2 left neighbors of c_m, and of the (h−1)/2 right neighbors of c_m. A convolution operation involves a filter W ∈ R^{h·d^char} (a matrix of h × d^char dimensions) and a bias term b ∈ R^h, which is applied repeatedly to each character window of size h in the input sequence c_1, c_2, ..., c_M:

    o_m = ReLU(W^T · z_m + b)

where ReLU(x) = max(0, x) is the non-linear activation function. The convolution operations produce an M-dimensional feature map for a particular filter. A 1D max-pooling operation is then applied to the feature map to extract a scalar (i.e., a feature vector of length 1) with the maximum value in the M dimensions of the feature map.

The convolution operation extracts local features within each character window of the given token, and using the max over all character windows of the token, we extract a global feature for the token. We can apply N filters to extract different features from the same character window. The output of the character-level CNN is an N-dimensional feature vector representing the character-level embedding of the given token. We denote this embedding as e_w^char for the token w.

In our character-level CNN, the matrix E^char, the matrix W, and the vector b are parameters to be learned. The dimensionality of the character embedding d^char, the number of filters N, and the window size of the filters h are hyper-parameters to be chosen by the user (see Section 5.2 for model configuration).
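The following PyTorch sketch mirrors this description; the concrete values d_char = 50, N = 100 and h = 3 are illustrative assumptions, since the actual values are hyper-parameters chosen in Section 5.2.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size=92, d_char=50, n_filters=100, h=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_char)  # E^char, learned
        # N filters of width d_char slide over h adjacent characters;
        # zero-padding at both ends keeps M windows per token.
        self.conv = nn.Conv1d(d_char, n_filters, kernel_size=h,
                              padding=(h - 1) // 2)

    def forward(self, char_ids):          # (batch, M) character indices
        x = self.embed(char_ids)          # (batch, M, d_char)
        x = x.transpose(1, 2)             # (batch, d_char, M)
        o = torch.relu(self.conv(x))      # (batch, n_filters, M) feature maps
        # 1D max-pooling over all M character windows -> one global
        # feature per filter, giving the N-dimensional embedding e_w^char.
        return o.max(dim=2).values        # (batch, n_filters)
```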

4.2 Learning Word Semantics by Word Embedding

In informal software texts, the same API is often mentioned in many non-standard abbreviations and synonyms [5]. For example, the Pandas library is often written as pd, and its module DataFrame is often written as df. Furthermore, there is a lack of consistent use of verb, noun and preposition in the discussions [2]. For example, in the sentences "I have decided to use apply ...", "if you run apply on a series ..." and "I tested with apply ...", users refer to the Pandas method apply(), but their descriptions vary greatly.

Such variations result in an out-of-vocabulary (OOV) issue for a machine learning model [20], i.e., variations that have not been seen in the training data. For the easy deployment of a machine learning model for API extraction, it is impractical to address the OOV issue by manually labeling a huge amount of data and developing a comprehensive gazetteer of API names and their common variations [2]. However, without the knowledge about variations of semantically similar words, the trained model will be restricted to the examples that it sees in the training data.

To address this dilemma, we propose to exploit unsupervised word-embedding models [21], [22], [23], [24] to learn distributed word representations from a large amount of unlabeled text discussing a particular library. Many studies [25], [26], [27] have shown that distributed word representations can capture rich semantics of words, such that semantically similar words will have similar word embeddings.

Let V^word be the vocabulary of words for the corpus of software texts to be processed. As illustrated in Fig. 3, word-level embeddings are encoded by column vectors in a word embedding matrix E^word ∈ R^{d^word × |V^word|}, where d^word is the dimension of word embeddings and |V^word| is the vocabulary size. Each column E_i^word ∈ R^{d^word} corresponds to the word-level embedding of the i-th word in the vocabulary V^word. We can obtain a token w's word-level embedding e_w^word by looking up E^word with the word w, i.e., e_w^word = E^word · v_w, where v_w is a one-hot vector of size |V^word| which has value 1 at index w and zero in all other positions. The matrix E^word is to be learned using unsupervised word embedding models (e.g., GloVe [28]), and d^word is a hyper-parameter to be chosen by the user (see Section 5.2 for model configuration).

In this work, we adopt the Global Vectors for Word Representation (GloVe) method [28] to learn the matrix E^word. GloVe is an unsupervised algorithm for learning word representations based on the statistics of word co-occurrences in an unlabeled text corpus. It calculates the word embeddings based on a word co-occurrence matrix X. Each row in the word co-occurrence matrix corresponds to a word and each column corresponds to a context. X_ij is the frequency of word i co-occurring with word j, and X_i = Σ_k X_ik (1 ≤ k ≤ |V^word|) is the total number of occurrences of word i in the corpus. The probability of word j occurring in the context of word i is P_ij = P(j|i) = X_ij / X_i. We have log(P_ij) = log(X_ij) − log(X_i).

GloVe defines log(P_ij) = e^T_{w_i} e_{w_j}, where e_{w_i} and e_{w_j} are the word embeddings to be learned for the words w_i and w_j. This gives the constraint for each word pair as log(X_ij) = e^T_{w_i} e_{w_j} + b_i + b_j, where b_i and b_j are the bias terms for e_{w_i} and e_{w_j}. The cost function for minimizing the loss of word embeddings is defined as

J = Σ_{i,j=1}^{|V^word|} f(X_ij) (e^T_{w_i} e_{w_j} + b_i + b_j − log(X_ij))²

where f(X_ij) is a weighting function. That is, GloVe learns word embeddings by a weighted least squares regression model.
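For reference, the weighting function f in the GloVe paper [28] down-weights rare co-occurrences and caps the influence of very frequent ones. Below is a sketch of one term of the cost J; the parameter values x_max = 100 and α = 0.75 are GloVe's published defaults, not choices stated in this paper.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # GloVe's weighting function f(X_ij): rises for rare pairs,
    # saturates at 1 for pairs more frequent than x_max.
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(e_wi, e_wj, b_i, b_j, X_ij):
    # One summand of the weighted least-squares cost J.
    residual = e_wi @ e_wj + b_i + b_j - np.log(X_ij)
    return glove_weight(X_ij) * residual ** 2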

4.3 Extracting Sentence-Context Features by Bi-LSTM

In informal software texts, many API mentions cannot be reliably recognized by simply examining a token's character-level features and word semantics. This is because many APIs are named using common English words (e.g., series, apply, plot) or common computing terms (e.g., dataframe, sigmoid, histogram, argmax, zip, list). When such APIs are mentioned by their simple names, this results in a common-word polysemy issue for API extraction [2]. In such situations, we have to disambiguate the API sense of a common word from the normal sense of the word.

To that end, we have to look into the sentence context in which an API is mentioned. For example, by looking into the sentence context of the two sentences in Fig. 1, we can determine that the "apply" in the first sentence is an API mention, while the "apply" in the second sentence is not an API mention. Note that both the preceding and the succeeding context of the token "apply" are useful for disambiguating the API sense or the normal sense of the token.

We use a Recurrent Neural Network (RNN) to extract sentence-context features for disambiguating the API sense or the normal sense of a word [29]. An RNN is a class of neural networks where connections between units form directed cycles, and it is widely used in the software engineering domain [30] [31] [32] [33]. Due to this nature, it is especially useful for tasks involving sequential inputs [29], like sentences. In our task, we adopt a Bidirectional RNN (Bi-RNN) architecture [34] [35], which is composed of two LSTMs: one takes input from the beginning of the text forward till a particular token, while the other takes input from the end of the text backward till that token.

The input to an RNN is a sequence of vectors. In our task, we obtain the input vector of a token in the input text by concatenating the character-level embedding and the word-level embedding of the token w, i.e., e^char_w ⊕ e^word_w. An RNN recursively maps an input vector x_t and a hidden state h_{t-1} to a new hidden state h_t: h_t = f(h_{t-1}, x_t), where f is a non-linear activation function (e.g., an LSTM unit used in this work). A hidden state is a vector e^sent ∈ R^{d^sent} summarizing the sentence-level features till the input x_t, where d^sent is the dimension of the hidden state vector, to be chosen by the user. We denote e^sent_f and e^sent_b as the hidden states computed by the forward LSTM and the backward LSTM after reading the end of the preceding and the succeeding sentence context of a token, respectively. e^sent_f and e^sent_b are concatenated into one vector e^sent_w as the Bi-LSTM output for the corresponding token w in the input text.
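A minimal PyTorch sketch of this token-representation pipeline (concatenated character/word embeddings fed to a Bi-LSTM); the dimensions follow Section 5.2, everything else is our illustrative assumption:

```python
import torch
import torch.nn as nn

class SentenceBiLSTM(nn.Module):
    def __init__(self, d_char=40, d_word=200, d_sent=50):
        super().__init__()
        # batch_first: input is (batch, seq_len, features)
        self.bilstm = nn.LSTM(d_char + d_word, d_sent,
                              bidirectional=True, batch_first=True)

    def forward(self, e_char, e_word):    # (batch, T, d_char), (batch, T, d_word)
        x = torch.cat([e_char, e_word], dim=-1)  # e^char_w ⊕ e^word_w per token
        e_sent, _ = self.bilstm(x)         # (batch, T, 2 * d_sent)
        return e_sent                      # forward/backward states concatenated
```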

As an input text (e.g., a Stack Overflow post) can be a long text, modeling long-range dependencies in the text is crucial for our task. For example, a mention of a library name at the beginning of a post could be important for detecting a mention of a method of this library later in the post. Therefore, we adopt the LSTM unit [36] [37] in our RNN. The LSTM is designed to cope with the gradient vanishing problem in RNNs. An LSTM unit consists of a memory cell and three gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, some contexts can be cleared by the forget gate. The memory cell and the three gates have weights and bias terms to be learned during model training. The Bi-LSTM can extract such context features. For instance, in the Pandas library sentence "This can be accomplished quite simply with the DataFrame method apply", based on the context information from the word method, the Bi-LSTM can help our model classify the word apply as an API mention, and the learned knowledge can be transferred across different languages and libraries with transfer learning.

4.4 API Labeling by Softmax Classification

Given the Bi-LSTM's output vector e^sent_w for a token w in the input text, we train a binary softmax classifier to predict whether the token is an API mention or a normal English word. That is, we have two classes in this work, i.e., API or non-API. Softmax predicts the probability for the j-th class given a token's vector e^sent_w by

P(j | e^sent_w) = exp(e^sent_w W^T_j) / Σ_{k=1}^{2} exp(e^sent_w W^T_k)

where the vectors W_k (k = 1 or 2) are parameters to be learned.

4.5 Library Adaptation by Transfer Learning

Transfer learning [12] [38] is a machine learning method that stores knowledge gained while solving one problem and applies it to a different but related problem. It helps to reduce the amount of training data required for the target problem. When transfer learning is used in deep learning, it has been shown to be beneficial for narrowing down the scope of possible models on a task by reusing the weights of a model trained on a different but related task [14], and with shared parameters, transfer learning can help the model adapt the shared knowledge to a similar context [39].

API extraction tasks for different libraries can be regarded as a set of different but related tasks. Therefore, we use transfer learning to adapt a neural model trained on one library's text (source-library-trained model) to another library's text (target library). Our neural architecture is composed of four main components: character-level CNN, word embeddings, sentence Bi-LSTM and softmax classifier. We can transfer all or some source-library-trained model components to a target-library model. Without transfer learning, the parameters of a target-library model component will be randomly initialized and then learned using the target-library training data, i.e., trained from scratch. With transfer learning, we use the parameters of source-library-trained model components to initialize the parameters of the corresponding target-library model components. After transferring the model parameters, we can either freeze the transferred parameters or fine-tune them using the target-library training data.
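Mechanically, the transfer step amounts to copying parameters and optionally freezing them. A hedged PyTorch sketch; build_model(), the checkpoint file name and the char_cnn attribute are hypothetical placeholders:

```python
import torch

target_model = build_model()                  # same architecture as the source model
source_state = torch.load("pandas_model.pt")  # hypothetical source-library checkpoint
target_model.load_state_dict(source_state)    # initialize from the source library

# Optionally freeze a transferred component (here the char-CNN, as an example);
# frozen parameters receive no gradient updates during fine-tuning.
for p in target_model.char_cnn.parameters():
    p.requires_grad = False
```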

5 SYSTEM IMPLEMENTATION

This section describes the current implementation² of our neural architecture for API extraction.

5.1 Preprocessing and Tokenizing Input Text

Our current implementation takes as input the content of a Stack Overflow post. As we want to recognize API mentions within natural language sentences, we remove stand-alone code snippets in <pre><code> tags. We then remove all HTML tags in the post content to obtain the plain-text input (see Section 3 for the justification of taking plain text as input). We develop a sentence parser to split the preprocessed post text into sentences by punctuation. Following [2], we develop a software-specific tokenizer to tokenize the sentences. This tokenizer preserves the integrity of code-like tokens. For example, it treats matplotlib.pyplot.imshow() as a single token, instead of the sequence of 7 tokens, i.e., "matplotlib", ".", "pyplot", ".", "imshow", "(", ")", produced by general English tokenizers.

2. https://github.com/JOJO201/API_Extraction
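A minimal sketch of such a code-aware tokenizer; the regular expression below is our own illustration, not the tokenizer of [2]. It first captures code-like tokens as indivisible units, then falls back to plain words and punctuation.

```python
import re

# Code-like tokens: dotted paths with optional call parentheses,
# e.g. matplotlib.pyplot.imshow() or pandas.DataFrame
CODE_LIKE = r"[A-Za-z_][\w.]*\(\)|[A-Za-z_]\w*(?:\.\w+)+"

def tokenize(sentence):
    pattern = re.compile(CODE_LIKE + r"|\w+|[^\w\s]")
    return pattern.findall(sentence)

print(tokenize("use matplotlib.pyplot.imshow() to show it"))
# ['use', 'matplotlib.pyplot.imshow()', 'to', 'show', 'it']
```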

5.2 Model Configuration

We now describe the hyper-parameter settings used in our current implementation. These hyper-parameters are tuned using the validation data during model training (see Section 6.1.2 for the description of our dataset). We find that the neural model has very similar performance across the six libraries' datasets with the same hyper-parameters. Therefore, we keep the same hyper-parameters for the six libraries' datasets, which also avoids the difficulty of tuning different hyper-parameters per library.

5.2.1 Character-level CNN

We set the filter window size h = 3. That is, the convolution operation extracts local features from 3 adjacent characters at a time. The size of our current character vocabulary V^char is 92. Thus, we set the dimensionality of the character embedding d^char at 92. We initialize each character embedding with a one-hot vector (1 at the character's index and zero in all other dimensions). The character embeddings will be updated through back-propagation during the training of the character-level CNN. We experiment with 5 different values of N (the number of filters): 20, 40, 60, 80, 100. With N = 40, the CNN has an acceptable performance on the validation data. With N = 60 and above, the CNN has almost the same performance as N = 40, but it takes more training epochs to reach the acceptable performance. Therefore, we use N = 40. That is, the character-level embedding of a token has dimension 40.

5.2.2 Pre-trained word embeddings

Our experiments involve six libraries: Pandas, Matplotlib, NumPy, OpenGL, React and JDBC. We collect all questions tagged with these six libraries and all answers to these questions in the Stack Overflow data dump released on March 18, 2018. We obtain a text corpus of 380,971 posts. We use the same preprocessing and tokenization steps as described in Section 5.1 to preprocess and tokenize the content of these posts. Then we use GloVe [28] to learn word embeddings from this text corpus. We set the word vocabulary size |V^word| at 40,000. The training epoch is set at 100 to ensure sufficient training of the word embeddings. We experiment with four different dimensions of word embeddings d^word: 50, 100, 200, 400. We use d^word = 200 in our current implementation, as it produces a good balance between the training efficiency and the quality of the word embeddings for API extraction on the validation data. We also experiment with word embeddings pre-trained on all Stack Overflow posts, which does not significantly affect the API extraction performance but requires much longer time for text preprocessing and word embedding learning.


5.2.3 Sentence-context Bi-LSTM

For the RNN, we use 50 hidden LSTM units to store the hidden states. The dimension of the hidden state vector is 50. Therefore, the dimension of the output vector e^sent_w is 100 (concatenating the forward and backward hidden states). In order to mitigate overfitting [40], a dropout layer is added on the output of the Bi-LSTM with a dropout rate of 0.5.

5.3 Model Training

To train our neural model, we use the input text and its corresponding sequence of API/non-API labels (see Section 6.1.2 for our data labeling process). The optimizer used is Adam [41], which performs well for training RNNs, including Bi-LSTMs [35]. The training epoch is set at 40. The best-performing model on the validation set is saved for testing. This model's parameters are also saved for the transfer learning experiments.
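A skeletal training loop under these settings; model and train_loader are hypothetical placeholders, and cross-entropy over the two classes is our assumption for the softmax classifier's training loss:

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters())   # Adam, as in the text
loss_fn = nn.CrossEntropyLoss()

for epoch in range(40):                            # training epoch set at 40
    for char_ids, word_ids, labels in train_loader:
        optimizer.zero_grad()
        logits = model(char_ids, word_ids)         # (batch, T, 2) per-token scores
        loss = loss_fn(logits.view(-1, 2), labels.view(-1))
        loss.backward()
        optimizer.step()
    # save the best model on the validation set for testing (omitted)
```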

6 EXPERIMENTS

We conduct a series of experiments to answer the following four research questions:

• RQ1: How well can our neural architecture learn multi-level features from the input texts? Can the learned features support high-quality API extraction compared with existing machine learning based API extraction methods?

• RQ2: What is the impact of the three feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) on the API extraction performance?

• RQ3: How effective is transfer learning for adapting API extraction models across libraries of the same language with different amounts of target-library training data?

• RQ4: How effective is transfer learning for adapting API extraction models across libraries of different languages with different amounts of target-library training data?

6.1 Experiments Setup

This section describes the libraries used in our experiments, how we prepare the training and testing data, and the evaluation metrics of API extraction performance.

6.1.1 Studied libraries

Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As reported in Section 2, these six libraries support very diverse functionalities for computer programming and have distinct API-naming and API-mention characteristics. Using these libraries, we can evaluate the effectiveness of our neural architecture for API extraction in very diverse data settings. We can also gain insights into the effectiveness of transfer learning for API extraction and the transferability of our neural architecture in different settings.

6.1.2 Dataset

We collect Stack Overflow posts (questions and their answers) tagged with pandas, numpy, matplotlib, opengl, react or jdbc as our experimental data. We use the Stack Overflow data dump released on March 18, 2018.

TABLE 3: Basic Statistics of Our Dataset

Library     Posts  Sentences  API mentions  Tokens
Matplotlib  600    4920       1481          47317
NumPy       600    2786       1552          39321
Pandas      600    3522       1576          42267
OpenGL      600    3486       1674          70757
JDBC        600    4205       1184          50861
React       600    3110       1262          42282
Total       3600   22029      8729          292805

In this data dump, 380,971 posts are tagged with one of the six studied libraries. Our data collection process follows four criteria. First, the number of posts selected and the number of API mentions in these posts should be of the same order of magnitude. Second, API mentions should exhibit the variations of API writing forms commonly seen on Stack Overflow. Third, the selection should avoid repeatedly selecting frequently mentioned APIs. Fourth, the same post should not appear in the datasets of different libraries (one post may be tagged with two or more studied-library tags). The first criterion ensures the fairness of comparison across libraries, the second and third criteria ensure the representativeness of API mentions in the dataset, and the fourth criterion ensures that there is no repeated data across the different datasets.

We finally include 3,600 posts (600 for each library) in our dataset, and each post has at least one API mention³. Table 3 summarizes the basic statistics of our dataset. These posts have in total 22,029 sentences, 292,805 token occurrences and 41,486 unique tokens after data preprocessing and tokenization. We manually label the API mentions in the selected posts: 6,421 sentences (34.07%) contain at least one API mention. Our dataset has in total 8,729 API mentions. The selected Matplotlib, NumPy, Pandas, OpenGL, JDBC and React posts have 1481, 1552, 1576, 1674, 1184 and 1262 API mentions, which refer to 553, 694, 293, 201, 282 and 83 unique APIs of the six libraries, respectively. In addition, we randomly sample 100 labelled Stack Overflow posts for each library's dataset. We then examine each API mention to determine whether it is a simple name or not, and we find that among these 600 posts there are 1,634 API mentions, of which 694 (42.47%) are simple API names. Our dataset not only contains a large number of API mentions in diverse writing forms, but also contains rich discussion context around API mentions. This makes it suitable for the training of our neural architecture, the testing of its performance and the study of model transferability.

We randomly split the 600 posts of each library into three subsets by a 6:2:2 ratio: 60% as training data, 20% as validation data for tuning model hyper-parameters, and 20% as testing data to answer our research questions. Since cross-validation would cost a great deal of time and our training dataset is unbiased (shown in Section 6.4), cross-validation is not used. Our experiment results also show that the training dataset is sufficient to train a high-performance neural network. Besides, the performance of the model on a target library can be improved by adding more training data from other libraries. And if more training data is added, there is no need to label a new validation dataset for the target library to re-tune the hyper-parameters, because we find that the neural model has very similar performance with the same hyper-parameters under different dataset settings.

3. https://drive.google.com/drive/folders/1f7ejNVUsew9l9uPCj4Xv5gMzNbqttpo?usp=sharing


TABLE 4: Comparison of CRF Baselines [2] and Our Neural Model for API Extraction

            Basic CRF               Full CRF                Our method
Library     Prec   Recall  F1      Prec   Recall  F1      Prec   Recall  F1
Matplotlib  67.35  47.84   56.62   89.62  61.31   72.82   81.50  83.92   82.70
NumPy       75.83  39.91   52.29   89.21  49.31   63.51   78.24  62.91   69.74
Pandas      62.06  70.49   66.01   97.11  71.21   82.16   82.73  85.30   82.80
OpenGL      43.91  91.87   59.42   94.57  70.62   80.85   85.83  83.74   84.77
JDBC        15.83  82.61   26.58   87.32  51.40   64.71   84.69  55.31   66.92
React       16.74  88.58   28.16   97.42  70.11   81.53   87.95  78.17   82.77


6.1.3 Evaluation metrics

We use precision, recall and F1-score to evaluate the performance of an API extraction method. Precision measures what percentage of the API mentions recognized by a method are correct. Recall measures what percentage of the API mentions in the testing dataset are recognized correctly by a method. F1-score is the harmonic mean of precision and recall, i.e., 2 * ((precision * recall) / (precision + recall)).
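For completeness, the three metrics as a small helper over token-level counts (a sketch; the function and variable names are ours):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives
    and false negatives of the API class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```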

6.2 Performance of Our Neural Model (RQ1)

Motivation: Recently, linear-chain CRF has been used to solve the API extraction problem in informal texts [2]. It has been shown that machine learning based API extraction outperforms several commonly-used rule-based methods [2] [42] [43]. The approach in [2] relies on human-defined orthographic features, two different unsupervised language models (class-based Brown clustering and neural-network based word embedding) and API gazetteer features (API inventory). In contrast, our neural architecture uses neural networks to automatically learn character-, word- and sentence-level features from the input texts. The first RQ is to confirm the effectiveness of these neural-network feature extractors for API extraction tasks by comparing the overall performance of our neural architecture with the performance of the linear-chain CRF with human-defined features for API extraction.

Approach: We use the implementation of the CRF model in [2] to build the two CRF baselines. The basic CRF baseline uses only the orthographic features developed in [2]. The full CRF baseline uses all orthographic, word-cluster and API gazetteer features developed in [2]. The basic CRF baseline is easy to deploy because it uses only the orthographic features in the input texts, but not any advanced word-cluster or API gazetteer features. The self-training process is not used in these two baselines, since we have sufficient training data. In contrast, the full CRF baseline uses advanced features, so it is not as easy to deploy as the basic CRF. But the full CRF has much better performance than the basic CRF, as reported in [2]. Comparing our model with these two baselines, we can understand whether our model can achieve a good tradeoff between ease of deployment and high performance. Note that as Ye et al. [2] already demonstrate the superior performance of machine learning based API extraction methods over rule-based methods, we do not consider rule-based baselines for API extraction in this study.

Results: From Table 4, we can see that:

• Although the linear CRF with full features has close performance to our model, the linear CRF with only orthographic features performs poorly in distinguishing API tokens from non-API tokens. The best F1-score of the basic CRF is only 0.66, for Pandas. The F1-score of the basic CRF for Matplotlib, NumPy and OpenGL is 0.52-0.59. For JDBC and React, the F1-score of the basic CRF is below 0.3. Our results are consistent with the results in [2] when the advanced word-cluster and API-gazetteer features are ablated. Compared with the basic CRF, the full CRF performs much better. For Pandas, JDBC and React, the full CRF has very close performance to our model. But for Matplotlib, NumPy and OpenGL, our model still outperforms the full CRF by at least four points in F1-score.

• Multi-level feature embedding by our neural architecture is effective in distinguishing API tokens from non-API tokens in the resulting embedding space. All evaluation metrics of our neural architecture are significantly higher than those of the basic CRF. Although our neural architecture performs relatively worse on NumPy and JDBC, its F1-score is still much higher than the F1-score of the basic CRF on NumPy and JDBC. We examine false positives (non-API tokens recognized as APIs) and false negatives (API tokens recognized as non-APIs) by our neural architecture on NumPy and JDBC. We find that many false positives and false negatives involve API mentions composed of complex strings, for example, array expressions for NumPy or SQL statements for JDBC. It seems that the neural networks learn some features from such complex strings that may confuse the model and cause the failure to tell apart API tokens from non-API ones. Furthermore, by analysing the classification results for different API mentions, we find that our model has good generalizability across different API functions.

With linear CRF for API extraction, we cannot achieve a good tradeoff between ease of deployment and high performance. In contrast, our neural architecture can extract effective character-, word- and sentence-level features from the input texts, and the extracted features alone can support high-quality API extraction without using any hand-crafted advanced features such as word clusters or API gazetteers.

6.3 Impact of Feature Extractors (RQ2)

Motivation: Although our neural architecture achieves very good performance as a whole, we would like to further investigate how much the different feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) contribute to this good performance, and how different features affect the precision and recall of API extraction.


TABLE 5: The Results of Feature Ablation Experiments

            Ablating char-CNN       Ablating Word Embeddings  Ablating Bi-LSTM        All features
            Prec   Recall  F1      Prec   Recall  F1        Prec   Recall  F1      Prec   Recall  F1
Matplotlib  75.70  72.32   73.98   81.75  63.99   71.79     82.59  69.34   75.44   81.50  83.92   82.70
NumPy       81.00  60.40   69.21   79.31  58.47   67.31     80.62  56.67   66.44   78.24  62.91   69.74
Pandas      80.50  83.28   81.82   81.77  68.01   74.25     83.22  75.22   79.65   82.73  85.30   82.80
OpenGL      77.58  85.37   81.29   83.07  68.03   75.04     98.52  72.08   83.05   85.83  83.74   84.77
JDBC        68.65  61.25   64.47   64.22  65.62   64.91     99.28  43.12   66.13   84.69  55.31   66.92
React       69.90  85.71   77.01   84.37  75.00   79.41     98.79  65.08   78.47   87.95  78.17   82.77

Approach: We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without the character-level CNN, one without word embeddings, and one without the sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors.

Results: In Table 5, the largest drop in each metric for a library when ablating a feature, compared with that metric with all features, is highlighted in the original table in bold, and an increase in a metric when ablating a feature is underlined. We can see that:

• The performance of our neural architecture is contributed by the combined action of all its features. Ablating any of the features, the F1-score degrades. Ablating word embeddings or the Bi-LSTM causes the largest drop in F1-score for four libraries, while ablating the char-CNN causes the largest drop in F1-score for two libraries. Feature ablation has a higher impact on some libraries than others. For example, ablating the char-CNN, word embeddings and Bi-LSTM all cause a significant drop in F1-score for Matplotlib, and ablating word embeddings causes a significant drop in F1-score for Pandas and React. In contrast, ablating a feature causes a relatively minor drop in F1-score for NumPy, JDBC and React. This indicates that the different levels of features learned by our neural architecture can be more distinct for some libraries but more complementary for others. When different features are more distinct from one another, achieving good API extraction performance relies more on all the features.

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating the char-CNN causes a drop in precision for the five libraries except NumPy. The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating the char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating the char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating the char-CNN, ablating word embeddings or the Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for NumPy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating the Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating the Bi-LSTM, the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. The char-CNN is especially useful for filtering out non-API tokens, while word embeddings and the Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation: As described in Section 6.1.1, we intentionally choose three different-functionality Python libraries: Pandas, Matplotlib and NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach: We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring a Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfers (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs, involving Matplotlib, are relatively more-distant-libraries transfers.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL)

       NP→MPL                 PD→MPL                 MPL
       Prec   Recall  F1     Prec   Recall  F1     Prec   Recall  F1
1/1    82.64  89.29   85.84  81.02  80.06   80.58  87.67  78.27   82.70
1/2    81.84  84.52   83.16  71.61  83.33   76.96  81.38  70.24   75.40
1/4    71.83  75.89   73.81  67.88  77.98   72.65  81.22  55.36   65.84
1/8    70.56  75.60   72.98  69.66  73.81   71.71  75.00  53.57   62.50
1/16   73.56  72.02   72.78  66.48  72.02   69.16  80.70  27.38   40.89
1/32   72.56  78.83   73.69  71.47  69.35   70.47  97.50  11.60   20.74
DU     72.54  66.07   69.16  76.99  54.76   64.00

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP)

       MPL→NP                 PD→NP                  NP
       Prec   Recall  F1     Prec   Recall  F1     Prec   Recall  F1
1/1    86.85  77.08   81.68  77.51  67.50   72.16  78.24  62.92   69.74
1/2    78.80  82.08   80.41  70.13  67.51   68.79  75.13  60.42   66.97
1/4    93.64  76.67   80.00  65.88  70.00   67.81  77.14  45.00   56.84
1/8    73.73  78.33   75.96  65.84  66.67   66.25  71.03  42.92   53.51
1/16   76.19  66.67   71.11  58.33  64.17   61.14  57.07  48.75   52.58
1/32   75.54  57.92   65.56  60.27  56.25   58.23  72.72  23.33   35.33
DU     64.16  65.25   64.71  62.96  59.32   61.08

We use gradually-reduced target-library training data (1/1, i.e., all training data, 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison. We also use the source-library-trained model directly without any fine-tuning (i.e., no target-library training data) as a baseline. Furthermore, we randomly select the 1/16 and 1/32 training data for NumPy → Pandas and Matplotlib → Pandas 10 times and train the model each time. We then calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score.

TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

       MPL→PD                 NP→PD                  PD
       Prec   Recall  F1     Prec   Recall  F1     Prec   Recall  F1
1/1    84.97  89.62   87.23  86.86  87.61   87.23  80.43  85.30   82.80
1/2    86.18  84.43   85.30  83.97  83.00   83.48  88.01  74.06   80.43
1/4    81.07  87.61   84.21  81.07  82.70   81.88  76.50  76.95   76.72
1/8    87.54  80.98   84.13  85.76  76.37   80.79  69.30  83.29   75.65
1/16   82.04  78.96   80.47  82.45  75.79   78.98  84.21  50.72   53.31
1/32   81.31  75.22   78.14  81.43  72.05   76.45  84.25  30.84   45.14
DU     71.65  40.06   51.39  75.00  35.45   48.14

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

       MPL→PD                 NP→PD
       Prec   Recall  F1     Prec   Recall  F1
1/16   83.33  77.81   80.43  83.39  75.22   79.09
1/32   79.50  73.78   76.53  86.15  71.15   78.30

Results: Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. The acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas) and DU (Direct Use). The last column of each table gives the results of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performing target-library model than training the target-library model from scratch with the same proportion of training data. This is evident as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has F1-score 69.74. However, the NumPy model transferred from the Matplotlib model has F1-score 81.68 (a 17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 NumPy training data for fine-tuning, the F1-score (71.11) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for a difficult dataset. Such a performance boost by transfer learning has also been observed in many studies [14] [44] [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model, rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has a better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of F1-score (e.g., 3% of the F1-score of the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has F1-score 80.47.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 training data, i.e., about 10 posts, for transfer learning, the F1-score of the obtained target-library model is still improved by a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (DU row). Furthermore, the averaged results of NP → PD and MPL → PD for the 10 randomly selected 1/16 and 1/32 training data are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows our training data is unbiased. Although the target-library model trained from scratch still has a reasonable F1-score with 1/2 to 1/8 training data, it becomes completely useless with 1/16 or 1/32 training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-scores in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16), compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce much more balanced precision and recall (thus much better F1-score), even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with F1-score > 80 using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences in API-naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning can be. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 target-library training data, compared with the target-library model trained from scratch using all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 training data, compared with the Matplotlib model trained from scratch using all training data.

• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the fewest polysemous APIs (16), while Pandas and NumPy have many more polysemous APIs. Both MPL → PD and MPL → NP produce high-quality target models even with 1/8 target-library training data. The transfer learning seems able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 training data to obtain a high-quality target model (F1-score 83.16), and for PD → MPL, the Matplotlib model (F1-score 80.58) obtained by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.7). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs" learned from NumPy and Pandas to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation: In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach: For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts in each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model. For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational databases). For JavaScript, we choose React (a library for web graphical user interfaces). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.

Results: Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

• Directly reusing the source-language-trained model on the target-language text produces unacceptable performance.

TABLE 10: Python → Java

       Python→Java            Java
       Prec   Recall  F1     Prec   Recall  F1
1/1    77.45  66.56   71.60  84.69  55.31   66.92
1/2    72.20  62.50   67.06  77.38  53.44   63.22
1/4    69.26  50.00   58.08  71.22  47.19   56.77
1/8    50.00  64.06   56.16  75.15  39.69   51.94
1/16   55.71  48.25   52.00  75.69  34.06   46.98
1/32   56.99  40.00   44.83  77.89  23.12   35.66
DU     44.44  28.75   34.91

TABLE 11: Python → JavaScript

       Python→JavaScript      JavaScript
       Prec   Recall  F1     Prec   Recall  F1
1/1    77.45  81.75   86.19  87.95  78.17   82.77
1/2    86.93  76.59   81.43  87.56  59.84   77.70
1/4    86.84  68.65   74.24  83.08  66.26   73.72
1/8    81.48  61.11   69.84  85.98  55.95   67.88
1/16   71.11  63.49   68.08  87.38  38.48   53.44
1/32   66.67  52.38   58.67  65.21  35.71   46.15
DU     51.63  25.00   33.69

For the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16), PD → MPL (64.0), MPL → NP (64.71). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91), Python → JavaScript (33.69), Python → C (35.95). These results suggest that there is still a certain level of commonality between libraries with similar functionalities and of the same programming language, so that the knowledge learned in the source-library model can be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning holds the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 Java training data boosts the F1-score by 50%, compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-language training data.


TABLE 12: Python → C

       Python→C               C
       Prec   Recall  F1     Prec   Recall  F1
1/1    89.87  85.09   87.47  85.83  83.74   84.77
1/2    85.35  83.73   84.54  86.28  76.69   81.21
1/4    83.83  75.88   79.65  82.19  71.27   76.34
1/8    74.32  73.71   74.01  78.68  68.02   72.97
1/16   75.81  69.45   72.60  88.52  57.10   69.42
1/32   69.62  65.85   67.79  87.89  45.25   59.75
DU     56.49  27.91   35.95

For the within-language transfer learning, achieving this result may require as little as 1/8 target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is API labeling errors in the dataset. In order to reduce errors, the first two authors first independently label the same data and then resolve the disagreements by discussion. Furthermore, in our experiments, we examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and to informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42] [47] [48] [49] [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs such as camelcase, special characters (e.g., . or ()) and API annotations.

Several studies, such as Bacchelli et al. [42] [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3] [51] [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF based approach for extracting mentions of software entities such as programming languages, libraries, computing concepts and APIs from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55] [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35] [60] [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features have superior performance to human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture on many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, "Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow."
[2] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, "Learning to extract API mentions from informal natural language discussions," in Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 2016, pp. 389-399.
[3] B. Dagenais and M. P. Robillard, "Recovering traceability links between an API and its learning resources," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 47-57.
[4] S. Subramanian, L. Inozemtseva, and R. Holmes, "Live API documentation," in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 643-652.
[5] C. Chen, Z. Xing, and X. Wang, "Unsupervised software-specific morphological forms inference from informal discussions," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 450-461.
[6] X. Chen, C. Chen, D. Zhang, and Z. Xing, "SEthesaurus: WordNet in software engineering," IEEE Transactions on Software Engineering, 2019.
[7] H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, "Improving API caveats accessibility by mining API caveats knowledge graph," in Software Maintenance and Evolution (ICSME), 2018 IEEE International Conference on. IEEE, 2018.
[8] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, "API method recommendation without worrying about the task-API knowledge gap," in Automated Software Engineering (ASE), 2018 33rd IEEE/ACM International Conference on. IEEE, 2018.
[9] B. Xu, Z. Xing, X. Xia, and D. Lo, "AnswerBot: automated generation of answer summary to developers' technical questions," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 706-716.
[10] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre, "Software-specific named entity recognition in software engineering social content," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 90-101.
[11] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. Association for Computational Linguistics, 2003, pp. 142-147.
[12] R. Caruana, "Learning many related tasks at the same time with backpropagation," in Advances in Neural Information Processing Systems, 1995, pp. 657-664.
[13] A. Arnold, R. Nallapati, and W. W. Cohen, "Exploiting feature hierarchy for transfer learning in named entity recognition," Proceedings of ACL-08: HLT, pp. 245-253, 2008.
[14] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320-3328.
[15] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, "APIReal: an API recognition and linking approach for online developer forums," Empirical Software Engineering, 2018.
[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69-78.
[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.
[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744-755.
[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.
[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.
[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions - incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338-348.
[22] C. Chen and Z. Xing, "SimilarTech: automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834-839.
[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155-1194, 2019.
[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214-224.
[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177-2185.
[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for Twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555-1565.
[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211-225, 2015.
[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.
[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: a deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.
[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414-421.
[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1-32, 2019.
[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17-36.
[39] S. J. Pan, Q. Yang et al., "A survey on transfer learning."
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. ACM, 2010, pp. 375-384.
[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE'09. 16th Working Conference on. IEEE, 2009, pp. 205-214.
[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717-1724.
[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.
[47] A. Marcus, J. Maletic et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125-135.
[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59-68.
[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480-483.
[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307-318.
[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392-403.
[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832-841.
[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13-22.
[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476-479.
[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741-2749.
[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211-1220.
[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.
[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649-657.
[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.
[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.
[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Faculty of Information Technology at Monash University. His research interests are in the areas of software engineering, deep learning and human-computer interaction. He is currently focusing on improving the usability and accessibility of mobile applications. He received the BS and MS degrees from Beijing Technology and Business University and the Australian National University in 2016 and 2018, respectively. He will be a PhD student at Monash University in 2020 under the supervision of Chunyang Chen.

Zhenchang Xing is a Senior Lecturer in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012-2016. Before joining NTU, Dr. Xing was a Lee Kuan Yew Research Fellow in the School of Computing, National University of Singapore, from 2009-2012. Dr. Xing's current research is in the interdisciplinary areas of software engineering, human-computer interaction and applied data analytics. Dr. Xing has over 100 publications in peer-refereed journals and conference proceedings and has received several distinguished paper awards from top software engineering conferences. Dr. Xing regularly serves on the organization and program committees of the top software engineering conferences, and he will be the program committee co-chair for ICSME 2020.

Chunyang Chen is a lecturer (Assistant Professor) in the Faculty of Information Technology, Monash University, Australia. His research focuses on software engineering, deep learning and human-computer interaction. He has published over 16 papers in refereed journals and conferences, including Empirical Software Engineering, ICSE, ASE, CSCW, ICSME and SANER. He is a member of IEEE and ACM. He received the ACM SIGSOFT Distinguished Paper Award in ASE 2018, the best paper award in SANER 2016, and the best tool demo in ASE 2016. https://chunyang-chen.github.io


Cheng Chen received his Bachelor degree in Software Engineering from Northwest University (China) and his Master degree of Computing (major in Artificial Intelligence) from the Australian National University. He did internship projects on Natural Language Processing at CSIRO from 2016 to 2017. He worked as a research assistant supervised by Dr. Zhenchang Xing at ANU in 2018. Cheng currently works at PricewaterhouseCoopers (PwC) as a senior algorithm engineer in Natural Language Processing and Data Analysis. Cheng is interested in Named Entity Extraction, Relation Extraction, Text Summarization and parallel computing. He is working on knowledge engineering and transactions for NLP tasks.

Lizhen Qu is a research fellow in the Dialogue Research Lab in the Faculty of Information Technology at Monash University. He has extensive research experience in the areas of natural language processing, multimodal learning, deep learning and cybersecurity. He is currently focusing on information extraction, semantic parsing and multimodal dialogue systems. Prior to joining Monash University, he worked as a research scientist at Data61/CSIRO, where he led and participated in several research and industrial projects, including Deep Learning for Text and Deep Learning for Cyber.

Guoqiang Li is an associate professor in the School of Software, Shanghai Jiao Tong University, and a guest associate professor at Kyushu University. He received the BS, MS and PhD degrees from Taiyuan University of Technology, Shanghai Jiao Tong University and Japan Advanced Institute of Science and Technology in 2001, 2005 and 2008, respectively. He worked as a postdoctoral research fellow in the Graduate School of Information Science, Nagoya University, Japan, during 2008-2009, as an assistant professor in the School of Software, Shanghai Jiao Tong University, during 2009-2013, and as an academic visitor in the Department of Computer Science, University of Oxford, during 2015-2016. His research interests include formal verification, programming language theory, and data analytics and intelligence. He has published more than 40 research papers in international journals and mainstream conferences, including TDSC, SPE, TECS, IJFCS, SCIS, CSCW, ICSE, FORMATS and ATVA.


similar words, the trained model will be restricted to the examples that it sees in the training data.

To address this dilemma, we propose to exploit unsupervised word-embedding models [21], [22], [23], [24] to learn distributed word representations from a large amount of unlabeled text discussing a particular library. Many studies [25], [26], [27] have shown that distributed word representations can capture rich semantics of words, such that semantically similar words will have similar word embeddings.

Let V^word be the vocabulary of words for the corpus of software texts to be processed. As illustrated in Fig. 3, word-level embeddings are encoded by column vectors in a word embedding matrix E^word ∈ R^(d^word × |V^word|), where d^word is the dimension of word embeddings and |V^word| is the vocabulary size. Each column E^word_i ∈ R^(d^word) corresponds to the word-level embedding of the i-th word in the vocabulary V^word. We can obtain a token w's word-level embedding e^word_w by looking up E^word with the word w, i.e., e^word_w = E^word v_w, where v_w is a one-hot vector of size |V^word| which has value 1 at index w and zero in all other positions. The matrix E^word is to be learned using unsupervised word embedding models (e.g., GloVe [28]), and d^word is a hyper-parameter to be chosen by the user (see Section 5.2 for model configuration).
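To make the lookup concrete, the following minimal sketch (illustrative only, not the paper's implementation) shows that multiplying E^word by a one-hot vector is equivalent to selecting one column of the matrix; the sizes follow the configuration in Section 5.2.

    import numpy as np

    # Illustrative sizes from Section 5.2: |V^word| = 40000, d^word = 200.
    V_WORD, D_WORD = 40000, 200
    E_word = np.random.randn(D_WORD, V_WORD) * 0.01  # to be learned by GloVe

    def word_embedding(word_index: int) -> np.ndarray:
        # Multiplying E_word by a one-hot vector v_w selects one column.
        v_w = np.zeros(V_WORD)
        v_w[word_index] = 1.0
        return E_word @ v_w  # equivalent to E_word[:, word_index]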

In this work, we adopt the Global Vectors for Word Representation (GloVe) method [28] to learn the matrix E^word. GloVe is an unsupervised algorithm for learning word representations based on the statistics of word co-occurrences in an unlabeled text corpus. It calculates the word embeddings based on a word co-occurrence matrix X. Each row in the word co-occurrence matrix corresponds to a word and each column corresponds to a context. X_ij is the frequency of word i co-occurring with word j, and X_i = Σ_k X_ik (1 ≤ k ≤ |V^word|) is the total number of occurrences of word i in the corpus. The probability of word j occurring in the context of word i is P_ij = P(j|i) = X_ij / X_i. We have log(P_ij) = log(X_ij) − log(X_i).

GloVe defines log(P_ij) = e_wi^T e_wj, where e_wi and e_wj are the word embeddings to be learned for the words w_i and w_j. This gives the constraint for each word pair as log(X_ij) = e_wi^T e_wj + b_i + b_j, where b is the bias term for e_w. The cost function for minimizing the loss of word embeddings is defined as

J = Σ_{i,j=1}^{|V^word|} f(X_ij) (e_wi^T e_wj + b_i + b_j − log(X_ij))^2

where f(X_ij) is a weighting function. That is, GloVe learns word embeddings by a weighted least squares regression model.
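As an aside, this weighted least squares objective can be written down compactly in code. The sketch below is a simplified illustration (not the official GloVe implementation); the weighting function f with x_max = 100 and alpha = 0.75 follows the defaults reported in [28].

    import numpy as np

    # X: (V, V) word co-occurrence matrix; e_w, e_c: (V, d) embeddings for
    # the word and context roles; b_w, b_c: (V,) bias terms.
    def glove_loss(X, e_w, e_c, b_w, b_c, x_max=100.0, alpha=0.75):
        i, j = np.nonzero(X)                          # non-zero co-occurrences only
        x_ij = X[i, j]
        f = np.minimum((x_ij / x_max) ** alpha, 1.0)  # weighting function f(X_ij)
        dot = np.sum(e_w[i] * e_c[j], axis=1)         # e_wi^T e_wj
        residual = dot + b_w[i] + b_c[j] - np.log(x_ij)
        return np.sum(f * residual ** 2)              # weighted least squares loss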

4.3 Extracting Sentence-Context Features by Bi-LSTM

In informal software texts, many API mentions cannot be reliably recognized by simply examining a token's character-level features and word semantics. This is because many APIs are named using common English words (e.g., series, apply, plot) or common computing terms (e.g., dataframe, sigmoid, histogram, argmax, zip, list). When such APIs are mentioned by their simple name, this results in a common-word polysemy issue for API extraction [2]. In such situations, we have to disambiguate the API sense of a common word from the normal sense of the word.

To that end, we have to look into the sentence context in which an API is mentioned. For example, by looking into the sentence context of the two sentences in Fig. 1, we can determine that the "apply" in the first sentence is an API mention, while the "apply" in the second sentence is not an API mention. Note that both the preceding and succeeding context of the token "apply" are useful for disambiguating the API or the normal sense of the token.

We use a Recurrent Neural Network (RNN) to extract sentence-context features for disambiguating the API or the normal sense of the word [29]. RNN is a class of neural networks where connections between units form directed cycles, and it is widely used in the software engineering domain [30], [31], [32], [33]. Due to this nature, it is especially useful for tasks involving sequential inputs [29] like sentences. In our task, we adopt a Bidirectional RNN (Bi-RNN) architecture [34], [35], which is composed of two LSTMs: one takes input from the beginning of the text forward till a particular token, while the other takes input from the end of the text backward till that token.

The input to an RNN is a sequence of vectors. In our task, we obtain the input vector of a token in the input text by concatenating the character-level embedding of the token w and the word-level embedding of the token w, i.e., e^char_w ⊕ e^word_w. An RNN recursively maps an input vector x_t and a hidden state h_{t−1} to a new hidden state h_t: h_t = f(h_{t−1}, x_t), where f is a non-linear activation function (e.g., an LSTM unit used in this work). A hidden state is a vector e^sent ∈ R^(d^sent) summarizing the sentence-level features till the input x_t, where d^sent is the dimension of the hidden state vector to be chosen by the user. We denote e^sent_f and e^sent_b as the hidden states computed by the forward LSTM and the backward LSTM after reading the end of the preceding and the succeeding sentence context of a token, respectively. e^sent_f and e^sent_b are concatenated into one vector e^sent_w as the Bi-LSTM output for the corresponding token w in the input text.

As an input text (e.g., a Stack Overflow post) can be a long text, modeling long-range dependencies in the text is crucial for our task. For example, a mention of a library name at the beginning of a post could be important for detecting a mention of a method of this library later in the post. Therefore, we adopt the LSTM unit [36], [37] in our RNN. The LSTM is designed to cope with the gradient vanishing problem in RNN. An LSTM unit consists of a memory cell and three gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, some contexts can be cleared by the forget gate. The memory cell and the three gates have weights and bias terms to be learned during model training. The Bi-LSTM can extract the context feature. For instance, in the Pandas library sentence "This can be accomplished quite simply with the DataFrame method apply", based on the context information from the word method, the Bi-LSTM can help our model classify the word apply as an API mention, and the learning can be transferred across different languages and libraries with transfer learning.
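The following PyTorch sketch illustrates this component under our reading of the architecture (it is not the authors' released code). The input dimension assumes the 40-d character-level embedding and 200-d word embedding reported in Section 5.2, and d^sent = 50 as in Section 5.2.3.

    import torch.nn as nn

    class SentenceBiLSTM(nn.Module):
        def __init__(self, input_dim=240, d_sent=50):
            super().__init__()
            self.bilstm = nn.LSTM(input_dim, d_sent,
                                  batch_first=True, bidirectional=True)

        def forward(self, token_vecs):   # (batch, seq_len, 240): e^char_w ⊕ e^word_w
            # The output concatenates forward and backward hidden states per
            # token, giving the 2 * d_sent = 100-d context vector e^sent_w.
            e_sent, _ = self.bilstm(token_vecs)
            return e_sent                # (batch, seq_len, 100)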

4.4 API Labeling by Softmax Classification

Given the Bi-LSTM's output vector e^sent_w for a token w in the input text, we train a binary softmax classifier to predict whether the token is an API mention or a normal English word. That is, we have two classes in this work, i.e., API or non-API. Softmax predicts the probability for the j-th class given a token's vector e^sent_w by

P(j | e^sent_w) = exp(e^sent_w W_j^T) / Σ_{k=1}^{2} exp(e^sent_w W_k^T)

where the vectors W_k (k = 1 or 2) are parameters to be learned.
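In implementation terms, this classifier is simply a linear layer over e^sent_w followed by a softmax; a hedged sketch (with the dropout from Section 5.2.3 included) could look as follows. In practice, one would train on the raw logits with a cross-entropy loss, which fuses the softmax for numerical stability.

    import torch.nn as nn

    classifier = nn.Sequential(
        nn.Dropout(p=0.5),   # dropout on the Bi-LSTM output (Section 5.2.3)
        nn.Linear(100, 2),   # e^sent_w (100-d) -> logits e^sent_w W_j^T, j in {1, 2}
        nn.Softmax(dim=-1),  # P(j | e^sent_w) over {API, non-API}
    )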

4.5 Library Adaptation by Transfer Learning

Transfer learning [12], [38] is a machine learning method that stores knowledge gained while solving one problem and applies it to a different but related problem. It helps to reduce the amount of training data required for the target problem. When transfer learning is used in deep learning, it has been shown to be beneficial for narrowing down the scope of possible models on a task by using the weights of a model trained on a different but related task [14]; with shared parameters, transfer learning can help the model adapt the shared knowledge in similar contexts [39].

API extraction tasks for different libraries can be regarded as a set of different but related tasks. Therefore, we use transfer learning to adapt a neural model trained with one library's text (source-library-trained model) to another library's text (target library). Our neural architecture is composed of four main components: character-level CNN, word embeddings, sentence Bi-LSTM and softmax classifier. We can transfer all or some source-library-trained model components to a target-library model. Without transfer learning, the parameters of a target-library model component will be randomly initialized and then learned using the target-library training data, i.e., trained from scratch. With transfer learning, we use the parameters of source-library-trained model components to initialize the parameters of the corresponding target-library model components. After transferring the model parameters, we can either freeze the transferred parameters or fine-tune the transferred parameters using the target-library training data.
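A minimal sketch of this procedure in PyTorch is shown below; the model and component names (e.g., target_model, char_cnn) are hypothetical placeholders, not identifiers from our released code.

    import torch

    # Initialize the target-library model with source-library-trained parameters.
    target_model.load_state_dict(torch.load("source_library_model.pt"))

    # Option 1: freeze a transferred component (here, the char-level CNN).
    for p in target_model.char_cnn.parameters():
        p.requires_grad = False

    # Option 2: fine-tune the (remaining) parameters on target-library data.
    optimizer = torch.optim.Adam(
        [p for p in target_model.parameters() if p.requires_grad])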

5 SYSTEM IMPLEMENTATION

This section describes the current implementation² of our neural architecture for API extraction.

5.1 Preprocessing and Tokenizing Input Text

Our current implementation takes as input the content of a Stack Overflow post. As we want to recognize API mentions within natural language sentences, we remove stand-alone code snippets in <pre><code> tags. We then remove all HTML tags in the post content to obtain the plain text

2. https://github.com/JOJO201/API_Extraction

input (see Section 3 for the justification of taking plain text as input). We develop a sentence parser to split the preprocessed post text into sentences by punctuation. Following [2], we develop a software-specific tokenizer to tokenize the sentences. This tokenizer preserves the integrity of code-like tokens. For example, it treats matplotlib.pyplot.imshow() as a single token instead of a sequence of 7 tokens, i.e., "matplotlib", ".", "pyplot", ".", "imshow", "(", ")", produced by general English tokenizers.
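To give a flavor of this step, the sketch below shows a greatly simplified tokenizer; the actual software-specific tokenizer in [2] and in our implementation handles many more cases, so the regular expression here is illustrative only.

    import re

    # Match code-like tokens (dotted names, call suffixes) before plain
    # words and punctuation, so they survive as single tokens.
    CODE_LIKE = r"[A-Za-z_][\w.]*\(\)|[A-Za-z_]\w*(?:\.\w+)+|\w+|[^\w\s]"

    def tokenize(sentence: str):
        return re.findall(CODE_LIKE, sentence)

    print(tokenize("Use matplotlib.pyplot.imshow() to show an image."))
    # ['Use', 'matplotlib.pyplot.imshow()', 'to', 'show', 'an', 'image', '.']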

5.2 Model Configuration

We now describe the hyper-parameter settings used in our current implementation. These hyper-parameters are tuned using the validation data during model training (see Section 6.1.2 for the description of our dataset). We find that the neural model has very similar performance across the six libraries' datasets with the same hyper-parameters. Therefore, we keep the same hyper-parameters for the six libraries' datasets, which also avoids the difficulty of tuning different hyper-parameters per library.

5.2.1 Character-level CNN

We set the filter window size h = 3. That is, the convolution operation extracts local features from 3 adjacent characters at a time. The size of our current character vocabulary V^char is 92. Thus, we set the dimensionality of the character embedding d^char at 92. We initialize the character embedding with a one-hot vector (1 at one character index and zero in all other dimensions). The character embeddings will be updated through back propagation during the training of the character-level CNN. We experiment with 5 different values of N (the number of filters): 20, 40, 60, 80, 100. With N = 40, the CNN has an acceptable performance on the validation data. With N = 60 and above, the CNN has almost the same performance as N = 40, but it takes more training epochs to reach the acceptable performance. Therefore, we use N = 40. That is, the character-level embedding of a token has the dimension 40.
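Putting these settings together, a hedged PyTorch sketch of the character-level CNN might look as follows; we assume max-over-time pooling to turn the per-window features into a fixed 40-d token embedding.

    import torch
    import torch.nn as nn

    class CharCNN(nn.Module):
        def __init__(self, n_chars=92, d_char=92, n_filters=40, h=3):
            super().__init__()
            self.embed = nn.Embedding(n_chars, d_char)        # one-hot init
            self.embed.weight.data.copy_(torch.eye(n_chars))  # updated in training
            self.conv = nn.Conv1d(d_char, n_filters, kernel_size=h, padding=1)

        def forward(self, char_ids):                  # (batch, token_len)
            x = self.embed(char_ids).transpose(1, 2)  # (batch, 92, token_len)
            x = torch.relu(self.conv(x))              # (batch, 40, token_len)
            return x.max(dim=2).values                # 40-d character-level embedding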

5.2.2 Pre-trained word embeddings

Our experiments involve six libraries: Pandas, Matplotlib, NumPy, OpenGL, React and JDBC. We collect all questions tagged with these six libraries and all answers to these questions in the Stack Overflow Data Dump released on March 18, 2018. We obtain a text corpus of 380971 posts. We use the same preprocessing and tokenization steps as described in Section 5.1 to preprocess and tokenize the content of these posts. Then we use GloVe [28] to learn word embeddings from this text corpus. We set the word vocabulary size |V^word| at 40000. The training epoch is set at 100 to ensure sufficient training of the word embeddings. We experiment with four different dimensions of word embeddings d^word: 50, 100, 200, 400. We use d^word = 200 in our current implementation, as it produces a good balance between training efficiency and the quality of word embeddings for API extraction on the validation data. We also experiment with pre-training word embeddings on all Stack Overflow posts, which does not significantly affect the API extraction performance but requires much longer time for text preprocessing and word embeddings learning.


5.2.3 Sentence-context Bi-LSTM

For the RNN, we use 50 hidden LSTM units to store the hidden states. The dimension of the hidden state vector is 50. Therefore, the dimension of the output vector e^sent_w is 100 (concatenating forward and backward hidden states). In order to mitigate overfitting [40], a dropout layer is added on the output of the Bi-LSTM, with the dropout rate 0.5.

5.3 Model Training

To train our neural model, we use input text and its corresponding sequence of API/non-API labels (see Section 6.1.2 for our data labeling process). The optimizer used is Adam [41], which performs well for training RNNs including Bi-LSTM [35]. The training epoch is set at 40. The best-performing model on the validation set is saved for testing. This model's parameters are also saved for the transfer learning experiments.
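A condensed sketch of this training loop is given below; model, train_loader, val_loader and evaluate are hypothetical placeholders for the full model and data pipeline, and the loss is token-level cross-entropy over the two classes.

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())
    best_f1 = 0.0

    for epoch in range(40):                       # training epoch is set at 40
        model.train()
        for tokens, labels in train_loader:       # labels: 0 = non-API, 1 = API
            optimizer.zero_grad()
            logits = model(tokens)                # (batch, seq_len, 2)
            loss = criterion(logits.view(-1, 2), labels.view(-1))
            loss.backward()
            optimizer.step()
        f1 = evaluate(model, val_loader)          # hypothetical helper
        if f1 > best_f1:                          # keep the best model on validation
            best_f1 = f1
            torch.save(model.state_dict(), "best_model.pt")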

6 EXPERIMENTS

We conduct a series of experiments to answer the following four research questions:

• RQ1: How well can our neural architecture learn multi-level features from the input texts? Can the learned features support high-quality API extraction compared with existing machine learning based API extraction methods?

• RQ2: What is the impact of the three feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) on the API extraction performance?

• RQ3: How effective is transfer learning for adapting API extraction models across libraries of the same language with different amounts of target-library training data?

• RQ4: How effective is transfer learning for adapting API extraction models across libraries of different languages with different amounts of target-library training data?

6.1 Experiments Setup

This section describes the libraries used in our experiments, how we prepare training and testing data, and the evaluation metrics of API extraction performance.

6.1.1 Studied libraries

Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As reported in Section 2, these six libraries support very diverse functionalities for computer programming and have distinct API-naming and API-mention characteristics. Using these libraries, we can evaluate the effectiveness of our neural architecture for API extraction in very diverse data settings. We can also gain insights into the effectiveness of transfer learning for API extraction and the transferability of our neural architecture in different settings.

6.1.2 Dataset

We collect Stack Overflow posts (questions and their answers) tagged with pandas, numpy, matplotlib, opengl, react or jdbc as our experimental data. We use the Stack Overflow Data Dump released on March 18, 2018. In this data dump,

TABLE 3: Basic Statistics of Our Dataset

Library     Posts  Sentences  API mentions  Tokens
Matplotlib  600    4920       1481          47317
NumPy       600    2786       1552          39321
Pandas      600    3522       1576          42267
OpenGL      600    3486       1674          70757
JDBC        600    4205       1184          50861
React       600    3110       1262          42282
Total       3600   22029      8729          292805

380971 posts are tagged with one of the six studied libraries. Our data collection process follows four criteria. First, the number of posts selected and the number of API mentions in these posts should be at the same order of magnitude. Second, API mentions should exhibit the variations of API writing forms commonly seen on Stack Overflow. Third, the selection should avoid repeatedly selecting frequently mentioned APIs. Fourth, the same post should not appear in the dataset for different libraries (one post may be tagged with two or more studied-library tags). The first criterion ensures the fairness of comparison across libraries, the second and third criteria ensure the representativeness of API mentions in the dataset, and the fourth criterion ensures that there is no repeated data in different datasets.

We finally include 3600 posts (600 for each library) in our dataset, and each post has at least one API mention³. Table 3 summarizes the basic statistics of our dataset. These posts have in total 22029 sentences, 292805 token occurrences and 41486 unique tokens after data preprocessing and tokenization. We manually label the API mentions in the selected posts. 6421 sentences (34.07%) contain at least one API mention. Our dataset has in total 8729 API mentions. The selected Matplotlib, NumPy, Pandas, OpenGL, JDBC and React posts have 1481, 1552, 1576, 1674, 1184 and 1262 API mentions, which refer to 553, 694, 293, 201, 282 and 83 unique APIs of the six libraries, respectively. In addition, we randomly sample 100 labelled Stack Overflow posts for each library dataset. Then we examine each API mention to determine whether it is a simple name or not, and we find that among the 600 posts there are 1634 API mentions, of which 694 (42.47%) are simple API names. Our dataset not only contains a large number of API mentions in diverse writing forms, but also contains rich discussion context around API mentions. This makes it suitable for the training of our neural architecture, the testing of its performance and the study of model transferability.

We randomly split the 600 posts of each library into three subsets by a 6:2:2 ratio: 60% as training data, 20% as validation data for tuning model hyper-parameters, and 20% as testing data to answer our research questions. Since cross-validation would cost a great deal of time and our training dataset is unbiased (shown in Section 6.4), cross-validation is not used. Our experiment results also show that the training dataset is enough to train a high-performance neural network. Besides, the performance of the model on a target library can be improved by adding more training data from other libraries. And if more training data is added, there is no need to label a new validation dataset for the target library to re-tune the hyper-parameters, because we find that the neural model has very similar performance with the same hyper-parameters under different dataset settings.

3. https://drive.google.com/drive/folders/1f7ejNVUsew9l9uPCj4Xv5gMzNbqttpo?usp=sharing


TABLE 4: Comparison of CRF Baselines [2] and Our Neural Model for API Extraction

            Basic CRF              Full CRF               Our method
Library     Prec   Recall  F1     Prec   Recall  F1      Prec   Recall  F1
Matplotlib  67.35  47.84   56.62  89.62  61.31   72.82   81.50  83.92   82.70
NumPy       75.83  39.91   52.29  89.21  49.31   63.51   78.24  62.91   69.74
Pandas      62.06  70.49   66.01  97.11  71.21   82.16   82.73  85.30   82.80
OpenGL      43.91  91.87   59.42  94.57  70.62   80.85   85.83  83.74   84.77
JDBC        15.83  82.61   26.58  87.32  51.40   64.71   84.69  55.31   66.92
React       16.74  88.58   28.16  97.42  70.11   81.53   87.95  78.17   82.77


6.1.3 Evaluation metrics

We use precision, recall and F1-score to evaluate the performance of an API extraction method. Precision measures what percentage of the API mentions recognized by a method are correct. Recall measures what percentage of the API mentions in the testing dataset are recognized correctly by a method. F1-score is the harmonic mean of precision and recall, i.e., 2 × ((precision × recall) / (precision + recall)).
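For clarity, the computation can be expressed as a small helper over the sets of tokens labeled as API mentions (a sketch, with tokens identified by hypothetical (post_id, position) pairs):

    def precision_recall_f1(predicted, gold):
        # `predicted` and `gold` are sets of (post_id, position) pairs
        # marking tokens labeled as API mentions.
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1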

6.2 Performance of Our Neural Model (RQ1)

Motivation. Recently, linear-chain CRF has been used to solve the API extraction problem in informal text [2]. It has been shown that machine-learning based API extraction outperforms several commonly-used rule-based methods [2], [42], [43]. The approach in [2] relies on human-defined orthographic features, two different unsupervised language models (class-based Brown clustering and neural-network based word embedding) and API gazetteer features (API inventory). In contrast, our neural architecture uses neural networks to automatically learn character-, word- and sentence-level features from the input texts. The first RQ is to confirm the effectiveness of these neural-network feature extractors for API extraction tasks by comparing the overall performance of our neural architecture with the performance of the linear-chain CRF with human-defined features for API extraction.

Approach. We use the implementation of the CRF model in [2] to build the two CRF baselines. The basic CRF baseline uses only the orthographic features developed in [2]. The full CRF baseline uses all orthographic, word-cluster and API gazetteer features developed in [2]. The basic CRF baseline is easy to deploy because it uses only the orthographic features in the input texts, but not any advanced word-cluster and API gazetteer features. The self-training process is not used in these two baselines, since we have sufficient training data. In contrast, the full CRF baseline uses advanced features, so it is not as easy-to-deploy as the basic CRF. But the full CRF has much better performance than the basic CRF, as reported in [2]. Comparing our model with these two baselines, we can understand whether our model can achieve a good tradeoff between easy-to-deploy and high performance. Note that as Ye et al. [2] already demonstrate the superior performance of machine-learning based API extraction methods over rule-based methods, we do not consider rule-based baselines for API extraction in this study.

Results. From Table 4, we can see that:

• Although linear CRF with full features has close performance to our model, linear CRF with only orthographic features performs poorly in distinguishing API tokens from non-API tokens. The best F1-score of the basic CRF is only 0.66, for Pandas. The F1-score of the basic CRF for Matplotlib, NumPy and OpenGL is 0.52-0.59. For JDBC and React, the F1-score of the basic CRF is below 0.3. Our results are consistent with the results in [2] when advanced word-cluster and API-gazetteer features are ablated. Compared with the basic CRF, the full CRF performs much better. For Pandas, JDBC and React, the full CRF has very close performance to our model. But for Matplotlib, NumPy and OpenGL, our model still outperforms the full CRF by at least four points in F1-score.

• Multi-level feature embedding by our neural architecture is effective in distinguishing API tokens from non-API tokens in the resulting embedding space. All evaluation metrics of our neural architecture are significantly higher than those of the basic CRF. Although our neural architecture performs relatively worse on NumPy and JDBC, its F1-score is still much higher than the F1-score of the basic CRF on NumPy and JDBC. We examine false positives (non-API tokens recognized as API) and false negatives (API tokens recognized as non-API) by our neural architecture on NumPy and JDBC. We find that many false positives and false negatives involve API mentions composed of complex strings, for example, array expressions for NumPy or SQL statements for JDBC. It seems that neural networks learn some features from such complex strings that may confuse the model and cause the failure to tell apart API tokens from non-API ones. Furthermore, by analysing the classification results for different API mentions, we find that our model has good generalizability on different API functions.

With linear CRF for API extraction, we cannot achieve a good tradeoff between easy-to-deploy and high performance. In contrast, our neural architecture can extract effective character-, word- and sentence-level features from the input texts, and the extracted features alone can support high-quality API extraction without using any hand-crafted advanced features such as word clusters and API gazetteers.

6.3 Impact of Feature Extractors (RQ2)

Motivation. Although our neural architecture achieves very good performance as a whole, we would like to further investigate how much the different feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) contribute to this good performance, and how different features affect the precision and recall of API extraction.


TABLE 5: The Results of Feature Ablation Experiments

            Ablating char CNN      Ablating Word Embeddings  Ablating Bi-LSTM       All features
            Prec   Recall  F1      Prec   Recall  F1         Prec   Recall  F1      Prec   Recall  F1
Matplotlib  75.70  72.32   73.98   81.75  63.99   71.79      82.59  69.34   75.44   81.50  83.92   82.70
NumPy       81.00  60.40   69.21   79.31  58.47   67.31      80.62  56.67   66.44   78.24  62.91   69.74
Pandas      80.50  83.28   81.82   81.77  68.01   74.25      83.22  75.22   79.65   82.73  85.30   82.80
OpenGL      77.58  85.37   81.29   83.07  68.03   75.04      98.52  72.08   83.05   85.83  83.74   84.77
JDBC        68.65  61.25   64.47   64.22  65.62   64.91      99.28  43.12   66.13   84.69  55.31   66.92
React       69.90  85.71   77.01   84.37  75.00   79.41      98.79  65.08   78.47   87.95  78.17   82.77

Approach. We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without character-level CNN, one without word embeddings, and one without sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors.

Results. In Table 5, we highlight the largest drop in each metric for a library when ablating a feature, compared with that metric with all features, and we mark the cases where a metric increases for a library when ablating a feature. We can see that:

• The performance of our neural architecture is contributed by the combined action of all its features. Ablating any of the features, the F1-score degrades. Ablating word embeddings or Bi-LSTM causes the largest drop in F1-score for four libraries, while ablating char-CNN causes the largest drop in F1-score for two libraries. Feature ablation has higher impact on some libraries than others. For example, ablating char-CNN, word embeddings and Bi-LSTM all cause significant drops in F1-score for Matplotlib, and ablating word embeddings causes a significant drop in F1-score for Pandas and React. In contrast, ablating a feature causes relatively minor drops in F1-score for NumPy, JDBC and React. This indicates the different levels of features learned by our neural architecture can be more distinct for some libraries but more complementary for others. When different features are more distinct from one another, achieving good API extraction performance relies more on all the features.

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating char-CNN causes a drop in precision for the five libraries except NumPy. The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating char-CNN, ablating word embeddings and Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for NumPy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating Bi-LSTM, the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. Char-CNN is especially useful for filtering out non-API tokens, while word embeddings and Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation. As described in Section 6.1.1, we intentionally choose three different-functionality Python libraries: Pandas, Matplotlib and NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach. We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring a Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfer (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs, involving Matplotlib, are relatively more-distant-libraries transfer.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL)

       NP→MPL                 PD→MPL                 MPL
       Prec   Recall  F1      Prec   Recall  F1      Prec   Recall  F1
1/1    82.64  89.29   85.84   81.02  80.06   80.58   87.67  78.27   82.70
1/2    81.84  84.52   83.16   71.61  83.33   76.96   81.38  70.24   75.40
1/4    71.83  75.89   73.81   67.88  77.98   72.65   81.22  55.36   65.84
1/8    70.56  75.60   72.98   69.66  73.81   71.71   75.00  53.57   62.50
1/16   73.56  72.02   72.78   66.48  72.02   69.16   80.70  27.38   40.89
1/32   72.56  78.83   73.69   71.47  69.35   70.47   97.50  11.60   20.74
DU     72.54  66.07   69.16   76.99  54.76   64.00

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP)

       MPL→NP                 PD→NP                  NP
       Prec   Recall  F1      Prec   Recall  F1      Prec   Recall  F1
1/1    86.85  77.08   81.68   77.51  67.50   72.16   78.24  62.92   69.74
1/2    78.80  82.08   80.41   70.13  67.51   68.79   75.13  60.42   66.97
1/4    93.64  76.67   80.00   65.88  70.00   67.81   77.14  45.00   56.84
1/8    73.73  78.33   75.96   65.84  66.67   66.25   71.03  42.92   53.51
1/16   76.19  66.67   71.11   58.33  64.17   61.14   57.07  48.75   52.58
1/32   75.54  57.92   65.56   60.27  56.25   58.23   72.72  23.33   35.33
DU     64.16  65.25   64.71   62.96  59.32   61.08

We use gradually-reduced target-library training data (1/1 for all data, 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison.


TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

       MPL→PD                 NP→PD                  PD
       Prec   Recall  F1      Prec   Recall  F1      Prec   Recall  F1
1/1    84.97  89.62   87.23   86.86  87.61   87.23   80.43  85.30   82.80
1/2    86.18  84.43   85.30   83.97  83.00   83.48   88.01  74.06   80.43
1/4    81.07  87.61   84.21   81.07  82.70   81.88   76.50  76.95   76.72
1/8    87.54  80.98   84.13   85.76  76.37   80.79   69.30  83.29   75.65
1/16   82.04  78.96   80.47   82.45  75.79   78.98   84.21  50.72   53.31
1/32   81.31  75.22   78.14   81.43  72.05   76.45   84.25  30.84   45.14
DU     71.65  40.06   51.39   75.00  35.45   48.14

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

       MPL→PD                 NP→PD
       Prec   Recall  F1      Prec   Recall  F1
1/16   83.33  77.81   80.43   83.39  75.22   79.09
1/32   79.50  73.78   76.53   86.15  71.15   78.30

We also use the source-library-trained model directly without any fine-tuning (i.e., 0/1 target-library data) as a baseline. We also randomly select 1/16 and 1/32 training data for NumPy → Pandas and Matplotlib → Pandas 10 times and train the model each time. Then we calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score.

Results. Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. The acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas), DU (Direct Use). The last column is the results of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performance target-library model than training the target-library model from scratch with the same proportion of training data. This is evident as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has F1-score 69.74, while the NumPy model transferred from the Matplotlib model has F1-score 81.68 (a 17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 NumPy training data for fine-tuning, the F1-score (71.11) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for a difficult dataset. Such a performance boost by transfer learning has also been observed in many studies [14], [44], [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model, rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of F1-score (e.g., 3% of the F1-score for the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has F1-score 80.47.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 training data, i.e., about 10 posts, for transfer learning, the F1-score of the obtained target-library model is still improved by a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (DU row). Furthermore, the averaged results of NP → PD and MPL → PD for the 10 randomly selected 1/16 and 1/32 training sets are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows our training data is unbiased. Although the target-library model trained from scratch still has reasonable F1-score with 1/2 to 1/8 training data, it becomes completely useless with 1/16 or 1/32 training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-score in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16), compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce a much more balanced precision and recall (thus much better F1-score), even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with F1-score > 80 using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences in API naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning can be. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 target-library training data, compared with the target-library model trained from scratch using all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 training data, compared with the Matplotlib model trained from scratch using all training data.

• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the fewest polysemous APIs (16), and Pandas and NumPy have many more polysemous APIs. Both MPL → PD and MPL → NP produce high-quality target models even with 1/8 target-library training data. Transfer learning seems to be able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 training data to obtain a high-quality target model (F1-score 83.16), and for PD → MPL, the Matplotlib model (F1-score 80.58) obtained by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.7). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs" learned from NumPy and Pandas to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation. In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach. For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts in each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model. For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational databases). For JavaScript, we choose React (a library for web graphical user interfaces). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.

Results. Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

TABLE 10: Python → Java

       Python→Java            Java
       Prec   Recall  F1      Prec   Recall  F1
1/1    77.45  66.56   71.60   84.69  55.31   66.92
1/2    72.20  62.50   67.06   77.38  53.44   63.22
1/4    69.26  50.00   58.08   71.22  47.19   56.77
1/8    50.00  64.06   56.16   75.15  39.69   51.94
1/16   55.71  48.25   52.00   75.69  34.06   46.98
1/32   56.99  40.00   44.83   77.89  23.12   35.66
DU     44.44  28.75   34.91

TABLE 11: Python → JavaScript

       Python→JavaScript      JavaScript
       Prec   Recall  F1      Prec   Recall  F1
1/1    77.45  81.75   86.19   87.95  78.17   82.77
1/2    86.93  76.59   81.43   87.56  59.84   77.70
1/4    86.84  68.65   74.24   83.08  66.26   73.72
1/8    81.48  61.11   69.84   85.98  55.95   67.88
1/16   71.11  63.49   68.08   87.38  38.48   53.44
1/32   66.67  52.38   58.67   65.21  35.71   46.15
DU     51.63  25.00   33.69

• Directly reusing the source-language-trained model on the target-language text produces unacceptable performance. For the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16), PD → MPL (64.00), MPL → NP (64.71). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91), Python → JavaScript (33.69), Python → C (35.95). These results suggest that there is still a certain level of commonality between libraries with similar functionalities and of the same programming language, so that the knowledge learned in the source-library model may be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or we may exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning exhibits the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model. For Python → JavaScript and Python → C, with as little as 1/16 target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 Java training data boosts the F1-score by 50% compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-language training data. For within-language transfer learning, achieving this result may require as little as 1/8 target-library training data.


TABLE 12: Python → C

       Python→C               C
       Prec   Recall  F1      Prec   Recall  F1
1/1    89.87  85.09   87.47   85.83  83.74   84.77
1/2    85.35  83.73   84.54   86.28  76.69   81.21
1/4    83.83  75.88   79.65   82.19  71.27   76.34
1/8    74.32  73.71   74.01   78.68  68.02   72.97
1/16   75.81  69.45   72.60   88.52  57.10   69.42
1/32   69.62  65.85   67.79   87.89  45.25   59.75
DU     56.49  27.91   35.95


Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is API labeling errors in the dataset. In order to decrease errors, the first two authors first independently label the same data and then resolve the disagreements by discussion. Furthermore, in our experiments, we have to examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API-related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from

Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs, such as camelcase, special characters (e.g., . or ()) and API annotations.

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] propose a CRF-based approach for extracting mentions of software entities, such as programming languages, libraries, computing concepts and APIs, from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features have superior performance than human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture on many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C Parnin C Treude L Grammel and M-A Storey ldquoCrowddocumentation Exploring the coverage and the dynamics of apidiscussions on stack overflowrdquo

[2] D Ye Z Xing C Y Foo J Li and N Kapre ldquoLearning to extractapi mentions from informal natural language discussionsrdquo in Soft-ware Maintenance and Evolution (ICSME) 2016 IEEE InternationalConference on IEEE 2016 pp 389ndash399

[3] B Dagenais and M P Robillard ldquoRecovering traceability linksbetween an api and its learning resourcesrdquo in Software Engineering(ICSE) 2012 34th International Conference on IEEE 2012 pp 47ndash57

[4] S Subramanian L Inozemtseva and R Holmes ldquoLive api doc-umentationrdquo in Proceedings of the 36th International Conference onSoftware Engineering ACM 2014 pp 643ndash652

[5] C Chen Z Xing and X Wang ldquoUnsupervised software-specificmorphological forms inference from informal discussionsrdquo in Pro-ceedings of the 39th International Conference on Software EngineeringIEEE Press 2017 pp 450ndash461

[6] X Chen C Chen D Zhang and Z Xing ldquoSethesaurus Wordnetin software engineeringrdquo IEEE Transactions on Software Engineer-ing 2019

[7] H Li S Li J Sun Z Xing X Peng M Liu and X Zhao ldquoIm-proving api caveats accessibility by mining api caveats knowledgegraphrdquo in Software Maintenance and Evolution (ICSME) 2018 IEEEInternational Conference on IEEE 2018

[8] Z X D L X W Qiao Huang Xin Xia ldquoApi method recom-mendation without worrying about the task-api knowledge gaprdquoin Automated Software Engineering (ASE) 2018 33th IEEEACMInternational Conference on IEEE 2018

[9] B Xu Z Xing X Xia and D Lo ldquoAnswerbot automated gen-eration of answer summary to developersz technical questionsrdquoin Proceedings of the 32nd IEEEACM International Conference onAutomated Software Engineering IEEE Press 2017 pp 706ndash716

[10] D Ye Z Xing C Y Foo Z Q Ang J Li and N Kapre ldquoSoftware-specific named entity recognition in software engineering socialcontentrdquo in 2016 IEEE 23rd International Conference on SoftwareAnalysis Evolution and Reengineering (SANER) vol 1 IEEE 2016pp 90ndash101

[11] E F Tjong Kim Sang and F De Meulder ldquoIntroduction to the conll-2003 shared task Language-independent named entity recogni-tionrdquo in Proceedings of the seventh conference on Natural languagelearning at HLT-NAACL 2003-Volume 4 Association for Compu-tational Linguistics 2003 pp 142ndash147

[12] R Caruana ldquoLearning many related tasks at the same time withbackpropagationrdquo in Advances in neural information processing sys-tems 1995 pp 657ndash664

[13] A Arnold R Nallapati and W W Cohen ldquoExploiting featurehierarchy for transfer learning in named entity recognitionrdquo Pro-ceedings of ACL-08 HLT pp 245ndash253 2008

[14] J Yosinski J Clune Y Bengio and H Lipson ldquoHow transferableare features in deep neural networksrdquo in Advances in neuralinformation processing systems 2014 pp 3320ndash3328

[15] Z X Deheng Ye Lingfeng Bao and S-W Lin ldquoApireal An apirecognition and linking approach for online developer forumsrdquoEmpirical Software Engineering 2018 2018

[16] C dos Santos and M Gatti ldquoDeep convolutional neural networksfor sentiment analysis of short textsrdquo in Proceedings of COLING2014 the 25th International Conference on Computational LinguisticsTechnical Papers 2014 pp 69ndash78

[17] C D Santos and B Zadrozny ldquoLearning character-level repre-sentations for part-of-speech taggingrdquo in Proceedings of the 31stInternational Conference on Machine Learning (ICML-14) 2014 pp1818ndash1826

[18] G Chen C Chen Z Xing and B Xu ldquoLearning a dual-languagevector space for domain-specific cross-lingual question retrievalrdquoin 2016 31st IEEEACM International Conference on Automated Soft-ware Engineering (ASE) IEEE 2016 pp 744ndash755

[19] C Chen X Chen J Sun Z Xing and G Li ldquoData-driven proac-tive policy assurance of post quality in community qampa sitesrdquoProceedings of the ACM on human-computer interaction vol 2 noCSCW p 33 2018

[20] I Bazzi ldquoModelling out-of-vocabulary words for robust speechrecognitionrdquo PhD dissertation Massachusetts Institute of Tech-nology 2002

[21] C Chen S Gao and Z Xing ldquoMining analogical libraries in qampadiscussionsndashincorporating relational and categorical knowledgeinto word embeddingrdquo in 2016 IEEE 23rd international conferenceon software analysis evolution and reengineering (SANER) vol 1IEEE 2016 pp 338ndash348

[22] C Chen and Z Xing ldquoSimilartech automatically recommendanalogical libraries across different programming languagesrdquo in2016 31st IEEEACM International Conference on Automated SoftwareEngineering (ASE) IEEE 2016 pp 834ndash839

[23] C Chen Z Xing and Y Liu ldquoWhats spains paris mining analogi-cal libraries from qampa discussionsrdquo Empirical Software Engineeringvol 24 no 3 pp 1155ndash1194 2019

[24] Y Huang C Chen Z Xing T Lin and Y Liu ldquoTell them apartdistilling technology differences from crowd-scale comparisondiscussionsrdquo in ASE 2018 pp 214ndash224

[25] O Levy and Y Goldberg ldquoNeural word embedding as implicitmatrix factorizationrdquo in Advances in neural information processingsystems 2014 pp 2177ndash2185

[26] D Tang F Wei N Yang M Zhou T Liu and B Qin ldquoLearningsentiment-specific word embedding for twitter sentiment classifi-cationrdquo in Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) vol 1 2014pp 1555ndash1565

[27] O Levy Y Goldberg and I Dagan ldquoImproving distributional sim-ilarity with lessons learned from word embeddingsrdquo Transactionsof the Association for Computational Linguistics vol 3 pp 211ndash2252015

[28] J Pennington R Socher and C Manning ldquoGlove Global vectorsfor word representationrdquo in Proceedings of the 2014 conference onempirical methods in natural language processing (EMNLP) 2014 pp1532ndash1543

[29] I Sutskever O Vinyals and Q V Le ldquoSequence to sequencelearning with neural networksrdquo in Advances in neural informationprocessing systems 2014 pp 3104ndash3112

[30] C Chen Z Xing and Y Liu ldquoBy the community amp for the com-munity a deep learning approach to assist collaborative editing inqampa sitesrdquo Proceedings of the ACM on Human-Computer Interactionvol 1 no CSCW p 32 2017

[31] S Gao C Chen Z Xing Y Ma W Song and S-W Lin ldquoA neuralmodel for method name generation from functional descriptionrdquo

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 15

in 2019 IEEE 26th International Conference on Software AnalysisEvolution and Reengineering (SANER) IEEE 2019 pp 414ndash421

[32] X Wang C Chen and Z Xing ldquoDomain-specific machine trans-lation with recurrent neural network for software localizationrdquoEmpirical Software Engineering pp 1ndash32 2019

[33] C Chen Z Xing Y Liu and K L X Ong ldquoMining likelyanalogical apis across third-party libraries via large-scale unsu-pervised api semantics embeddingrdquo IEEE Transactions on SoftwareEngineering 2019

[34] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45 no 11pp 2673ndash2681 1997

[35] Z Huang W Xu and K Yu ldquoBidirectional lstm-crf models forsequence taggingrdquo arXiv preprint arXiv150801991 2015

[36] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural computation vol 9 no 8 pp 1735ndash1780 1997

[37] H Sak A Senior and F Beaufays ldquoLong short-term memoryrecurrent neural network architectures for large scale acousticmodelingrdquo in Fifteenth annual conference of the international speechcommunication association 2014

[38] Y Bengio ldquoDeep learning of representations for unsupervised andtransfer learningrdquo in Proceedings of ICML Workshop on Unsupervisedand Transfer Learning 2012 pp 17ndash36

[39] S J Pan Q Yang et al ldquoA survey on transfer learningrdquo[40] N Srivastava G Hinton A Krizhevsky I Sutskever and

R Salakhutdinov ldquoDropout A simple way to prevent neural net-works from overfittingrdquo The Journal of Machine Learning Researchvol 15 no 1 pp 1929ndash1958 2014

[41] D P Kingma and J Ba ldquoAdam A method for stochastic optimiza-tionrdquo arXiv preprint arXiv14126980 2014

[42] A Bacchelli M Lanza and R Robbes ldquoLinking e-mails andsource code artifactsrdquo in Proceedings of the 32nd ACMIEEE Inter-national Conference on Software Engineering-Volume 1 ACM 2010pp 375ndash384

[43] A Bacchelli M DrsquoAmbros M Lanza and R Robbes ldquoBench-marking lightweight techniques to link e-mails and source coderdquoin Reverse Engineering 2009 WCRErsquo09 16th Working Conference onIEEE 2009 pp 205ndash214

[44] S J Pan and Q Yang ldquoA survey on transfer learningrdquo IEEETransactions on knowledge and data engineering vol 22 no 10 pp1345ndash1359 2010

[45] M Oquab L Bottou I Laptev and J Sivic ldquoLearning and transfer-ring mid-level image representations using convolutional neuralnetworksrdquo in Computer Vision and Pattern Recognition (CVPR) 2014IEEE Conference on IEEE 2014 pp 1717ndash1724

[46] S Ravi and H Larochelle ldquoOptimization as a model for few-shotlearningrdquo 2016

[47] A Marcus J Maletic et al ldquoRecovering documentation-to-source-code traceability links using latent semantic indexingrdquo in Soft-ware Engineering 2003 Proceedings 25th International Conference onIEEE 2003 pp 125ndash135

[48] H-Y Jiang T N Nguyen X Chen H Jaygarl and C K ChangldquoIncremental latent semantic indexing for automatic traceabil-ity link evolution managementrdquo in Proceedings of the 2008 23rdIEEEACM International Conference on Automated Software Engineer-ing IEEE Computer Society 2008 pp 59ndash68

[49] W Zheng Q Zhang and M Lyu ldquoCross-library api recommen-dation using web search enginesrdquo in Proceedings of the 19th ACMSIGSOFT symposium and the 13th European conference on Foundationsof software engineering ACM 2011 pp 480ndash483

[50] Q Gao H Zhang J Wang Y Xiong L Zhang and H Mei ldquoFixingrecurring crash bugs via analyzing qampa sites (t)rdquo in AutomatedSoftware Engineering (ASE) 2015 30th IEEEACM International Con-ference on IEEE 2015 pp 307ndash318

[51] C Treude and M P Robillard ldquoAugmenting api documentationwith insights from stack overflowrdquo in Software Engineering (ICSE)2016 IEEEACM 38th International Conference on IEEE 2016 pp392ndash403

[52] P C Rigby and M P Robillard ldquoDiscovering essential codeelements in informal documentationrdquo in Proceedings of the 2013International Conference on Software Engineering IEEE Press 2013pp 832ndash841

[53] L Moonen ldquoGenerating robust parsers using island grammarsrdquoin Reverse Engineering 2001 Proceedings Eighth Working Conferenceon IEEE 2001 pp 13ndash22

[54] A Bacchelli A Cleve M Lanza and A Mocci ldquoExtracting struc-tured data from natural language documents with island parsingrdquo

in Automated Software Engineering (ASE) 2011 26th IEEEACMInternational Conference on IEEE 2011 pp 476ndash479

[55] Y Kim Y Jernite D Sontag and A M Rush ldquoCharacter-awareneural language modelsrdquo in AAAI 2016 pp 2741ndash2749

[56] D Lukovnikov A Fischer J Lehmann and S Auer ldquoNeuralnetwork-based question answering over knowledge graphs onword and character levelrdquo in Proceedings of the 26th internationalconference on World Wide Web International World Wide WebConferences Steering Committee 2017 pp 1211ndash1220

[57] W Ling I Trancoso C Dyer and A W Black ldquoCharacter-basedneural machine translationrdquo arXiv preprint arXiv151104586 2015

[58] X Zhang J Zhao and Y LeCun ldquoCharacter-level convolutionalnetworks for text classificationrdquo in Advances in neural informationprocessing systems 2015 pp 649ndash657

[59] C D Santos and B Zadrozny ldquoLearning character-level repre-sentations for part-of-speech taggingrdquo in Proceedings of the 31stInternational Conference on Machine Learning (ICML-14) 2014 pp1818ndash1826

[60] J P Chiu and E Nichols ldquoNamed entity recognition with bidirec-tional lstm-cnnsrdquo arXiv preprint arXiv151108308 2015

[61] G Lample M Ballesteros S Subramanian K Kawakami andC Dyer ldquoNeural architectures for named entity recognitionrdquoarXiv preprint arXiv160301360 2016

Suyu Ma is a research assistant in the Facultyof Information Technology at Monash UniversityHe has research interest in the areas of soft-ware engineering Deep learning and Human-computer Interaction He is currently focusing onimproving the usability and accessibility of mo-bile application He received the BS and MSdegrees from Beijing Technology and BusinessUniversity and the Australian National Universityin 2016 and 2018 respectively And he will bea PhD student at Monash University in 2020

under the supervision of Chunyang Chen

Zhenchang Xing is a Senior Lecturer in the Re-search School of Computer Science AustralianNational University Previously he was an Assis-tant Professor in the School of Computer Sci-ence and Engineering Nanyang TechnologicalUniversity Singapore from 2012-2016 Beforejoining NTU Dr Xing was a Lee Kuan Yew Re-search Fellow in the School of Computing Na-tional University of Singapore from 2009-2012Dr Xings current research area is in the interdis-plinary areas of software engineering human-

computer interaction and applied data analytics Dr Xing has over 100publications in peer-refereed journals and conference proceedings andhave received several distinguished paper awards from top softwareengineering conferences Dr Xing regularly serves on the organizationand program committees of the top software engineering conferencesand he will be the program committee co-chair for ICSME2020

Chunyang Chen is a lecturer (Assistant Pro-fessor) in Faculty of Information TechnologyMonash University Australia His research fo-cuses on software engineering deep learningand human-computer interaction He has pub-lished over 16 papers in referred journals orconferences including Empirical Software Engi-neering ICSE ASE CSCW ICSME SANERHe is a member of IEEE and ACM He receivedACM SIGSOFT distinguished paper award inASE 2018 best paper award in SANER 2016

and best tool demo in ASE 2016 httpschunyang-chengithubio

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 16

Cheng Chen received his Bachelor degree inSoftware Engineering from Northwest University(China) and received Master degree of Com-puting (major in Artificial Intelligence) from theAustralian National University He did some in-ternship projects about Natural Language Pro-cessing at CSIRO from 2016 to 2017 He workedas a research assistant supervised by Dr Zhen-chang Xing at ANU in 2018 Cheng currentlyworks in PricewaterhouseCoopers (PwC) Firmas a senior algorithm engineer of Natural Lan-

guage Processing and Data Analyzing Cheng is interested in NamedEnti ty Extraction Relation Extraction Text summarization and parallelcomputing He is working on knowledge engineering and transaction forNLP tasks

Lizhen Qu is a research fellow in the DialogueResearch lab in the Faculty of Information Tech-nology at Monash University He has exten-sive research experience in the areas of naturallanguage processing multimodal learning deeplearning and Cybersecurity He is currently fo-cusing on information extraction semantic pars-ing and multimodal dialogue systems Prior tojoining Monash University he worked as a re-search scientist at Data61CSIRO where he ledand participated in several research and indus-

trial projects including Deep Learning for Text and Deep Learning forCyber

Guoqiang Li is now an associate professor inschool of software Shanghai Jiao Tong Univer-sity and a guest associate professor in KyushuUniversity He received the BS MS and PhDdegrees from Taiyuan University of TechnologyShanghai Jiao Tong University and Japan Ad-vanced Institute of Science and Technology in2001 2005 and 2008 respectively He workedas a postdoctoral research fellow in the graduateschool of information science Nagoya Univer-sity Japan during 2008-2009 as an assistant

professor in the school of software Shanghai Jiao Tong Universityduring 2009-2013 and as an academic visitor in the department ofcomputer science University of Oxford during 2015-2016 His researchinterests include formal verification programming language theory anddata analytics and intelligence He published more than 40 researchespapers in the international journals and mainstream conferences includ-ing TDSC SPE TECS IJFCS SCIS CSCW ICSE FORMATS ATVAetc

Page 7: JOURNAL OF LA Easy-to-Deploy API Extraction by Multi-Level ... · sufficient high-quality training data for APIs of some less frequently discussed libraries or frameworks. Another

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 7

transferred across different languages and libraries with transfer learning.

4.4 API Labeling by Softmax Classification

Given the Bi-LSTM's output vector $e^{sent}_w$ for a token $w$ in the input text, we train a binary softmax classifier to predict whether the token is an API mention or a normal English word. That is, we have two classes in this work, i.e., API or non-API. Softmax predicts the probability of the $j$-th class given a token's vector $e^{sent}_w$ by

$$P(j \mid e^{sent}_w) = \frac{\exp(e^{sent}_w W_j^T)}{\sum_{k=1}^{2} \exp(e^{sent}_w W_k^T)}$$

where the vectors $W_k$ ($k = 1$ or $2$) are parameters to be learned.
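For illustration, this binary softmax classifier amounts to a single linear layer followed by a softmax over two classes. The following minimal sketch assumes a PyTorch implementation; the class and variable names are ours, not from the released code:

    import torch
    import torch.nn as nn

    class APILabeler(nn.Module):
        """Binary softmax classifier over the Bi-LSTM output vector e_sent_w."""
        def __init__(self, dim=100, num_classes=2):  # 100-d Bi-LSTM output; classes: API / non-API
            super().__init__()
            self.W = nn.Linear(dim, num_classes, bias=False)  # rows act as W_1 and W_2

        def forward(self, e_sent_w):              # shape: (num_tokens, 100)
            logits = self.W(e_sent_w)             # e_sent_w @ W^T
            return torch.softmax(logits, dim=-1)  # P(API | token), P(non-API | token)

    labeler = APILabeler()
    probs = labeler(torch.randn(5, 100))          # class probabilities for 5 tokens

In practice one would train on the raw logits with a cross-entropy loss; the explicit softmax here simply mirrors the formula above.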

4.5 Library Adaptation by Transfer Learning

Transfer learning [12], [38] is a machine learning method that stores knowledge gained while solving one problem and applies it to a different but related problem. It helps to reduce the amount of training data required for the target problem. When transfer learning is used in deep learning, it has been shown to be beneficial for narrowing down the scope of possible models on a task by using the weights of a model trained on a different but related task [14], and with shared parameters, transfer learning can help the model adapt the shared knowledge to a similar context [39].

API extraction tasks for different libraries can be regarded as a set of different but related tasks. Therefore, we use transfer learning to adapt a neural model trained on one library's text (the source-library-trained model) to another library's text (the target library). Our neural architecture is composed of four main components: character-level CNN, word embeddings, sentence Bi-LSTM, and softmax classifier. We can transfer all or some source-library-trained model components to a target-library model. Without transfer learning, the parameters of a target-library model component are randomly initialized and then learned using the target-library training data, i.e., trained from scratch. With transfer learning, we use the parameters of the source-library-trained model components to initialize the parameters of the corresponding target-library model components. After transferring the model parameters, we can either freeze the transferred parameters or fine-tune them using the target-library training data.
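A minimal sketch of this transfer procedure, again assuming PyTorch; the APIExtractor stand-in and its attribute names are illustrative, not the released implementation:

    import torch
    import torch.nn as nn

    class APIExtractor(nn.Module):
        """Stand-in for the four components of our architecture."""
        def __init__(self):
            super().__init__()
            self.char_cnn = nn.Conv1d(92, 40, kernel_size=3, padding=1)
            self.word_emb = nn.Embedding(40000, 200)
            self.bi_lstm = nn.LSTM(240, 50, bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(100, 2)

    source = APIExtractor()          # assume this was trained on the source library
    target = APIExtractor()
    target.load_state_dict(source.state_dict())   # initialize target with source parameters

    # Freeze a transferred component, or leave it trainable to fine-tune it
    # on the target-library training data.
    for p in target.char_cnn.parameters():
        p.requires_grad = False

    optimizer = torch.optim.Adam(
        [p for p in target.parameters() if p.requires_grad])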

5 SYSTEM IMPLEMENTATION

This section describes the current implementation2 of our neural architecture for API extraction.

5.1 Preprocessing and Tokenizing Input Text

Our current implementation takes as input the content of a Stack Overflow post. As we want to recognize API mentions within natural language sentences, we remove stand-alone code snippets in <pre><code> tags. We then remove all HTML tags in the post content to obtain the plain-text input (see Section 3 for the justification of taking plain text as input).

2. https://github.com/JOJO201/API_Extraction

We develop a sentence parser to split the preprocessed post text into sentences by punctuation. Following [2], we develop a software-specific tokenizer to tokenize the sentences. This tokenizer preserves the integrity of code-like tokens. For example, it treats matplotlib.pyplot.imshow() as a single token, instead of the sequence of 7 tokens "matplotlib", ".", "pyplot", ".", "imshow", "(", ")" produced by general English tokenizers.
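The sketch below illustrates this preprocessing and software-specific tokenization. The regular expressions are a simplification of our own; they are not the exact rules of the tokenizer in [2]:

    import re

    # Code-like tokens (dotted names, optionally ending in "()") are kept intact.
    CODE_TOKEN = re.compile(r"[A-Za-z_][\w.]*\(\)?|[A-Za-z_]\w*(?:\.\w+)+|\w+|[^\w\s]")

    def preprocess(post_html):
        # Drop stand-alone code snippets, then strip remaining HTML tags.
        text = re.sub(r"<pre><code>.*?</code></pre>", " ", post_html, flags=re.S)
        text = re.sub(r"<[^>]+>", " ", text)
        # Naive sentence split by punctuation.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        # Software-specific tokenization that preserves code-like tokens.
        return [CODE_TOKEN.findall(s) for s in sentences if s.strip()]

    print(preprocess("<p>Use <code>matplotlib.pyplot.imshow()</code> to show.</p>"))
    # [['Use', 'matplotlib.pyplot.imshow()', 'to', 'show', '.']]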

5.2 Model Configuration

We now describe the hyper-parameter settings used in our current implementation. These hyper-parameters are tuned using the validation data during model training (see Section 6.1.2 for the description of our dataset). We find that the neural model has very similar performance across the six libraries' datasets with the same hyper-parameters. Therefore, we keep the same hyper-parameters for the six libraries' datasets, which also avoids the difficulty of scaling different hyper-parameters.

5.2.1 Character-level CNN

We set the filter window size h = 3. That is, the convolution operation extracts local features from 3 adjacent characters at a time. The size of our current character vocabulary $V^{char}$ is 92. Thus, we set the dimensionality of the character embedding $d^{char}$ at 92. We initialize the character embedding with a one-hot vector (1 at one character index and zero in all other dimensions). The character embeddings are updated through back propagation during the training of the character-level CNN. We experiment with 5 different values of N (the number of filters): 20, 40, 60, 80, 100. With N = 40, the CNN has an acceptable performance on the validation data. With N = 60 and above, the CNN has almost the same performance as N = 40, but it takes more training epochs to reach the acceptable performance. Therefore, we use N = 40. That is, the character-level embedding of a token has the dimension 40.
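Under these hyper-parameters, the character-level CNN can be sketched as follows (PyTorch assumed; max-over-time pooling over the 40 filter outputs yields the 40-dimensional token embedding):

    import torch
    import torch.nn as nn

    class CharCNN(nn.Module):
        def __init__(self, vocab_size=92, d_char=92, num_filters=40, window=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_char)
            with torch.no_grad():                     # one-hot initialization,
                self.embed.weight.copy_(torch.eye(vocab_size, d_char))  # updated in training
            self.conv = nn.Conv1d(d_char, num_filters, kernel_size=window, padding=1)

        def forward(self, char_ids):                  # (num_tokens, max_token_length)
            x = self.embed(char_ids).transpose(1, 2)  # (num_tokens, d_char, length)
            x = torch.relu(self.conv(x))              # (num_tokens, 40, length)
            return x.max(dim=2).values                # 40-d embedding per token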

5.2.2 Pre-trained word embeddings

Our experiments involve six libraries: Pandas, Matplotlib, NumPy, OpenGL, React and JDBC. We collect all questions tagged with these six libraries and all answers to these questions in the Stack Overflow Data Dump released on March 18, 2018. We obtain a text corpus of 380,971 posts. We use the same preprocessing and tokenization steps as described in Section 5.1 to preprocess and tokenize the content of these posts. Then we use GloVe [28] to learn word embeddings from this text corpus. We set the word vocabulary size $|V^{word}|$ at 40,000. The training epoch is set at 100 to ensure the sufficient training of word embeddings. We experiment with four different dimensions of word embeddings $d^{word}$: 50, 100, 200, 400. We use $d^{word} = 200$ in our current implementation, as it produces a good balance between the training efficiency and the quality of word embeddings for API extraction on the validation data. We also experiment with word embeddings pre-trained on all Stack Overflow posts, which does not significantly affect the API extraction performance but requires much longer time for text preprocessing and word embeddings learning.
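For illustration, the learned GloVe vectors can be loaded into a word-to-vector table as sketched below; the file name is hypothetical:

    import numpy as np

    def load_glove(path="glove_stackoverflow_200d.txt", vocab=None):
        """Load GloVe vectors from the standard text format: word v1 v2 ... vd."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
                if vocab is None or word in vocab:
                    vectors[word] = vec
        # Out-of-vocabulary words fall back to a zero (or random) vector.
        return vectors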


5.2.3 Sentence-context Bi-LSTM

For the RNN, we use 50 hidden LSTM units to store the hidden states. The dimension of the hidden state vector is 50. Therefore, the dimension of the output vector $e^{sent}_w$ is 100 (concatenating forward and backward hidden states). In order to mitigate overfitting [40], a dropout layer is added on the output of the Bi-LSTM, with the dropout rate 0.5.
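A sketch of this component with the stated dimensions (PyTorch assumed):

    import torch
    import torch.nn as nn

    bi_lstm = nn.LSTM(input_size=240,   # 200-d word embedding + 40-d char-CNN feature
                      hidden_size=50,   # 50 hidden units per direction
                      bidirectional=True, batch_first=True)
    dropout = nn.Dropout(p=0.5)         # mitigates overfitting on the 100-d outputs

    tokens = torch.randn(1, 12, 240)    # a 12-token sentence
    out, _ = bi_lstm(tokens)            # (1, 12, 100): forward and backward states concatenated
    out = dropout(out)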

5.3 Model Training

To train our neural model, we use the input text and its corresponding sequence of API/non-API labels (see Section 6.1.2 for our data labeling process). The optimizer used is Adam [41], which performs well for training RNNs, including Bi-LSTMs [35]. The training epoch is set at 40. The best-performing model on the validation set is saved for testing. This model's parameters are also saved for the transfer learning experiments.
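The training procedure can be sketched as follows; evaluate_f1 is a caller-supplied helper that computes the validation F1-score (see Section 6.1.3), and the checkpoint file name is our own choice:

    import copy
    import torch

    def train(model, train_loader, val_loader, evaluate_f1, epochs=40):
        """evaluate_f1(model, loader) -> validation F1-score, supplied by the caller."""
        opt = torch.optim.Adam(model.parameters())   # Adam, as in our implementation
        loss_fn = torch.nn.CrossEntropyLoss()
        best_f1, best_state = -1.0, None
        for _ in range(epochs):                      # 40 training epochs
            model.train()
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
            f1 = evaluate_f1(model, val_loader)
            if f1 > best_f1:                         # keep the best model on validation
                best_f1 = f1
                best_state = copy.deepcopy(model.state_dict())
        model.load_state_dict(best_state)
        torch.save(best_state, "best_model.pt")      # reused in the transfer learning experiments
        return model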

6 EXPERIMENTS

We conduct a series of experiments to answer the following four research questions:

• RQ1: How well can our neural architecture learn multi-level features from the input texts? Can the learned features support high-quality API extraction compared with existing machine learning based API extraction methods?

• RQ2: What is the impact of the three feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) on the API extraction performance?

• RQ3: How effective is transfer learning for adapting API extraction models across libraries of the same language with different amounts of target-library training data?

• RQ4: How effective is transfer learning for adapting API extraction models across libraries of different languages with different amounts of target-library training data?

6.1 Experiments Setup

This section describes the libraries used in our experiments, how we prepare the training and testing data, and the evaluation metrics of API extraction performance.

6.1.1 Studied libraries

Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As reported in Section 2, these six libraries support very diverse functionalities for computer programming and have distinct API-naming and API-mention characteristics. Using these libraries, we can evaluate the effectiveness of our neural architecture for API extraction in very diverse data settings. We can also gain insights into the effectiveness of transfer learning for API extraction and the transferability of our neural architecture in different settings.

6.1.2 Dataset

We collect Stack Overflow posts (questions and their answers) tagged with pandas, numpy, matplotlib, opengl, react or jdbc as our experimental data. We use the Stack Overflow Data Dump released on March 18, 2018. In this data dump,

TABLE 3: Basic Statistics of Our Dataset

Library      Posts   Sentences   API mentions   Tokens
Matplotlib     600        4920           1481    47317
NumPy          600        2786           1552    39321
Pandas         600        3522           1576    42267
OpenGL         600        3486           1674    70757
JDBC           600        4205           1184    50861
React          600        3110           1262    42282
Total         3600       22029           8729   292805

380,971 posts are tagged with one of the six studied libraries. Our data collection process follows four criteria. First, the number of posts selected and the number of API mentions in these posts should be at the same order of magnitude. Second, API mentions should exhibit the variations of API writing forms commonly seen on Stack Overflow. Third, the selection should avoid repeatedly selecting frequently mentioned APIs. Fourth, the same post should not appear in the dataset for different libraries (one post may be tagged with two or more studied-library tags). The first criterion ensures the fairness of comparison across libraries, the second and third criteria ensure the representativeness of API mentions in the dataset, and the fourth criterion ensures that there is no repeated data in different datasets.

We finally include 3,600 posts (600 for each library) in our dataset, and each post has at least one API mention3. Table 3 summarizes the basic statistics of our dataset. These posts have in total 22,029 sentences, 292,805 token occurrences and 41,486 unique tokens after data preprocessing and tokenization. We manually label the API mentions in the selected posts. 6,421 sentences (34.07%) contain at least one API mention. Our dataset has in total 8,729 API mentions. The selected Matplotlib, NumPy, Pandas, OpenGL, JDBC and React posts have 1481, 1552, 1576, 1674, 1184 and 1262 API mentions, which refer to 553, 694, 293, 201, 282 and 83 unique APIs of the six libraries, respectively. In addition, we randomly sample 100 labelled Stack Overflow posts for each library dataset. Then we examine each API mention to determine whether it is a simple name or not, and we find that among the 600 posts there are 1,634 API mentions, of which 694 (42.47%) are simple API names. Our dataset not only contains a large number of API mentions in diverse writing forms, but also contains rich discussion context around API mentions. This makes it suitable for the training of our neural architecture, the testing of its performance and the study of model transferability.

We randomly split the 600 posts of each library into three subsets by a 6:2:2 ratio: 60% as training data, 20% as validation data for tuning model hyper-parameters, and 20% as testing data to answer our research questions. Since cross-validation would cost a great deal of time and our training dataset is unbiased (proved in Section 6.4), cross-validation is not used. Our experiment results also show that the training dataset is enough to train a high-performance neural network. Besides, the performance of the model on a target library can be improved by adding more training data from other libraries. And if more training data is added, there is no need to label a new validation dataset for the target library to re-tune the hyper-parameters, because we find that the neural model has very similar performance with the same hyper-parameters under different dataset settings.

3. https://drive.google.com/drive/folders/1f7ejNVUsew9l9uPCj4Xv5gMzNbqttpoa?usp=sharing
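A sketch of this 6:2:2 split; the shuffling and seed are our assumptions, as the paper does not specify them:

    import random

    def split_posts(posts, seed=0):
        posts = list(posts)
        random.Random(seed).shuffle(posts)
        n = len(posts)
        return (posts[:int(0.6 * n)],               # 60% training
                posts[int(0.6 * n):int(0.8 * n)],   # 20% validation
                posts[int(0.8 * n):])               # 20% testing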


TABLE 4: Comparison of CRF Baselines [2] and Our Neural Model for API Extraction

             Basic CRF                Full CRF                 Our method
Library      Prec    Recall  F1       Prec    Recall  F1       Prec    Recall  F1
Matplotlib   67.35   47.84   56.62    89.62   61.31   72.82    81.50   83.92   82.70
NumPy        75.83   39.91   52.29    89.21   49.31   63.51    78.24   62.91   69.74
Pandas       62.06   70.49   66.01    97.11   71.21   82.16    82.73   85.30   82.80
OpenGL       43.91   91.87   59.42    94.57   70.62   80.85    85.83   83.74   84.77
JDBC         15.83   82.61   26.58    87.32   51.40   64.71    84.69   55.31   66.92
React        16.74   88.58   28.16    97.42   70.11   81.53    87.95   78.17   82.77


6.1.3 Evaluation metrics

We use precision, recall and F1-score to evaluate the performance of an API extraction method. Precision measures what percentage of the API mentions recognized by a method are correct. Recall measures what percentage of the API mentions in the testing dataset are recognized correctly by a method. F1-score is the harmonic mean of precision and recall, i.e., $2 \times (precision \times recall)/(precision + recall)$.
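For concreteness, these metrics can be computed as follows; representing API mentions as (sentence id, token index) pairs is our own simplification:

    def precision_recall_f1(predicted, actual):
        """predicted/actual: sets of (sentence_id, token_index) API mentions."""
        tp = len(predicted & actual)                       # true positives
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(actual) if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
        return precision, recall, f1

    print(precision_recall_f1({(0, 1), (0, 4), (2, 3)}, {(0, 1), (2, 3), (5, 0)}))
    # -> approximately (0.67, 0.67, 0.67)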

6.2 Performance of Our Neural Model (RQ1)

Motivation. Recently, linear-chain CRF has been used to solve the API extraction problem in informal text [2]. It shows that machine-learning based API extraction outperforms several commonly-used rule-based methods [2], [42], [43]. The approach in [2] relies on human-defined orthographic features, two different unsupervised language models (class-based Brown clustering and neural-network based word embedding) and API gazetteer features (API inventory). In contrast, our neural architecture uses neural networks to automatically learn character-, word- and sentence-level features from the input texts. The first RQ is to confirm the effectiveness of these neural-network feature extractors for API extraction tasks by comparing the overall performance of our neural architecture with the performance of the linear-chain CRF with human-defined features for API extraction.

Approach. We use the implementation of the CRF model in [2] to build the two CRF baselines. The basic CRF baseline uses only the orthographic features developed in [2]. The full CRF baseline uses all orthographic, word-cluster and API gazetteer features developed in [2]. The basic CRF baseline is easy to deploy because it uses only the orthographic features in the input texts, but not any advanced word-cluster and API gazetteer features. And the self-training process is not used in these two baselines, since we have sufficient training data. In contrast, the full CRF baseline uses advanced features, so it is not as easy to deploy as the basic CRF. But the full CRF has much better performance than the basic CRF, as reported in [2]. Comparing our model with these two baselines, we can understand whether our model can achieve a good tradeoff between ease of deployment and high performance. Note that as Ye et al. [2] already demonstrates the superior performance of machine-learning based API extraction methods over rule-based methods, we do not consider rule-based baselines for API extraction in this study.

Results. From Table 4, we can see that:

• Although linear CRF with full features has close performance to our model, linear CRF with only orthographic features performs poorly in distinguishing API tokens from non-API tokens. The best F1-score of the basic CRF is only 0.66, for Pandas. The F1-score of the basic CRF for Matplotlib, NumPy and OpenGL is 0.52-0.59. For JDBC and React, the F1-score of the basic CRF is below 0.3. Our results are consistent with the results in [2] when advanced word-cluster and API-gazetteer features are ablated. Compared with the basic CRF, the full CRF performs much better. For Pandas, JDBC and React, the full CRF has very close performance to our model. But for Matplotlib, NumPy and OpenGL, our model still outperforms the full CRF by at least four points in F1-score.

• Multi-level feature embedding by our neural architecture is effective in distinguishing API tokens from non-API tokens in the resulting embedding space. All evaluation metrics of our neural architecture are significantly higher than those of the basic CRF. Although our neural architecture performs relatively worse on NumPy and JDBC, its F1-score is still much higher than the F1-score of the basic CRF on NumPy and JDBC. We examine false positives (non-API tokens recognized as APIs) and false negatives (API tokens recognized as non-APIs) by our neural architecture on NumPy and JDBC. We find that many false positives and false negatives involve API mentions composed of complex strings, for example, array expressions for NumPy or SQL statements for JDBC. It seems that neural networks learn some features from such complex strings that may confuse the model and cause the failure to tell apart API tokens from non-API ones. Furthermore, by analysing the classification results for different API mentions, we find that our model has good generalizability on different API functions.

With linear CRF for API extraction, we cannot achieve a good tradeoff between ease of deployment and high performance. In contrast, our neural architecture can extract effective character-, word- and sentence-level features from the input texts, and the extracted features alone can support high-quality API extraction without using any hand-crafted advanced features such as word clusters or API gazetteers.

6.3 Impact of Feature Extractors (RQ2)

Motivation. Although our neural architecture achieves very good performance as a whole, we would like to further investigate how much the different feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) contribute to this good performance, and how different features affect the precision and recall of API extraction.


TABLE 5: The Results of Feature Ablation Experiments

             Ablating char-CNN        Ablating Word Embeddings  Ablating Bi-LSTM          All features
             Prec    Recall  F1       Prec    Recall  F1        Prec    Recall  F1        Prec    Recall  F1
Matplotlib   75.70   72.32   73.98    81.75   63.99   71.79     82.59   69.34   75.44     81.50   83.92   82.70
NumPy        81.00   60.40   69.21    79.31   58.47   67.31     80.62   56.67   66.44     78.24   62.91   69.74
Pandas       80.50   83.28   81.82    81.77   68.01   74.25     83.22   75.22   79.65     82.73   85.30   82.80
OpenGL       77.58   85.37   81.29    83.07   68.03   75.04     98.52   72.08   83.05     85.83   83.74   84.77
JDBC         68.65   61.25   64.47    64.22   65.62   64.91     99.28   43.12   66.13     84.69   55.31   66.92
React        69.90   85.71   77.01    84.37   75.00   79.41     98.79   65.08   78.47     87.95   78.17   82.77

Approach. We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without the character-level CNN, one without word embeddings, and one without the sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors.

Results. In Table 5, we highlight in bold the largest drop in each metric for a library when ablating a feature, compared with that metric with all features. We underline the increase in each metric for a library when ablating a feature, compared with that metric with all features. We can see that:

• The performance of our neural architecture is contributed by the combined action of all its features. Ablating any of the features, the F1-score degrades. Ablating word embeddings or Bi-LSTM causes the largest drop in F1-score for four libraries, while ablating char-CNN causes the largest drop in F1-score for two libraries. Feature ablation has a higher impact on some libraries than others. For example, ablating char-CNN, word embeddings and Bi-LSTM all cause significant drops in F1-score for Matplotlib, and ablating word embeddings causes a significant drop in F1-score for Pandas and React. In contrast, ablating a feature causes a relatively minor drop in F1-score for NumPy, JDBC and React. This indicates that the different levels of features learned by our neural architecture can be more distinct for some libraries but more complementary for others. When different features are more distinct from one another, achieving good API extraction performance relies more on all the features.

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating char-CNN causes a drop in precision for five libraries, all except NumPy. The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating char-CNN, ablating word embeddings or Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for NumPy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating Bi-LSTM the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. Char-CNN is especially useful for filtering out non-API tokens, while word embeddings and Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation. As described in Section 6.1.1, we intentionally choose three different-functionality Python libraries: Pandas, Matplotlib, NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach. We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring a Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfer (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs involving Matplotlib are relatively more-distant-libraries transfer.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL)

        NP→MPL                  PD→MPL                  MPL
        Prec    Recall  F1      Prec    Recall  F1      Prec    Recall  F1
1/1     82.64   89.29   85.84   81.02   80.06   80.58   87.67   78.27   82.70
1/2     81.84   84.52   83.16   71.61   83.33   76.96   81.38   70.24   75.40
1/4     71.83   75.89   73.81   67.88   77.98   72.65   81.22   55.36   65.84
1/8     70.56   75.60   72.98   69.66   73.81   71.71   75.00   53.57   62.50
1/16    73.56   72.02   72.78   66.48   72.02   69.16   80.70   27.38   40.89
1/32    72.56   78.83   73.69   71.47   69.35   70.47   97.50   11.60   20.74
DU      72.54   66.07   69.16   76.99   54.76   64.00

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP)

        MPL→NP                  PD→NP                   NP
        Prec    Recall  F1      Prec    Recall  F1      Prec    Recall  F1
1/1     86.85   77.08   81.68   77.51   67.50   72.16   78.24   62.92   69.74
1/2     78.80   82.08   80.41   70.13   67.51   68.79   75.13   60.42   66.97
1/4     93.64   76.67   80.00   65.88   70.00   67.81   77.14   45.00   56.84
1/8     73.73   78.33   75.96   65.84   66.67   66.25   71.03   42.92   53.51
1/16    76.19   66.67   71.11   58.33   64.17   61.14   57.07   48.75   52.58
1/32    75.54   57.92   65.56   60.27   56.25   58.23   72.72   23.33   35.33
DU      64.16   65.25   64.71   62.96   59.32   61.08

We use gradually-reduced target-library training data (1/1 for all data, 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison.
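As an illustration of this protocol, the sketch below samples a 1/2^k fraction of the target-library training posts before fine-tuning; the sampling details are our assumption:

    import random

    def sample_fraction(train_posts, denominator, seed=0):
        """Return a 1/denominator subset of the target-library training posts."""
        posts = list(train_posts)
        random.Random(seed).shuffle(posts)
        return posts[:max(1, len(posts) // denominator)]

    # e.g., fine-tune with 1/16 of the target-library training data:
    # subset = sample_fraction(pandas_train, 16)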


TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL→PD                  NP→PD                   PD
        Prec    Recall  F1      Prec    Recall  F1      Prec    Recall  F1
1/1     84.97   89.62   87.23   86.86   87.61   87.23   80.43   85.30   82.80
1/2     86.18   84.43   85.30   83.97   83.00   83.48   88.01   74.06   80.43
1/4     81.07   87.61   84.21   81.07   82.70   81.88   76.50   76.95   76.72
1/8     87.54   80.98   84.13   85.76   76.37   80.79   69.30   83.29   75.65
1/16    82.04   78.96   80.47   82.45   75.79   78.98   84.21   50.72   53.31
1/32    81.31   75.22   78.14   81.43   72.05   76.45   84.25   30.84   45.14
DU      71.65   40.06   51.39   75.00   35.45   48.14

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL→PD                  NP→PD
        Prec    Recall  F1      Prec    Recall  F1
1/16    83.33   77.81   80.43   83.39   75.22   79.09
1/32    79.50   73.78   76.53   86.15   71.15   78.30

We also use the source-library-trained model directly without any fine-tuning (i.e., with no target-library training data) as a baseline. We also randomly select 1/16 and 1/32 of the training data for NumPy → Pandas and Matplotlib → Pandas 10 times, and train the model each time. Then we calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score.

Results. Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. Acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas), DU (Direct Use). The last column gives the results of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performance target-library model than training the target-library model from scratch with the same proportion of training data. This is evident as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has F1-score 69.74. However, the NumPy model transferred from the Matplotlib model has F1-score 81.68 (a 17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 of the NumPy training data for fine-tuning, the F1-score (71.11) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for a difficult dataset. Such a performance boost by transfer learning has also been observed in many studies [14], [44], [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model, rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has a better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of F1-score (e.g., 3% of the F1-score of the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 of the Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has F1-score 80.47.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 of the training data, i.e., about 10 posts, for transfer learning, the F1-score of the obtained target-library model is still improved by a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (DU row). Furthermore, the averaged results of NP → PD and MPL → PD over 10 randomly selected 1/16 and 1/32 training sets are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows that our training data is unbiased. Although the target-library models trained from scratch still have reasonable F1-scores with 1/2 to 1/8 training data, they become completely useless with 1/16 or 1/32 training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-scores in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16), compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce much more balanced precision and recall (thus much better F1-score) even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with F1-score > 80 using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences in API-naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning can be. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 of the target-library training data, compared with the target-library model trained from scratch using all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 training data, compared with the Matplotlib model trained from scratch using all training data.



• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the fewest polysemous APIs (16), and Pandas and NumPy have many more polysemous APIs. Both MPL → PD and MPL → NP have high-quality target models even with 1/8 of the target-library training data. The transfer learning seems able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 of the training data to have a high-quality target model (F1-score 83.16), and for PD → MPL, the Matplotlib model (F1-score 80.58) obtained by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.70). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs" learned from NumPy and Pandas to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation. In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

target-language text produces unacceptable performance For

TABLE 10: Python → Java

        Python→Java             Java
        Prec    Recall  F1      Prec    Recall  F1
1/1     77.45   66.56   71.60   84.69   55.31   66.92
1/2     72.20   62.50   67.06   77.38   53.44   63.22
1/4     69.26   50.00   58.08   71.22   47.19   56.77
1/8     50.00   64.06   56.16   75.15   39.69   51.94
1/16    55.71   48.25   52.00   75.69   34.06   46.98
1/32    56.99   40.00   44.83   77.89   23.12   35.66
DU      44.44   28.75   34.91

TABLE 11: Python → JavaScript

        Python→JavaScript       JavaScript
        Prec    Recall  F1      Prec    Recall  F1
1/1     77.45   81.75   86.19   87.95   78.17   82.77
1/2     86.93   76.59   81.43   87.56   59.84   77.70
1/4     86.84   68.65   74.24   83.08   66.26   73.72
1/8     81.48   61.11   69.84   85.98   55.95   67.88
1/16    71.11   63.49   68.08   87.38   38.48   53.44
1/32    66.67   52.38   58.67   65.21   35.71   46.15
DU      51.63   25.00   33.69

For the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16), PD → MPL (64.00), MPL → NP (64.71). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91), Python → JavaScript (33.69), Python → C (35.95). These results suggest that there is still a certain level of commonality between libraries with similar functionalities and of the same programming language, so that the knowledge learned in the source-library model can be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or we may exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning holds the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 of the target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 of the Java training data boosts the F1-score by 50%, compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 of the target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-language training data.


TABLE 12: Python → C

        Python→C                C
        Prec    Recall  F1      Prec    Recall  F1
1/1     89.87   85.09   87.47   85.83   83.74   84.77
1/2     85.35   83.73   84.54   86.28   76.69   81.21
1/4     83.83   75.88   79.65   82.19   71.27   76.34
1/8     74.32   73.71   74.01   78.68   68.02   72.97
1/16    75.81   69.45   72.60   88.52   57.10   69.42
1/32    69.62   65.85   67.79   87.89   45.25   59.75
DU      56.49   27.91   35.95

For within-language transfer learning, achieving this result may require as little as 1/8 of the target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is the API labeling errors in the dataset. In order to decrease errors, the first two authors first independently label the same data and then resolve the disagreements by discussion. Furthermore, in our experiments, we have to examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and to informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API-related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs such as camelcase, special characters (e.g., . or ()) and API annotations.

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF based approach for extracting mentions of software entities, such as programming languages, libraries, computing concepts and APIs, from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advancesof neural network techniques for NLP tasks For exampleboth RNNs and CNNs have been used to embed character-level features for question answering [55] [56] machinetranslation [57] text classification [58] and part-of-speechtagging [59] Some researchers also use word embeddingsand LSTMs for NER [35] [60] [61] To the best of our knowl-edge our neural architecture is the first machine learningbased API extraction method that combines these proposalsand customize them based on the characteristics of soft-ware texts and API names Furthermore the design of ourneural architecture also takes into account the deploymentoverhead of the API methods for multiple programminglanguages and libraries which has never been explored forAPI extraction tasks

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 14

8 CONCULSION

This paper presents a novel multi-layer neural architecturethat can effectively learn important character- word- andsentence-level features from informal software texts for ex-tracting API mentions in text The learned features has supe-rior performance than human-defined orthographic featuresfor API extraction in informal software texts This makes ourneural architecture easy to deploy because the only input itrequires is the texts to be processed In contrast existingmachine learning based API extraction methods have touse additional hand-crafted features such as word clustersor API gazetteers in order to achieve the performanceclose to that of our neural architecture Furthermore as thefeatures are automatically learned from the input texts ourneural architecture is easy to transfer and fine-tune acrossprogramming languages and libraries We demonstrate itstransferability across three Python libraries and across fourprogramming languages Our neural architecture togetherwith transfer learning makes it easy to train and deploya high-quality API extraction model for multiple program-ming languages and libraries with much less overall effortrequired for preparing training data and effective featuresIn the future we will further investigate the performanceand the transferability of our neural architecture in manyother programming languages and libraries moving to-wards real-world deployment of machine learning basedAPI extraction methods

REFERENCES

[1] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, "Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow."
[2] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, "Learning to extract API mentions from informal natural language discussions," in Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 2016, pp. 389-399.
[3] B. Dagenais and M. P. Robillard, "Recovering traceability links between an API and its learning resources," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 47-57.
[4] S. Subramanian, L. Inozemtseva, and R. Holmes, "Live API documentation," in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 643-652.
[5] C. Chen, Z. Xing, and X. Wang, "Unsupervised software-specific morphological forms inference from informal discussions," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 450-461.
[6] X. Chen, C. Chen, D. Zhang, and Z. Xing, "SEthesaurus: WordNet in software engineering," IEEE Transactions on Software Engineering, 2019.
[7] H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, "Improving API caveats accessibility by mining API caveats knowledge graph," in Software Maintenance and Evolution (ICSME), 2018 IEEE International Conference on. IEEE, 2018.
[8] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, "API method recommendation without worrying about the task-API knowledge gap," in Automated Software Engineering (ASE), 2018 33rd IEEE/ACM International Conference on. IEEE, 2018.
[9] B. Xu, Z. Xing, X. Xia, and D. Lo, "AnswerBot: automated generation of answer summary to developers' technical questions," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 706-716.
[10] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre, "Software-specific named entity recognition in software engineering social content," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 90-101.
[11] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. Association for Computational Linguistics, 2003, pp. 142-147.
[12] R. Caruana, "Learning many related tasks at the same time with backpropagation," in Advances in Neural Information Processing Systems, 1995, pp. 657-664.
[13] A. Arnold, R. Nallapati, and W. W. Cohen, "Exploiting feature hierarchy for transfer learning in named entity recognition," Proceedings of ACL-08: HLT, pp. 245-253, 2008.
[14] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320-3328.
[15] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, "APIReal: an API recognition and linking approach for online developer forums," Empirical Software Engineering, 2018.
[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69-78.
[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.
[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744-755.
[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.
[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.
[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions - incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338-348.
[22] C. Chen and Z. Xing, "SimilarTech: automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834-839.
[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155-1194, 2019.
[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214-224.
[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177-2185.
[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for Twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555-1565.
[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211-225, 2015.
[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.
[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: a deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.
[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414-421.
[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1-32, 2019.
[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17-36.
[39] S. J. Pan, Q. Yang, et al., "A survey on transfer learning."
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. ACM, 2010, pp. 375-384.
[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE'09. 16th Working Conference on. IEEE, 2009, pp. 205-214.
[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717-1724.
[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.
[47] A. Marcus, J. Maletic, et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125-135.
[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59-68.
[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480-483.
[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307-318.
[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392-403.
[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832-841.
[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13-22.
[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476-479.
[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741-2749.
[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211-1220.
[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.
[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649-657.
[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.
[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.
[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Faculty of Information Technology at Monash University. He has research interests in the areas of software engineering, deep learning and human-computer interaction. He is currently focusing on improving the usability and accessibility of mobile applications. He received the BS and MS degrees from Beijing Technology and Business University and the Australian National University in 2016 and 2018, respectively. He will be a PhD student at Monash University in 2020 under the supervision of Chunyang Chen.

Zhenchang Xing is a Senior Lecturer in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012-2016. Before joining NTU, Dr. Xing was a Lee Kuan Yew Research Fellow in the School of Computing, National University of Singapore, from 2009-2012. Dr. Xing's current research area is in the interdisciplinary areas of software engineering, human-computer interaction and applied data analytics. Dr. Xing has over 100 publications in peer-refereed journals and conference proceedings and has received several distinguished paper awards from top software engineering conferences. Dr. Xing regularly serves on the organization and program committees of the top software engineering conferences, and he will be the program committee co-chair for ICSME 2020.

Chunyang Chen is a lecturer (Assistant Professor) in the Faculty of Information Technology, Monash University, Australia. His research focuses on software engineering, deep learning and human-computer interaction. He has published over 16 papers in refereed journals or conferences, including Empirical Software Engineering, ICSE, ASE, CSCW, ICSME and SANER. He is a member of IEEE and ACM. He received the ACM SIGSOFT distinguished paper award in ASE 2018, the best paper award in SANER 2016, and the best tool demo award in ASE 2016. https://chunyang-chen.github.io/


Cheng Chen received his Bachelor degree in Software Engineering from Northwest University (China) and received his Master degree of Computing (major in Artificial Intelligence) from the Australian National University. He did internship projects about Natural Language Processing at CSIRO from 2016 to 2017. He worked as a research assistant supervised by Dr. Zhenchang Xing at ANU in 2018. Cheng currently works at PricewaterhouseCoopers (PwC) as a senior algorithm engineer of Natural Language Processing and Data Analyzing. Cheng is interested in Named Entity Extraction, Relation Extraction, Text Summarization and parallel computing. He is working on knowledge engineering and transactions for NLP tasks.

Lizhen Qu is a research fellow in the Dialogue Research lab in the Faculty of Information Technology at Monash University. He has extensive research experience in the areas of natural language processing, multimodal learning, deep learning and cybersecurity. He is currently focusing on information extraction, semantic parsing and multimodal dialogue systems. Prior to joining Monash University, he worked as a research scientist at Data61/CSIRO, where he led and participated in several research and industrial projects, including Deep Learning for Text and Deep Learning for Cyber.

Guoqiang Li is now an associate professor in the School of Software, Shanghai Jiao Tong University, and a guest associate professor in Kyushu University. He received the BS, MS and PhD degrees from Taiyuan University of Technology, Shanghai Jiao Tong University, and Japan Advanced Institute of Science and Technology in 2001, 2005 and 2008, respectively. He worked as a postdoctoral research fellow in the Graduate School of Information Science, Nagoya University, Japan, during 2008-2009, as an assistant professor in the School of Software, Shanghai Jiao Tong University, during 2009-2013, and as an academic visitor in the Department of Computer Science, University of Oxford, during 2015-2016. His research interests include formal verification, programming language theory, and data analytics and intelligence. He has published more than 40 research papers in international journals and mainstream conferences, including TDSC, SPE, TECS, IJFCS, SCIS, CSCW, ICSE, FORMATS, ATVA, etc.



5.2.3 Sentence-context Bi-LSTM

For the RNN, we use 50 hidden LSTM units to store the hidden states. The dimension of the hidden state vector is 50. Therefore, the dimension of the output vector e^sent_w is 100 (concatenating forward and backward hidden states). In order to mitigate overfitting [40], a dropout layer is added on the output of the Bi-LSTM with the dropout rate 0.5.
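
To make the layer concrete, the following sketch shows one way to realize this sentence-context extractor under the stated settings (50 hidden units per direction, 100-dimensional concatenated output, dropout rate 0.5). The framework choice (PyTorch) and the name input_dim for the size of the concatenated character- and word-level token embedding are our assumptions, not details prescribed above.

```python
import torch.nn as nn

class SentenceContextBiLSTM(nn.Module):
    """Minimal sketch: Bi-LSTM over per-token embeddings with dropout."""

    def __init__(self, input_dim, hidden_dim=50, dropout=0.5):
        super().__init__()
        # Forward and backward LSTMs, each with 50 hidden units.
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)  # mitigates overfitting [40]

    def forward(self, token_vectors):
        # token_vectors: (batch, sentence_length, input_dim)
        outputs, _ = self.bilstm(token_vectors)
        # outputs: (batch, sentence_length, 100) -- the forward and backward
        # hidden states concatenated, i.e., the vector e^sent_w per token.
        return self.dropout(outputs)
```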

5.3 Model Training

To train our neural model, we use input text and its corresponding sequence of API/non-API labels (see Section 6.1.2 for our data labeling process). The optimizer used is Adam [41], which performs well for training RNNs including Bi-LSTM [35]. The training epoch is set at 40. The best-performing model on the validation set is saved for testing. This model's parameters are also saved for the transfer learning experiments.
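
A minimal sketch of this training procedure is given below, assuming the hypothetical helpers train_batches/val_batches (iterables of padded token and label tensors) and f1_on (which computes the F1-score on the validation set); the Adam defaults and the checkpointing of the best validation model follow the description above.

```python
import torch

def train_model(model, train_batches, val_batches, epochs=40):
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.CrossEntropyLoss()  # two classes: API vs. non-API
    best_f1 = 0.0
    for _ in range(epochs):
        model.train()
        for tokens, labels in train_batches:
            optimizer.zero_grad()
            logits = model(tokens)  # (batch, seq_len, 2)
            loss = loss_fn(logits.reshape(-1, 2), labels.reshape(-1))
            loss.backward()
            optimizer.step()
        # Keep the parameters of the best model on the validation set;
        # these saved parameters are reused in the transfer learning runs.
        f1 = f1_on(model, val_batches)  # assumed evaluation helper
        if f1 > best_f1:
            best_f1 = f1
            torch.save(model.state_dict(), "best_model.pt")
    return best_f1
```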

6 EXPERIMENTS

We conduct a series of experiments to answer the following four research questions:

• RQ1: How well can our neural architecture learn multi-level features from the input texts? Can the learned features support high-quality API extraction compared with existing machine learning based API extraction methods?

• RQ2: What is the impact of the three feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) on the API extraction performance?

• RQ3: How effective is transfer learning for adapting API extraction models across libraries of the same language with different amounts of target-library training data?

• RQ4: How effective is transfer learning for adapting API extraction models across libraries of different languages with different amounts of target-library training data?

6.1 Experiment Setup

This section describes the libraries used in our experiments, how we prepare training and testing data, and the evaluation metrics of API extraction performance.

6.1.1 Studied libraries

Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React) and one C library (OpenGL). As reported in Section 2, these six libraries support very diverse functionalities for computer programming and have distinct API-naming and API-mention characteristics. Using these libraries, we can evaluate the effectiveness of our neural architecture for API extraction in very diverse data settings. We can also gain insights into the effectiveness of transfer learning for API extraction and the transferability of our neural architecture in different settings.

6.1.2 Dataset

We collect Stack Overflow posts (questions and their answers) tagged with pandas, numpy, matplotlib, opengl, react or jdbc as our experimental data. We use the Stack Overflow Data Dump released on March 18, 2018.

TABLE 3: Basic Statistics of Our Dataset

Library      Posts   Sentences   API mentions   Tokens
Matplotlib     600        4920           1481    47317
NumPy          600        2786           1552    39321
Pandas         600        3522           1576    42267
OpenGL         600        3486           1674    70757
JDBC           600        4205           1184    50861
React          600        3110           1262    42282
Total         3600       22029           8729   292805

In this data dump, 380,971 posts are tagged with one of the six studied libraries. Our data collection process follows four criteria. First, the number of posts selected and the number of API mentions in these posts should be on the same order of magnitude. Second, API mentions should exhibit the variations of API writing forms commonly seen on Stack Overflow. Third, the selection should avoid repeatedly selecting frequently mentioned APIs. Fourth, the same post should not appear in the dataset for different libraries (one post may be tagged with two or more studied-library tags). The first criterion ensures the fairness of comparison across libraries, the second and third criteria ensure the representativeness of API mentions in the dataset, and the fourth criterion ensures that there is no repeated data in different datasets.
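
As an illustration of the tag-based filtering step, the sketch below scans the data dump's Posts.xml and keeps posts carrying one of the six studied tags; the element and attribute names follow the public Stack Overflow dump schema. The remaining selection criteria above (mention variety, avoiding over-represented APIs, de-duplication across libraries) involve manual inspection and are not shown.

```python
import xml.etree.ElementTree as ET

STUDIED_TAGS = {"pandas", "numpy", "matplotlib", "opengl", "react", "jdbc"}

def collect_candidate_posts(posts_xml_path):
    """Yield (tag, body) pairs for posts tagged with a studied library."""
    for _, elem in ET.iterparse(posts_xml_path):
        tags = elem.get("Tags", "")          # e.g. "<python><pandas>"
        for tag in STUDIED_TAGS:
            if "<%s>" % tag in tags:
                yield tag, elem.get("Body", "")
                break
        elem.clear()                          # keep memory use flat
```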

We finally include 3600 posts (600 for each library) in our dataset, and each post has at least one API mention3. Table 3 summarizes the basic statistics of our dataset. These posts have in total 22029 sentences, 292805 token occurrences and 41486 unique tokens after data preprocessing and tokenization. We manually label the API mentions in the selected posts; 6421 sentences (34.07%) contain at least one API mention. Our dataset has in total 8729 API mentions. The selected Matplotlib, NumPy, Pandas, OpenGL, JDBC and React posts have 1481, 1552, 1576, 1674, 1184 and 1262 API mentions, which refer to 553, 694, 293, 201, 282 and 83 unique APIs of the six libraries, respectively. In addition, we randomly sample 100 labelled Stack Overflow posts for each library dataset. Then we examine each API mention to determine whether it is a simple name or not, and we find that among the 600 posts there are 1634 API mentions, of which 694 (42.47%) are simple API names. Our dataset not only contains a large number of API mentions in diverse writing forms, but also contains rich discussion context around API mentions. This makes it suitable for the training of our neural architecture, the testing of its performance and the study of model transferability.

We randomly split the 600 posts of each library into three subsets by a 6:2:2 ratio: 60% as training data, 20% as validation data for tuning model hyper-parameters, and 20% as testing data to answer our research questions. Since cross-validation would cost a great deal of time and our training dataset is unbiased (shown in Section 6.4), cross-validation is not used. Our experiment results also show that the training dataset is enough to train a high-performance neural network. Besides, the performance of the model on a target library can be improved by adding more training data from other libraries. And if more training data is added, there is no need to label a new validation dataset for the target library to re-tune the hyper-parameters, because we find that the neural model has very similar performance with the same hyper-parameters under different dataset settings.
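
A sketch of this per-library split is shown below, assuming posts is the list of 600 labeled posts for one library; the fixed seed is our addition for reproducibility.

```python
import random

def split_6_2_2(posts, seed=0):
    """Split one library's posts into 60% train, 20% validation, 20% test."""
    shuffled = posts[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train = shuffled[:int(0.6 * n)]
    validation = shuffled[int(0.6 * n):int(0.8 * n)]
    test = shuffled[int(0.8 * n):]
    return train, validation, test
```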

3 https://drive.google.com/drive/folders/1f7ejNVUsew9l9uPCj4Xv5gMzNbqttpoa?usp=sharing


TABLE 4: Comparison of CRF Baselines [2] and Our Neural Model for API Extraction

            Basic CRF               Full CRF                Our method
Library     Prec   Recall  F1       Prec   Recall  F1       Prec   Recall  F1
Matplotlib  67.35  47.84   56.62    89.62  61.31   72.82    81.50  83.92   82.70
NumPy       75.83  39.91   52.29    89.21  49.31   63.51    78.24  62.91   69.74
Pandas      62.06  70.49   66.01    97.11  71.21   82.16    82.73  85.30   82.80
OpenGL      43.91  91.87   59.42    94.57  70.62   80.85    85.83  83.74   84.77
JDBC        15.83  82.61   26.58    87.32  51.40   64.71    84.69  55.31   66.92
React       16.74  88.58   28.16    97.42  70.11   81.53    87.95  78.17   82.77

6.1.3 Evaluation metrics

We use precision, recall and F1-score to evaluate the performance of an API extraction method. Precision measures what percentage of the API mentions recognized by a method are correct. Recall measures what percentage of the API mentions in the testing dataset are recognized correctly by a method. F1-score is the harmonic mean of precision and recall, i.e., 2 * (precision * recall) / (precision + recall).
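
For clarity, the token-level computation of these metrics can be sketched as follows; representing mentions as sets of (sentence id, token position) pairs is our assumption about the bookkeeping, not something the paper prescribes.

```python
def precision_recall_f1(gold_mentions, predicted_mentions):
    """gold_mentions / predicted_mentions: sets of (sentence_id, position)."""
    tp = len(gold_mentions & predicted_mentions)
    precision = tp / len(predicted_mentions) if predicted_mentions else 0.0
    recall = tp / len(gold_mentions) if gold_mentions else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```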

6.2 Performance of Our Neural Model (RQ1)

Motivation: Recently, linear-chain CRF has been used to solve the API extraction problem in informal text [2]. That work shows that machine learning based API extraction outperforms several commonly-used rule-based methods [2], [42], [43]. The approach in [2] relies on human-defined orthographic features, two different unsupervised language models (class-based Brown clustering and neural-network based word embedding) and API gazetteer features (API inventory). In contrast, our neural architecture uses neural networks to automatically learn character-, word- and sentence-level features from the input texts. The first RQ is to confirm the effectiveness of these neural-network feature extractors for API extraction tasks by comparing the overall performance of our neural architecture with the performance of the linear-chain CRF with human-defined features for API extraction.

Approach: We use the implementation of the CRF model in [2] to build the two CRF baselines. The basic CRF baseline uses only the orthographic features developed in [2]. The full CRF baseline uses all orthographic, word-cluster and API gazetteer features developed in [2]. The basic CRF baseline is easy to deploy because it uses only the orthographic features in the input texts, but not any advanced word-cluster and API gazetteer features. And the self-training process is not used in these two baselines, since we have sufficient training data. In contrast, the full CRF baseline uses advanced features, so that it is not as easy-to-deploy as the basic CRF. But the full CRF has much better performance than the basic CRF, as reported in [2]. Comparing our model with these two baselines, we can understand whether our model can achieve a good tradeoff between easy-to-deploy and high performance. Note that as Ye et al. [2] already demonstrate the superior performance of machine learning based API extraction methods over rule-based methods, we do not consider rule-based baselines for API extraction in this study.

Results: From Table 4, we can see that:

• Although linear CRF with full features has close performance to our model, linear CRF with only orthographic features performs poorly in distinguishing API tokens from non-API tokens. The best F1-score of the basic CRF is only 0.66 for Pandas. The F1-score of the basic CRF for Matplotlib, NumPy and OpenGL is 0.52-0.59. For JDBC and React, the F1-score of the basic CRF is below 0.3. Our results are consistent with the results in [2] when advanced word-cluster and API-gazetteer features are ablated. Compared with the basic CRF, the full CRF performs much better. For Pandas, JDBC and React, the full CRF has very close performance to our model. But for Matplotlib, NumPy and OpenGL, our model still outperforms the full CRF by at least four points in F1-score.

• Multi-level feature embedding by our neural architecture is effective in distinguishing API tokens from non-API tokens in the resulting embedding space. All evaluation metrics of our neural architecture are significantly higher than those of the basic CRF. Although our neural architecture performs relatively worse on NumPy and JDBC, its F1-score is still much higher than the F1-score of the basic CRF on NumPy and JDBC. We examine false positives (non-API tokens recognized as APIs) and false negatives (API tokens recognized as non-APIs) by our neural architecture on NumPy and JDBC. We find that many false positives and false negatives involve API mentions composed of complex strings, for example, array expressions for NumPy or SQL statements for JDBC. It seems that neural networks learn some features from such complex strings that may confuse the model and cause the failure to tell apart API tokens from non-API ones. Furthermore, by analysing the classification results for different API mentions, we find that our model has good generalizability on different API functions.

With linear CRF for API extraction, we cannot achieve a good tradeoff between easy-to-deploy and high performance. In contrast, our neural architecture can extract effective character-, word- and sentence-level features from the input texts, and the extracted features alone can support high-quality API extraction without using any hand-crafted advanced features such as word clusters and API gazetteers.

6.3 Impact of Feature Extractors (RQ2)

Motivation: Although our neural architecture achieves very good performance as a whole, we would like to further investigate how much the different feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) contribute to this good performance, and how different features affect the precision and recall of API extraction.


TABLE 5: The Results of Feature Ablation Experiments

            Ablating char CNN      Ablating Word Embeddings   Ablating Bi-LSTM       All features
            Prec   Recall  F1      Prec   Recall  F1           Prec   Recall  F1     Prec   Recall  F1
Matplotlib  75.70  72.32   73.98   81.75  63.99   71.79        82.59  69.34   75.44  81.50  83.92   82.70
NumPy       81.00  60.40   69.21   79.31  58.47   67.31        80.62  56.67   66.44  78.24  62.91   69.74
Pandas      80.50  83.28   81.82   81.77  68.01   74.25        83.22  75.22   79.65  82.73  85.30   82.80
OpenGL      77.58  85.37   81.29   83.07  68.03   75.04        98.52  72.08   83.05  85.83  83.74   84.77
JDBC        68.65  61.25   64.47   64.22  65.62   64.91        99.28  43.12   66.13  84.69  55.31   66.92
React       69.90  85.71   77.01   84.37  75.00   79.41        98.79  65.08   78.47  87.95  78.17   82.77

Approach: We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without the character-level CNN, one without word embeddings, and one without the sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors.
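
The ablation protocol can be sketched as iterating over model configurations, assuming a hypothetical build_model factory that wires the three extractors according to boolean flags:

```python
FEATURES = ("char_cnn", "word_embedding", "sentence_bilstm")

def ablation_configs():
    """One configuration per ablated feature, plus the full model."""
    for ablated in FEATURES:
        yield {name: name != ablated for name in FEATURES}
    yield {name: True for name in FEATURES}

# for config in ablation_configs():
#     model = build_model(**config)  # assumed factory for our architecture
#     ... train and evaluate as in Section 5.3 ...
```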

Results: In Table 5, for each library we compare each metric after ablating a feature with the same metric when all features are used, paying attention to the largest drop for each library and to any metric that increases after ablation. We can see that:

• The performance of our neural architecture is contributed by the combined action of all its features. Ablating any of the features, the F1-score degrades. Ablating word embeddings or Bi-LSTM causes the largest drop in F1-score for four libraries, while ablating char-CNN causes the largest drop in F1-score for two libraries. Feature ablation has higher impact on some libraries than others. For example, ablating char-CNN, word embeddings and Bi-LSTM all cause significant drops in F1-score for Matplotlib, and ablating word embeddings causes a significant drop in F1-score for Pandas and React. In contrast, ablating a feature causes a relatively minor drop in F1-score for NumPy, JDBC and React. This indicates that the different levels of features learned by our neural architecture can be more distinct for some libraries but more complementary for others. When different features are more distinct from one another, achieving good API extraction performance relies more on all the features.

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating char-CNN causes a drop in precision for the five libraries except NumPy. The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating char-CNN, ablating word embeddings and Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for NumPy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating Bi-LSTM the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. Char-CNN is especially useful for filtering out non-API tokens, while word embeddings and Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation: As described in Section 6.1.1, we intentionally choose three different-functionality Python libraries: Pandas, Matplotlib and NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach: We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring a Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfer (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs involving Matplotlib are relatively more-distant-libraries transfer.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL)

        NP→MPL                 PD→MPL                 MPL
        Prec   Recall  F1      Prec   Recall  F1      Prec   Recall  F1
1/1     82.64  89.29   85.84   81.02  80.06   80.58   87.67  78.27   82.70
1/2     81.84  84.52   83.16   71.61  83.33   76.96   81.38  70.24   75.40
1/4     71.83  75.89   73.81   67.88  77.98   72.65   81.22  55.36   65.84
1/8     70.56  75.60   72.98   69.66  73.81   71.71   75.00  53.57   62.50
1/16    73.56  72.02   72.78   66.48  72.02   69.16   80.70  27.38   40.89
1/32    72.56  78.83   73.69   71.47  69.35   70.47   97.50  11.60   20.74
DU      72.54  66.07   69.16   76.99  54.76   64.00

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP)

        MPL→NP                 PD→NP                  NP
        Prec   Recall  F1      Prec   Recall  F1      Prec   Recall  F1
1/1     86.85  77.08   81.68   77.51  67.50   72.16   78.24  62.92   69.74
1/2     78.80  82.08   80.41   70.13  67.51   68.79   75.13  60.42   66.97
1/4     93.64  76.67   80.00   65.88  70.00   67.81   77.14  45.00   56.84
1/8     73.73  78.33   75.96   65.84  66.67   66.25   71.03  42.92   53.51
1/16    76.19  66.67   71.11   58.33  64.17   61.14   57.07  48.75   52.58
1/32    75.54  57.92   65.56   60.27  56.25   58.23   72.72  23.33   35.33
DU      64.16  65.25   64.71   62.96  59.32   61.08

We use gradually-reduced target-library training data (1/1 for all data, 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison. We also use the source-library-trained model directly without any fine-tuning (i.e., 0% target-library training data) as a baseline.
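
This fine-tuning procedure can be sketched as follows, reusing the training loop of Section 5.3: load the source-library parameters, then continue training every layer on the reduced target-library data. The helpers take_fraction and train_model are assumptions of this sketch, not names from the paper.

```python
import random
import torch

def take_fraction(examples, fraction, seed=0):
    """Randomly keep e.g. 1/2, 1/4, ..., 1/32 of the target training data."""
    k = max(1, int(len(examples) * fraction))
    return random.Random(seed).sample(examples, k)

def transfer(model, source_weights, target_train, target_val, fraction):
    # Initialize all layers from the source-library-trained model ...
    model.load_state_dict(torch.load(source_weights))
    # ... then fine-tune every parameter on the reduced target data.
    subset = take_fraction(target_train, fraction)
    train_model(model, subset, target_val)  # loop from the Section 5.3 sketch
    return model
```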


TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL→PD                 NP→PD                  PD
        Prec   Recall  F1      Prec   Recall  F1      Prec   Recall  F1
1/1     84.97  89.62   87.23   86.86  87.61   87.23   80.43  85.30   82.80
1/2     86.18  84.43   85.30   83.97  83.00   83.48   88.01  74.06   80.43
1/4     81.07  87.61   84.21   81.07  82.70   81.88   76.50  76.95   76.72
1/8     87.54  80.98   84.13   85.76  76.37   80.79   69.30  83.29   75.65
1/16    82.04  78.96   80.47   82.45  75.79   78.98   84.21  50.72   53.31
1/32    81.31  75.22   78.14   81.43  72.05   76.45   84.25  30.84   45.14
DU      71.65  40.06   51.39   75.00  35.45   48.14

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL→PD                 NP→PD
        Prec   Recall  F1      Prec   Recall  F1
1/16    83.33  77.81   80.43   83.39  75.22   79.09
1/32    79.50  73.78   76.53   86.15  71.15   78.30

We also randomly select 1/16 and 1/32 training data for NumPy → Pandas and Matplotlib → Pandas 10 times and train the model each time. Then we calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score.

Results: Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. Acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas) and DU (Direct Use). The last column is the result of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performance target-library model than training the target-library model from scratch with the same proportion of training data. This is evident as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has an F1-score of 69.74%. However, the NumPy model transferred from the Matplotlib model has an F1-score of 81.68% (a 17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 NumPy training data for fine-tuning, the F1-score (71.11%) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for a difficult dataset. Such a performance boost by transfer learning has also been observed in many studies [14], [44], [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has a better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of F1-score (e.g., 3% of the F1-score of the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has an F1-score of 80.47%.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 training data, i.e., about 10 posts, for transfer learning, the F1-score of the obtained target-library model is still improved by a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (DU row). Furthermore, the averaged results of NP → PD and MPL → PD for 10 randomly selected 1/16 and 1/32 training data are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows our training data is unbiased. Although the target-library model trained from scratch still has a reasonable F1-score with 1/2 to 1/8 training data, it becomes completely useless with 1/16 or 1/32 training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-scores in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16), compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce a much more balanced precision and recall (and thus a much better F1-score) even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with an F1-score above 80% using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences in API naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning is. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 target-library training data, compared with the target-library model trained from scratch using


all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 training data, compared with the Matplotlib model trained from scratch using all training data.

• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the fewest polysemous APIs (16), and Pandas and NumPy have many more polysemous APIs. Both MPL → PD and MPL → NP result in high-quality target models even with 1/8 target-library training data. The transfer learning seems to be able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 training data to have a high-quality target model (F1-score 83.16%), and for PD → MPL the Matplotlib model (F1-score 80.58%) obtained by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.70%). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs" learned from NumPy and Pandas to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation: In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach: For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts in each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model. For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational databases). For JavaScript, we choose React (a library for web graphical user interfaces). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.

Results: Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

• Directly reusing the source-language-trained model on the target-language text produces unacceptable performance.

TABLE 10: Python → Java

        Python→Java            Java
        Prec   Recall  F1      Prec   Recall  F1
1/1     77.45  66.56   71.60   84.69  55.31   66.92
1/2     72.20  62.50   67.06   77.38  53.44   63.22
1/4     69.26  50.00   58.08   71.22  47.19   56.77
1/8     50.00  64.06   56.16   75.15  39.69   51.94
1/16    55.71  48.25   52.00   75.69  34.06   46.98
1/32    56.99  40.00   44.83   77.89  23.12   35.66
DU      44.44  28.75   34.91

TABLE 11: Python → JavaScript

        Python→JavaScript      JavaScript
        Prec   Recall  F1      Prec   Recall  F1
1/1     77.45  81.75   86.19   87.95  78.17   82.77
1/2     86.93  76.59   81.43   87.56  59.84   77.70
1/4     86.84  68.65   74.24   83.08  66.26   73.72
1/8     81.48  61.11   69.84   85.98  55.95   67.88
1/16    71.11  63.49   68.08   87.38  38.48   53.44
1/32    66.67  52.38   58.67   65.21  35.71   46.15
DU      51.63  25.00   33.69

For the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16%), PD → MPL (64.00%), MPL → NP (64.71%). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91%), Python → JavaScript (33.69%), Python → C (35.95%). These results suggest that there is still a certain level of commonality between libraries with similar functionalities and of the same programming language, so that the knowledge learned in the source-library model may be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or we may exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning holds the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 Java training data boosts the F1-score by 50%, compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-


TABLE 12: Python → C

        Python→C               C
        Prec   Recall  F1      Prec   Recall  F1
1/1     89.87  85.09   87.47   85.83  83.74   84.77
1/2     85.35  83.73   84.54   86.28  76.69   81.21
1/4     83.83  75.88   79.65   82.19  71.27   76.34
1/8     74.32  73.71   74.01   78.68  68.02   72.97
1/16    75.81  69.45   72.60   88.52  57.10   69.42
1/32    69.62  65.85   67.79   87.89  45.25   59.75
DU      56.49  27.91   35.95

language training data. For within-language transfer learning, achieving this result may require as little as 1/8 target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is the API labeling errors in the dataset. In order to decrease errors, the first two authors first independently label the same data and then resolve the disagreements by discussion. Furthermore, in our experiments, we have to examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to some API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs such as camelcase, special characters (e.g., "." or "()") and API annotations.
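
For illustration, the snippet below shows the kind of orthographic regular expression such rule-based extractors rely on; the exact patterns vary by work, and this one is only an approximation we constructed. It catches dotted or parenthesized mentions and camelCase identifiers but, as discussed next, it misses bare simple names.

```python
import re

# Approximate orthographic pattern: dotted names with optional "()",
# camelCase identifiers, or names directly suffixed by "()".
API_LIKE = re.compile(r"\b(?:\w+(?:\.\w+)+(?:\(\))?|[a-z]+[A-Z]\w*|\w+\(\))")

text = "you can call df.apply() here, but apply and series alone are missed"
print(API_LIKE.findall(text))   # ['df.apply()'] -- bare simple names missed
```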

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by "()", such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF based approach for extracting mentions of software entities, such as programming languages, libraries, computing concepts and APIs, from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features have superior performance over human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture in many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.


[21] C Chen S Gao and Z Xing ldquoMining analogical libraries in qampadiscussionsndashincorporating relational and categorical knowledgeinto word embeddingrdquo in 2016 IEEE 23rd international conferenceon software analysis evolution and reengineering (SANER) vol 1IEEE 2016 pp 338ndash348

[22] C Chen and Z Xing ldquoSimilartech automatically recommendanalogical libraries across different programming languagesrdquo in2016 31st IEEEACM International Conference on Automated SoftwareEngineering (ASE) IEEE 2016 pp 834ndash839

[23] C Chen Z Xing and Y Liu ldquoWhats spains paris mining analogi-cal libraries from qampa discussionsrdquo Empirical Software Engineeringvol 24 no 3 pp 1155ndash1194 2019

[24] Y Huang C Chen Z Xing T Lin and Y Liu ldquoTell them apartdistilling technology differences from crowd-scale comparisondiscussionsrdquo in ASE 2018 pp 214ndash224

[25] O Levy and Y Goldberg ldquoNeural word embedding as implicitmatrix factorizationrdquo in Advances in neural information processingsystems 2014 pp 2177ndash2185

[26] D Tang F Wei N Yang M Zhou T Liu and B Qin ldquoLearningsentiment-specific word embedding for twitter sentiment classifi-cationrdquo in Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) vol 1 2014pp 1555ndash1565

[27] O Levy Y Goldberg and I Dagan ldquoImproving distributional sim-ilarity with lessons learned from word embeddingsrdquo Transactionsof the Association for Computational Linguistics vol 3 pp 211ndash2252015

[28] J Pennington R Socher and C Manning ldquoGlove Global vectorsfor word representationrdquo in Proceedings of the 2014 conference onempirical methods in natural language processing (EMNLP) 2014 pp1532ndash1543

[29] I Sutskever O Vinyals and Q V Le ldquoSequence to sequencelearning with neural networksrdquo in Advances in neural informationprocessing systems 2014 pp 3104ndash3112

[30] C Chen Z Xing and Y Liu ldquoBy the community amp for the com-munity a deep learning approach to assist collaborative editing inqampa sitesrdquo Proceedings of the ACM on Human-Computer Interactionvol 1 no CSCW p 32 2017

[31] S Gao C Chen Z Xing Y Ma W Song and S-W Lin ldquoA neuralmodel for method name generation from functional descriptionrdquo

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 15

in 2019 IEEE 26th International Conference on Software AnalysisEvolution and Reengineering (SANER) IEEE 2019 pp 414ndash421

[32] X Wang C Chen and Z Xing ldquoDomain-specific machine trans-lation with recurrent neural network for software localizationrdquoEmpirical Software Engineering pp 1ndash32 2019

[33] C Chen Z Xing Y Liu and K L X Ong ldquoMining likelyanalogical apis across third-party libraries via large-scale unsu-pervised api semantics embeddingrdquo IEEE Transactions on SoftwareEngineering 2019

[34] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45 no 11pp 2673ndash2681 1997

[35] Z Huang W Xu and K Yu ldquoBidirectional lstm-crf models forsequence taggingrdquo arXiv preprint arXiv150801991 2015

[36] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural computation vol 9 no 8 pp 1735ndash1780 1997

[37] H Sak A Senior and F Beaufays ldquoLong short-term memoryrecurrent neural network architectures for large scale acousticmodelingrdquo in Fifteenth annual conference of the international speechcommunication association 2014

[38] Y Bengio ldquoDeep learning of representations for unsupervised andtransfer learningrdquo in Proceedings of ICML Workshop on Unsupervisedand Transfer Learning 2012 pp 17ndash36

[39] S J Pan Q Yang et al ldquoA survey on transfer learningrdquo[40] N Srivastava G Hinton A Krizhevsky I Sutskever and

R Salakhutdinov ldquoDropout A simple way to prevent neural net-works from overfittingrdquo The Journal of Machine Learning Researchvol 15 no 1 pp 1929ndash1958 2014

[41] D P Kingma and J Ba ldquoAdam A method for stochastic optimiza-tionrdquo arXiv preprint arXiv14126980 2014

[42] A Bacchelli M Lanza and R Robbes ldquoLinking e-mails andsource code artifactsrdquo in Proceedings of the 32nd ACMIEEE Inter-national Conference on Software Engineering-Volume 1 ACM 2010pp 375ndash384

[43] A Bacchelli M DrsquoAmbros M Lanza and R Robbes ldquoBench-marking lightweight techniques to link e-mails and source coderdquoin Reverse Engineering 2009 WCRErsquo09 16th Working Conference onIEEE 2009 pp 205ndash214

[44] S J Pan and Q Yang ldquoA survey on transfer learningrdquo IEEETransactions on knowledge and data engineering vol 22 no 10 pp1345ndash1359 2010

[45] M Oquab L Bottou I Laptev and J Sivic ldquoLearning and transfer-ring mid-level image representations using convolutional neuralnetworksrdquo in Computer Vision and Pattern Recognition (CVPR) 2014IEEE Conference on IEEE 2014 pp 1717ndash1724

[46] S Ravi and H Larochelle ldquoOptimization as a model for few-shotlearningrdquo 2016

[47] A Marcus J Maletic et al ldquoRecovering documentation-to-source-code traceability links using latent semantic indexingrdquo in Soft-ware Engineering 2003 Proceedings 25th International Conference onIEEE 2003 pp 125ndash135

[48] H-Y Jiang T N Nguyen X Chen H Jaygarl and C K ChangldquoIncremental latent semantic indexing for automatic traceabil-ity link evolution managementrdquo in Proceedings of the 2008 23rdIEEEACM International Conference on Automated Software Engineer-ing IEEE Computer Society 2008 pp 59ndash68

[49] W Zheng Q Zhang and M Lyu ldquoCross-library api recommen-dation using web search enginesrdquo in Proceedings of the 19th ACMSIGSOFT symposium and the 13th European conference on Foundationsof software engineering ACM 2011 pp 480ndash483

[50] Q Gao H Zhang J Wang Y Xiong L Zhang and H Mei ldquoFixingrecurring crash bugs via analyzing qampa sites (t)rdquo in AutomatedSoftware Engineering (ASE) 2015 30th IEEEACM International Con-ference on IEEE 2015 pp 307ndash318

[51] C Treude and M P Robillard ldquoAugmenting api documentationwith insights from stack overflowrdquo in Software Engineering (ICSE)2016 IEEEACM 38th International Conference on IEEE 2016 pp392ndash403

[52] P C Rigby and M P Robillard ldquoDiscovering essential codeelements in informal documentationrdquo in Proceedings of the 2013International Conference on Software Engineering IEEE Press 2013pp 832ndash841

[53] L Moonen ldquoGenerating robust parsers using island grammarsrdquoin Reverse Engineering 2001 Proceedings Eighth Working Conferenceon IEEE 2001 pp 13ndash22

[54] A Bacchelli A Cleve M Lanza and A Mocci ldquoExtracting struc-tured data from natural language documents with island parsingrdquo

in Automated Software Engineering (ASE) 2011 26th IEEEACMInternational Conference on IEEE 2011 pp 476ndash479

[55] Y Kim Y Jernite D Sontag and A M Rush ldquoCharacter-awareneural language modelsrdquo in AAAI 2016 pp 2741ndash2749

[56] D Lukovnikov A Fischer J Lehmann and S Auer ldquoNeuralnetwork-based question answering over knowledge graphs onword and character levelrdquo in Proceedings of the 26th internationalconference on World Wide Web International World Wide WebConferences Steering Committee 2017 pp 1211ndash1220

[57] W Ling I Trancoso C Dyer and A W Black ldquoCharacter-basedneural machine translationrdquo arXiv preprint arXiv151104586 2015

[58] X Zhang J Zhao and Y LeCun ldquoCharacter-level convolutionalnetworks for text classificationrdquo in Advances in neural informationprocessing systems 2015 pp 649ndash657

[59] C D Santos and B Zadrozny ldquoLearning character-level repre-sentations for part-of-speech taggingrdquo in Proceedings of the 31stInternational Conference on Machine Learning (ICML-14) 2014 pp1818ndash1826

[60] J P Chiu and E Nichols ldquoNamed entity recognition with bidirec-tional lstm-cnnsrdquo arXiv preprint arXiv151108308 2015

[61] G Lample M Ballesteros S Subramanian K Kawakami andC Dyer ldquoNeural architectures for named entity recognitionrdquoarXiv preprint arXiv160301360 2016

Suyu Ma is a research assistant in the Facultyof Information Technology at Monash UniversityHe has research interest in the areas of soft-ware engineering Deep learning and Human-computer Interaction He is currently focusing onimproving the usability and accessibility of mo-bile application He received the BS and MSdegrees from Beijing Technology and BusinessUniversity and the Australian National Universityin 2016 and 2018 respectively And he will bea PhD student at Monash University in 2020

under the supervision of Chunyang Chen

Zhenchang Xing is a Senior Lecturer in the Re-search School of Computer Science AustralianNational University Previously he was an Assis-tant Professor in the School of Computer Sci-ence and Engineering Nanyang TechnologicalUniversity Singapore from 2012-2016 Beforejoining NTU Dr Xing was a Lee Kuan Yew Re-search Fellow in the School of Computing Na-tional University of Singapore from 2009-2012Dr Xings current research area is in the interdis-plinary areas of software engineering human-

computer interaction and applied data analytics Dr Xing has over 100publications in peer-refereed journals and conference proceedings andhave received several distinguished paper awards from top softwareengineering conferences Dr Xing regularly serves on the organizationand program committees of the top software engineering conferencesand he will be the program committee co-chair for ICSME2020

Chunyang Chen is a lecturer (Assistant Pro-fessor) in Faculty of Information TechnologyMonash University Australia His research fo-cuses on software engineering deep learningand human-computer interaction He has pub-lished over 16 papers in referred journals orconferences including Empirical Software Engi-neering ICSE ASE CSCW ICSME SANERHe is a member of IEEE and ACM He receivedACM SIGSOFT distinguished paper award inASE 2018 best paper award in SANER 2016

and best tool demo in ASE 2016 httpschunyang-chengithubio

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 16

Cheng Chen received his Bachelor degree inSoftware Engineering from Northwest University(China) and received Master degree of Com-puting (major in Artificial Intelligence) from theAustralian National University He did some in-ternship projects about Natural Language Pro-cessing at CSIRO from 2016 to 2017 He workedas a research assistant supervised by Dr Zhen-chang Xing at ANU in 2018 Cheng currentlyworks in PricewaterhouseCoopers (PwC) Firmas a senior algorithm engineer of Natural Lan-

guage Processing and Data Analyzing Cheng is interested in NamedEnti ty Extraction Relation Extraction Text summarization and parallelcomputing He is working on knowledge engineering and transaction forNLP tasks

Lizhen Qu is a research fellow in the DialogueResearch lab in the Faculty of Information Tech-nology at Monash University He has exten-sive research experience in the areas of naturallanguage processing multimodal learning deeplearning and Cybersecurity He is currently fo-cusing on information extraction semantic pars-ing and multimodal dialogue systems Prior tojoining Monash University he worked as a re-search scientist at Data61CSIRO where he ledand participated in several research and indus-

trial projects including Deep Learning for Text and Deep Learning forCyber

Guoqiang Li is now an associate professor inschool of software Shanghai Jiao Tong Univer-sity and a guest associate professor in KyushuUniversity He received the BS MS and PhDdegrees from Taiyuan University of TechnologyShanghai Jiao Tong University and Japan Ad-vanced Institute of Science and Technology in2001 2005 and 2008 respectively He workedas a postdoctoral research fellow in the graduateschool of information science Nagoya Univer-sity Japan during 2008-2009 as an assistant

professor in the school of software Shanghai Jiao Tong Universityduring 2009-2013 and as an academic visitor in the department ofcomputer science University of Oxford during 2015-2016 His researchinterests include formal verification programming language theory anddata analytics and intelligence He published more than 40 researchespapers in the international journals and mainstream conferences includ-ing TDSC SPE TECS IJFCS SCIS CSCW ICSE FORMATS ATVAetc

Page 9: JOURNAL OF LA Easy-to-Deploy API Extraction by Multi-Level ... · sufficient high-quality training data for APIs of some less frequently discussed libraries or frameworks. Another


TABLE 4: Comparison of CRF Baselines [2] and Our Neural Model for API Extraction

             Basic CRF               Full CRF                Our method
Library      Prec   Recall  F1      Prec   Recall  F1       Prec   Recall  F1
Matplotlib   67.35  47.84   56.62   89.62  61.31   72.82    81.50  83.92   82.70
NumPy        75.83  39.91   52.29   89.21  49.31   63.51    78.24  62.91   69.74
Pandas       62.06  70.49   66.01   97.11  71.21   82.16    82.73  85.30   82.80
OpenGL       43.91  91.87   59.42   94.57  70.62   80.85    85.83  83.74   84.77
JDBC         15.83  82.61   26.58   87.32  51.40   64.71    84.69  55.31   66.92
React        16.74  88.58   28.16   97.42  70.11   81.53    87.95  78.17   82.77

Our neural model has very similar performance with the same hyper-parameters under different dataset settings.

6.1.3 Evaluation metrics

We use precision, recall and F1-score to evaluate the performance of an API extraction method. Precision measures what percentage of the API mentions recognized by a method are correct. Recall measures what percentage of the API mentions in the testing dataset are recognized correctly by a method. F1-score is the harmonic mean of precision and recall, i.e., 2 × (precision × recall) / (precision + recall).
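To make these metrics concrete, the short sketch below computes token-level precision, recall and F1-score from parallel gold and predicted tag sequences; the tag names and data layout are illustrative assumptions rather than our released implementation.

def token_metrics(gold_tags, pred_tags):
    # Token-level precision/recall/F1 for API extraction.
    # gold_tags / pred_tags: parallel lists of 'API' or 'O' tags,
    # one tag per token in the testing dataset.
    pairs = list(zip(gold_tags, pred_tags))
    tp = sum(g == 'API' and p == 'API' for g, p in pairs)  # correct API tokens
    fp = sum(g == 'O' and p == 'API' for g, p in pairs)    # non-API recognized as API
    fn = sum(g == 'API' and p == 'O' for g, p in pairs)    # API tokens missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(token_metrics(['API', 'O', 'API', 'O'], ['API', 'API', 'O', 'O']))
# -> (0.5, 0.5, 0.5)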

6.2 Performance of Our Neural Model (RQ1)

Motivation: Recently, linear-chain CRF has been used to solve the API extraction problem in informal text [2]. That work shows that machine-learning based API extraction outperforms several commonly-used rule-based methods [2], [42], [43]. The approach in [2] relies on human-defined orthographic features, two different unsupervised language models (class-based Brown clustering and neural-network based word embedding) and API gazetteer features (API inventory). In contrast, our neural architecture uses neural networks to automatically learn character-, word- and sentence-level features from the input texts. The first RQ is to confirm the effectiveness of these neural-network feature extractors for API extraction tasks by comparing the overall performance of our neural architecture with the performance of the linear-chain CRF with human-defined features for API extraction.

Approach: We use the implementation of the CRF model in [2] to build the two CRF baselines. The basic CRF baseline uses only the orthographic features developed in [2]. The full CRF baseline uses all orthographic, word-cluster and API gazetteer features developed in [2]. The basic CRF baseline is easy to deploy because it uses only the orthographic features in the input texts but not any advanced word-cluster and API gazetteer features, and the self-training process is not used in these two baselines since we have sufficient training data. In contrast, the full CRF baseline uses advanced features, so it is not as easy to deploy as the basic CRF; but the full CRF has much better performance than the basic CRF, as reported in [2]. Comparing our model with these two baselines, we can understand whether our model can achieve a good tradeoff between ease of deployment and high performance. Note that as Ye et al. [2] already demonstrate the superior performance of machine-learning based API extraction methods over rule-based methods, we do not consider rule-based baselines for API extraction in this study.
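As an illustration of such a baseline, the sketch below assembles a linear-chain CRF with a few simplified orthographic features using the sklearn-crfsuite library; the feature set is a stand-in for the features developed in [2], not their exact implementation.

import sklearn_crfsuite  # assumed dependency: pip install sklearn-crfsuite

def orthographic_features(sentence, i):
    # A few simplified orthographic clues in the spirit of [2].
    word = sentence[i]
    return {
        'word.lower': word.lower(),
        'has_dot': '.' in word,                   # e.g., pandas.DataFrame
        'ends_with_parens': word.endswith('()'),  # e.g., apply()
        'has_underscore': '_' in word,
        'is_mixed_case': word != word.lower() and word != word.upper(),
    }

def sent2features(sentence):
    return [orthographic_features(sentence, i) for i in range(len(sentence))]

# Toy labeled data: token sequences with 'API'/'O' tags.
X_train = [['How', 'to', 'use', 'df.apply', 'in', 'pandas', '?']]
y_train = [['O', 'O', 'O', 'API', 'O', 'O', 'O']]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
crf.fit([sent2features(s) for s in X_train], y_train)
print(crf.predict([sent2features(['use', 'df.apply', 'here'])]))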

Results: From Table 4, we can see that:

• Although linear CRF with full features has close performance to our model, linear CRF with only orthographic features performs poorly in distinguishing API tokens from non-API tokens. The best F1-score of the basic CRF is only 0.66, for Pandas. The F1-score of the basic CRF for Matplotlib, NumPy and OpenGL is 0.52-0.59. For JDBC and React, the F1-score of the basic CRF is below 0.3. Our results are consistent with the results in [2] when advanced word-cluster and API-gazetteer features are ablated. Compared with the basic CRF, the full CRF performs much better. For Pandas, JDBC and React, the full CRF has very close performance to our model. But for Matplotlib, NumPy and OpenGL, our model still outperforms the full CRF by at least four points in F1-score.

• Multi-level feature embedding by our neural architecture is effective in distinguishing API tokens from non-API tokens in the resulting embedding space. All evaluation metrics of our neural architecture are significantly higher than those of the basic CRF. Although our neural architecture performs relatively worse on NumPy and JDBC, its F1-score is still much higher than the F1-score of the basic CRF on NumPy and JDBC. We examine false positives (non-API tokens recognized as API) and false negatives (API tokens recognized as non-API) by our neural architecture on NumPy and JDBC. We find that many false positives and false negatives involve API mentions composed of complex strings, for example, array expressions for NumPy or SQL statements for JDBC. It seems that neural networks learn some features from such complex strings that may confuse the model and cause the failure to tell apart API tokens from non-API ones. Furthermore, by analysing the classification results for different API mentions, we find that our model has good generalizability on different API functions.

With linear CRF for API extraction, we cannot achieve a good tradeoff between ease of deployment and high performance. In contrast, our neural architecture can extract effective character-, word- and sentence-level features from the input texts, and the extracted features alone can support high-quality API extraction without using any hand-crafted advanced features such as word clusters and API gazetteers.

6.3 Impact of Feature Extractors (RQ2)

Motivation: Although our neural architecture achieves very good performance as a whole, we would like to further investigate how much different feature extractors (character-level CNN, word embeddings and sentence-context Bi-LSTM) contribute to this good performance, and how different features affect the precision and recall of API extraction.


TABLE 5: The Results of Feature Ablation Experiments

             Ablating char CNN       Ablating Word Embeddings  Ablating Bi-LSTM        All features
Library      Prec   Recall  F1      Prec   Recall  F1         Prec   Recall  F1       Prec   Recall  F1
Matplotlib   75.70  72.32   73.98   81.75  63.99   71.79      82.59  69.34   75.44    81.50  83.92   82.70
NumPy        81.00  60.40   69.21   79.31  58.47   67.31      80.62  56.67   66.44    78.24  62.91   69.74
Pandas       80.50  83.28   81.82   81.77  68.01   74.25      83.22  75.22   79.65    82.73  85.30   82.80
OpenGL       77.58  85.37   81.29   83.07  68.03   75.04      98.52  72.08   83.05    85.83  83.74   84.77
JDBC         68.65  61.25   64.47   64.22  65.62   64.91      99.28  43.12   66.13    84.69  55.31   66.92
React        69.90  85.71   77.01   84.37  75.00   79.41      98.79  65.08   78.47    87.95  78.17   82.77

Approach: We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without character-level CNN, one without word embeddings, and one without sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors. The sketch after this paragraph illustrates the idea.
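Conceptually, each ablated variant disables one feature extractor while the rest of the architecture and the training procedure stay unchanged. The PyTorch sketch below illustrates this idea; the dimensions, layer shapes and class name are assumptions for illustration, not our exact implementation.

import torch
import torch.nn as nn

class APIExtractor(nn.Module):
    # Illustrative three-extractor tagger with ablation switches.
    # At least one of char_cnn / word_emb must stay enabled.
    def __init__(self, char_cnn=True, word_emb=True, bilstm=True,
                 n_words=5000, n_chars=100, n_tags=2):
        super().__init__()
        self.flags = (char_cnn, word_emb, bilstm)
        self.char_emb = nn.Embedding(n_chars, 30)
        self.char_cnn = nn.Conv1d(30, 30, kernel_size=3, padding=1)
        self.word_emb = nn.Embedding(n_words, 100)
        dim = (30 if char_cnn else 0) + (100 if word_emb else 0)
        self.bilstm = nn.LSTM(dim, 100, bidirectional=True, batch_first=True)
        self.out = nn.Linear(200 if bilstm else dim, n_tags)

    def forward(self, words, chars):
        # words: (batch, seq); chars: (batch, seq, max_word_len)
        use_char, use_word, use_lstm = self.flags
        feats = []
        if use_char:
            b, s, w = chars.shape
            c = self.char_emb(chars.view(b * s, w)).transpose(1, 2)
            c = self.char_cnn(c).max(dim=2).values  # max-pool over characters
            feats.append(c.view(b, s, -1))
        if use_word:
            feats.append(self.word_emb(words))
        x = torch.cat(feats, dim=-1)
        if use_lstm:
            x, _ = self.bilstm(x)                   # sentence-context features
        return self.out(x)                          # per-token API/O scores

# One variant per ablation experiment, trained identically:
variants = {'ablate_char_cnn': APIExtractor(char_cnn=False),
            'ablate_word_emb': APIExtractor(word_emb=False),
            'ablate_bilstm': APIExtractor(bilstm=False),
            'all_features': APIExtractor()}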

Results: In Table 5, we highlight in bold the largest drop in each metric for a library when ablating a feature, compared with that metric with all features. We underline the increase in each metric for a library when ablating a feature, compared with that metric with all features. We can see that:

• The performance of our neural architecture is contributed by the combined action of all its features. Ablating any of the features degrades the F1-score. Ablating word embeddings or Bi-LSTM causes the largest drop in F1-score for four libraries, while ablating char-CNN causes the largest drop in F1-score for two libraries. Feature ablation has higher impact on some libraries than others. For example, ablating char-CNN, word embeddings and Bi-LSTM all cause a significant drop in F1-score for Matplotlib, and ablating word embeddings causes a significant drop in F1-score for Pandas and React. In contrast, ablating a feature causes a relatively minor drop in F1-score for NumPy, JDBC and React. This indicates that the different levels of features learned by our neural architecture can be more distinct for some libraries but more complementary for others. When different features are more distinct from one another, achieving good API extraction performance relies more on all the features.

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating char-CNN causes a drop in precision for five libraries (all except NumPy). The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating char-CNN, ablating word embeddings and Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for NumPy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating Bi-LSTM the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. Char-CNN is especially useful for filtering out non-API tokens, while word embeddings and Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation: As described in Section 6.1.1, we intentionally choose three different-functionality Python libraries: Pandas, Matplotlib, NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach: We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring the Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfer (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs involving Matplotlib are relatively more-distant-libraries transfer.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL)

        NP → MPL                PD → MPL                MPL
        Prec   Recall  F1      Prec   Recall  F1       Prec   Recall  F1
1/1     82.64  89.29   85.84   81.02  80.06   80.58    87.67  78.27   82.70
1/2     81.84  84.52   83.16   71.61  83.33   76.96    81.38  70.24   75.40
1/4     71.83  75.89   73.81   67.88  77.98   72.65    81.22  55.36   65.84
1/8     70.56  75.60   72.98   69.66  73.81   71.71    75.00  53.57   62.50
1/16    73.56  72.02   72.78   66.48  72.02   69.16    80.70  27.38   40.89
1/32    72.56  78.83   73.69   71.47  69.35   70.47    97.50  11.60   20.74
DU      72.54  66.07   69.16   76.99  54.76   64.00    -      -       -

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP)

        MPL → NP                PD → NP                 NP
        Prec   Recall  F1      Prec   Recall  F1       Prec   Recall  F1
1/1     86.85  77.08   81.68   77.51  67.50   72.16    78.24  62.92   69.74
1/2     78.80  82.08   80.41   70.13  67.51   68.79    75.13  60.42   66.97
1/4     93.64  76.67   80.00   65.88  70.00   67.81    77.14  45.00   56.84
1/8     73.73  78.33   75.96   65.84  66.67   66.25    71.03  42.92   53.51
1/16    76.19  66.67   71.11   58.33  64.17   61.14    57.07  48.75   52.58
1/32    75.54  57.92   65.56   60.27  56.25   58.23    72.72  23.33   35.33
DU      64.16  65.25   64.71   62.96  59.32   61.08    -      -       -

We use gradually-reduced target-library training data (1/1 for all data, 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison. We also use the source-library-trained model directly without any fine-tuning (i.e., 0/1 target-library data) as a baseline. In addition, we randomly select 1/16 and 1/32 training data for NumPy → Pandas and Matplotlib → Pandas 10 times and train the model each time. Then we calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score. A sketch of this fine-tuning procedure follows.
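A minimal sketch of this fine-tuning procedure, reusing the illustrative APIExtractor from the previous sketch, is shown below; the checkpoint path, optimizer settings and batch layout are assumptions for illustration, not our released code.

import random
import torch

def fine_tune(model, target_posts, fraction, epochs=10, lr=1e-3):
    # Fine-tune a source-library-trained model on a fraction of the
    # target-library training data; all parameters remain trainable.
    random.seed(0)  # one seed shown; we repeat with 10 random draws
    subset = random.sample(target_posts,
                           max(1, int(len(target_posts) * fraction)))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for words, chars, tags in subset:  # pre-tensorized labeled posts
            opt.zero_grad()
            scores = model(words, chars)   # (batch, seq, n_tags)
            loss = loss_fn(scores.view(-1, scores.size(-1)), tags.view(-1))
            loss.backward()
            opt.step()
    return model

# e.g., Matplotlib -> NumPy with 1/16 of the NumPy training data:
# model = APIExtractor()
# model.load_state_dict(torch.load('matplotlib_model.pt'))  # hypothetical path
# fine_tune(model, numpy_train_posts, fraction=1/16)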


TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL → PD                NP → PD                 PD
        Prec   Recall  F1      Prec   Recall  F1       Prec   Recall  F1
1/1     84.97  89.62   87.23   86.86  87.61   87.23    80.43  85.30   82.80
1/2     86.18  84.43   85.30   83.97  83.00   83.48    88.01  74.06   80.43
1/4     81.07  87.61   84.21   81.07  82.70   81.88    76.50  76.95   76.72
1/8     87.54  80.98   84.13   85.76  76.37   80.79    69.30  83.29   75.65
1/16    82.04  78.96   80.47   82.45  75.79   78.98    84.21  50.72   53.31
1/32    81.31  75.22   78.14   81.43  72.05   76.45    84.25  30.84   45.14
DU      71.65  40.06   51.39   75.00  35.45   48.14    -      -       -

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

        MPL → PD                NP → PD
        Prec   Recall  F1      Prec   Recall  F1
1/16    83.33  77.81   80.43   83.39  75.22   79.09
1/32    79.50  73.78   76.53   86.15  71.15   78.30

Results: Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. The acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas) and DU (Direct Use). The last column gives the results of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performing target-library model than training the target-library model from scratch with the same proportion of training data. This is evident as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has F1-score 69.74. However, the NumPy model transferred from the Matplotlib model has F1-score 81.68 (17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 NumPy training data for fine-tuning, the F1-score (71.11) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for the difficult dataset. Such performance boost by transfer learning has also been observed in many studies [14], [44], [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has a better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of F1-score (e.g., 3% of the F1-score for the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has F1-score 80.47.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 training data, i.e., about 10 posts, for transfer learning, the F1-score of the obtained target-library model is still improved a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (DU row). Furthermore, the averaged results of NP → PD and MPL → PD for 10 randomly selected 1/16 and 1/32 training data are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows our training data is unbiased. Although the target-library model trained from scratch still has a reasonable F1-score with 1/2 to 1/8 training data, it becomes completely useless with 1/16 or 1/32 training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-scores in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16), compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce much more balanced precision and recall (thus a much better F1-score), even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with F1-score > 80 using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences in API naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning is. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 target-library training data, compared with the target-library model trained from scratch using all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 training data, compared with the Matplotlib model trained from scratch using all training data.

• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the fewest polysemous APIs (16%), and Pandas and NumPy have many more polysemous APIs. Both MPL → PD and MPL → NP produce high-quality target models even with 1/8 target-library training data. The transfer learning seems to be able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 training data to obtain a high-quality target model (F1-score 83.16), and for PD → MPL the Matplotlib model (F1-score 80.58) obtained by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.7). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs", learned from NumPy and Pandas, to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation: In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach: For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts in each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model (see the sketch below). For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational databases). For JavaScript, we choose React (a library for web graphical user interfaces). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.
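The pooled source-language training set can be assembled as in the small sketch below; labeled_posts and the sampling seed are illustrative assumptions.

import random

def build_python_training_set(labeled_posts, per_library=200, seed=42):
    # labeled_posts: dict mapping library name -> list of labeled posts.
    # Draw 200 posts per Python library and pool them into the
    # 600-post source-language training set.
    random.seed(seed)
    pooled = []
    for library in ('matplotlib', 'numpy', 'pandas'):
        pooled += random.sample(labeled_posts[library], per_library)
    return pooled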

Results: Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

TABLE 10: Python → Java

        Python → Java           Java
        Prec   Recall  F1      Prec   Recall  F1
1/1     77.45  66.56   71.60   84.69  55.31   66.92
1/2     72.20  62.50   67.06   77.38  53.44   63.22
1/4     69.26  50.00   58.08   71.22  47.19   56.77
1/8     50.00  64.06   56.16   75.15  39.69   51.94
1/16    55.71  48.25   52.00   75.69  34.06   46.98
1/32    56.99  40.00   44.83   77.89  23.12   35.66
DU      44.44  28.75   34.91   -      -       -

TABLE 11: Python → JavaScript

        Python → JavaScript     JavaScript
        Prec   Recall  F1      Prec   Recall  F1
1/1     77.45  81.75   86.19   87.95  78.17   82.77
1/2     86.93  76.59   81.43   87.56  59.84   77.70
1/4     86.84  68.65   74.24   83.08  66.26   73.72
1/8     81.48  61.11   69.84   85.98  55.95   67.88
1/16    71.11  63.49   68.08   87.38  38.48   53.44
1/32    66.67  52.38   58.67   65.21  35.71   46.15
DU      51.63  25.00   33.69   -      -       -

• Directly reusing the source-language-trained model on the target-language text produces unacceptable performance. For the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16), PD → MPL (64.00), MPL → NP (64.71). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91), Python → JavaScript (33.69), Python → C (35.95). These results suggest that there is still a certain level of commonality between libraries with similar functionalities and of the same programming language, so that the knowledge learned in the source-library model may be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or we may exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning exhibits the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 Java training data boosts the F1-score by 50%, compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-language training data. For within-language transfer learning, achieving this result may require as little as 1/8 target-library training data.


TABLE 12: Python → C

        Python → C              C
        Prec   Recall  F1      Prec   Recall  F1
1/1     89.87  85.09   87.47   85.83  83.74   84.77
1/2     85.35  83.73   84.54   86.28  76.69   81.21
1/4     83.83  75.88   79.65   82.19  71.27   76.34
1/8     74.32  73.71   74.01   78.68  68.02   72.97
1/16    75.81  69.45   72.60   88.52  57.10   69.42
1/32    69.62  65.85   67.79   87.89  45.25   59.75
DU      56.49  27.91   35.95   -      -       -


Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is the API labeling errors in the dataset. In order to decrease errors, the first two authors first independently label the same data and then resolve the disagreements by discussion. Furthermore, in our experiments we have to examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API-related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to some API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs such as camelcase, special characters (e.g., "." or "()") and API annotations.

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF based approach for extracting mentions of software entities such as programming languages, libraries, computing concepts and APIs from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features have superior performance over human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture together with transfer learning makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture in many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, "Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow."
[2] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, "Learning to extract API mentions from informal natural language discussions," in Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 2016, pp. 389-399.
[3] B. Dagenais and M. P. Robillard, "Recovering traceability links between an API and its learning resources," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 47-57.
[4] S. Subramanian, L. Inozemtseva, and R. Holmes, "Live API documentation," in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 643-652.
[5] C. Chen, Z. Xing, and X. Wang, "Unsupervised software-specific morphological forms inference from informal discussions," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 450-461.
[6] X. Chen, C. Chen, D. Zhang, and Z. Xing, "SEthesaurus: WordNet in software engineering," IEEE Transactions on Software Engineering, 2019.
[7] H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, "Improving API caveats accessibility by mining API caveats knowledge graph," in Software Maintenance and Evolution (ICSME), 2018 IEEE International Conference on. IEEE, 2018.
[8] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, "API method recommendation without worrying about the task-API knowledge gap," in Automated Software Engineering (ASE), 2018 33rd IEEE/ACM International Conference on. IEEE, 2018.
[9] B. Xu, Z. Xing, X. Xia, and D. Lo, "AnswerBot: Automated generation of answer summary to developers' technical questions," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 706-716.
[10] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre, "Software-specific named entity recognition in software engineering social content," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 90-101.
[11] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. Association for Computational Linguistics, 2003, pp. 142-147.
[12] R. Caruana, "Learning many related tasks at the same time with backpropagation," in Advances in Neural Information Processing Systems, 1995, pp. 657-664.
[13] A. Arnold, R. Nallapati, and W. W. Cohen, "Exploiting feature hierarchy for transfer learning in named entity recognition," Proceedings of ACL-08: HLT, pp. 245-253, 2008.
[14] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320-3328.
[15] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, "APIReal: An API recognition and linking approach for online developer forums," Empirical Software Engineering, 2018.
[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69-78.
[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.
[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744-755.
[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.
[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.
[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions - incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338-348.
[22] C. Chen and Z. Xing, "SimilarTech: Automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834-839.
[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155-1194, 2019.
[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: Distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214-224.
[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177-2185.
[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for Twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555-1565.
[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211-225, 2015.
[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.
[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: A deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.
[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414-421.
[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1-32, 2019.
[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17-36.
[39] S. J. Pan, Q. Yang et al., "A survey on transfer learning."
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. ACM, 2010, pp. 375-384.
[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE '09. 16th Working Conference on. IEEE, 2009, pp. 205-214.
[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717-1724.
[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.
[47] A. Marcus, J. Maletic et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125-135.
[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59-68.
[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480-483.
[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307-318.
[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392-403.
[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832-841.
[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13-22.
[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476-479.
[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741-2749.
[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211-1220.
[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.
[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649-657.
[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.
[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.
[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Facultyof Information Technology at Monash UniversityHe has research interest in the areas of soft-ware engineering Deep learning and Human-computer Interaction He is currently focusing onimproving the usability and accessibility of mo-bile application He received the BS and MSdegrees from Beijing Technology and BusinessUniversity and the Australian National Universityin 2016 and 2018 respectively And he will bea PhD student at Monash University in 2020

under the supervision of Chunyang Chen

Zhenchang Xing is a Senior Lecturer in the Re-search School of Computer Science AustralianNational University Previously he was an Assis-tant Professor in the School of Computer Sci-ence and Engineering Nanyang TechnologicalUniversity Singapore from 2012-2016 Beforejoining NTU Dr Xing was a Lee Kuan Yew Re-search Fellow in the School of Computing Na-tional University of Singapore from 2009-2012Dr Xings current research area is in the interdis-plinary areas of software engineering human-

computer interaction and applied data analytics Dr Xing has over 100publications in peer-refereed journals and conference proceedings andhave received several distinguished paper awards from top softwareengineering conferences Dr Xing regularly serves on the organizationand program committees of the top software engineering conferencesand he will be the program committee co-chair for ICSME2020

Chunyang Chen is a lecturer (Assistant Pro-fessor) in Faculty of Information TechnologyMonash University Australia His research fo-cuses on software engineering deep learningand human-computer interaction He has pub-lished over 16 papers in referred journals orconferences including Empirical Software Engi-neering ICSE ASE CSCW ICSME SANERHe is a member of IEEE and ACM He receivedACM SIGSOFT distinguished paper award inASE 2018 best paper award in SANER 2016

and best tool demo in ASE 2016 httpschunyang-chengithubio

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 16

Cheng Chen received his Bachelor degree inSoftware Engineering from Northwest University(China) and received Master degree of Com-puting (major in Artificial Intelligence) from theAustralian National University He did some in-ternship projects about Natural Language Pro-cessing at CSIRO from 2016 to 2017 He workedas a research assistant supervised by Dr Zhen-chang Xing at ANU in 2018 Cheng currentlyworks in PricewaterhouseCoopers (PwC) Firmas a senior algorithm engineer of Natural Lan-

guage Processing and Data Analyzing Cheng is interested in NamedEnti ty Extraction Relation Extraction Text summarization and parallelcomputing He is working on knowledge engineering and transaction forNLP tasks

Lizhen Qu is a research fellow in the DialogueResearch lab in the Faculty of Information Tech-nology at Monash University He has exten-sive research experience in the areas of naturallanguage processing multimodal learning deeplearning and Cybersecurity He is currently fo-cusing on information extraction semantic pars-ing and multimodal dialogue systems Prior tojoining Monash University he worked as a re-search scientist at Data61CSIRO where he ledand participated in several research and indus-

trial projects including Deep Learning for Text and Deep Learning forCyber

Guoqiang Li is now an associate professor inschool of software Shanghai Jiao Tong Univer-sity and a guest associate professor in KyushuUniversity He received the BS MS and PhDdegrees from Taiyuan University of TechnologyShanghai Jiao Tong University and Japan Ad-vanced Institute of Science and Technology in2001 2005 and 2008 respectively He workedas a postdoctoral research fellow in the graduateschool of information science Nagoya Univer-sity Japan during 2008-2009 as an assistant

professor in the school of software Shanghai Jiao Tong Universityduring 2009-2013 and as an academic visitor in the department ofcomputer science University of Oxford during 2015-2016 His researchinterests include formal verification programming language theory anddata analytics and intelligence He published more than 40 researchespapers in the international journals and mainstream conferences includ-ing TDSC SPE TECS IJFCS SCIS CSCW ICSE FORMATS ATVAetc

Page 10: JOURNAL OF LA Easy-to-Deploy API Extraction by Multi-Level ... · sufficient high-quality training data for APIs of some less frequently discussed libraries or frameworks. Another

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 10

TABLE 5: The Results of Feature Ablation Experiments

           | Ablating char-CNN  | Ablating Word Embeddings | Ablating Bi-LSTM   | All features
           | Prec  Recall F1    | Prec  Recall F1          | Prec  Recall F1    | Prec  Recall F1
Matplotlib | 75.70 72.32  73.98 | 81.75 63.99  71.79       | 82.59 69.34  75.44 | 81.50 83.92  82.70
NumPy      | 81.00 60.40  69.21 | 79.31 58.47  67.31       | 80.62 56.67  66.44 | 78.24 62.91  69.74
Pandas     | 80.50 83.28  81.82 | 81.77 68.01  74.25       | 83.22 75.22  79.65 | 82.73 85.30  82.80
OpenGL     | 77.58 85.37  81.29 | 83.07 68.03  75.04       | 98.52 72.08  83.05 | 85.83 83.74  84.77
JDBC       | 68.65 61.25  64.47 | 64.22 65.62  64.91       | 99.28 43.12  66.13 | 84.69 55.31  66.92
React      | 69.90 85.71  77.01 | 84.37 75.00  79.41       | 98.79 65.08  78.47 | 87.95 78.17  82.77

Approach: We ablate one kind of feature at a time from our neural architecture. That is, we obtain a model without character-level CNN, one without word embeddings, and one without sentence-context Bi-LSTM. We compare the performance of these three models with that of the model with all three feature extractors.
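To make the ablation setup concrete, the following sketch shows one way to implement it in PyTorch. This is not the authors' implementation; all class names, dimensions and layer choices are illustrative assumptions. Each ablated variant in Table 5 corresponds to constructing the tagger with exactly one flag set to False.

import torch
import torch.nn as nn

class APITagger(nn.Module):
    """Per-token API/non-API tagger with switchable feature extractors."""
    def __init__(self, n_chars, n_words, n_tags,
                 use_char_cnn=True, use_word_emb=True, use_bilstm=True,
                 char_dim=30, word_dim=100, hidden=100):
        super().__init__()
        assert use_char_cnn or use_word_emb, "keep at least one token feature"
        self.use_char_cnn, self.use_word_emb, self.use_bilstm = \
            use_char_cnn, use_word_emb, use_bilstm
        feat_dim = 0
        if use_char_cnn:  # character-level CNN over each word's characters
            self.char_emb = nn.Embedding(n_chars, char_dim)
            self.char_cnn = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
            feat_dim += char_dim
        if use_word_emb:  # word-level embedding lookup
            self.word_emb = nn.Embedding(n_words, word_dim)
            feat_dim += word_dim
        if use_bilstm:    # sentence-context Bi-LSTM
            self.bilstm = nn.LSTM(feat_dim, hidden, bidirectional=True,
                                  batch_first=True)
        self.classifier = nn.Linear(2 * hidden if use_bilstm else feat_dim, n_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq); char_ids: (batch, seq, max_word_len)
        feats = []
        if self.use_char_cnn:
            b, s, c = char_ids.shape
            x = self.char_emb(char_ids.view(b * s, c)).transpose(1, 2)
            x = torch.relu(self.char_cnn(x)).max(dim=2).values  # pool over chars
            feats.append(x.view(b, s, -1))
        if self.use_word_emb:
            feats.append(self.word_emb(word_ids))
        h = torch.cat(feats, dim=-1)
        if self.use_bilstm:
            h, _ = self.bilstm(h)
        return self.classifier(h)  # per-token logits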

Results: In Table 5, we highlight in bold the largest drop in each metric for a library when ablating a feature, compared with that metric with all features. We underline the increase in each metric for a library when ablating a feature, compared with that metric with all features. We can see that:

• The performance of our neural architecture is contributed by the combined action of all its features. Ablating any of the features degrades the F1-score. Ablating word embeddings or Bi-LSTM causes the largest drop in F1-score for four libraries, while ablating char-CNN causes the largest drop in F1-score for two libraries. Feature ablation has higher impact on some libraries than others. For example, ablating char-CNN, word embeddings and Bi-LSTM all cause significant drops in F1-score for Matplotlib, and ablating word embeddings causes a significant drop in F1-score for Pandas and React. In contrast, ablating a feature causes a relatively minor drop in F1-score for NumPy, JDBC and React. This indicates that the different levels of features learned by our neural architecture can be more distinct for some libraries but more complementary for others. When different features are more distinct from one another, achieving good API extraction performance relies more on all the features.

• Different features play different roles in distinguishing API tokens from non-API tokens. Ablating char-CNN causes the drop in precision for five libraries except NumPy. The precision degrades rather significantly for Matplotlib (7.1%), OpenGL (9.6%), JDBC (18.9%) and React (20.5%). In contrast, ablating char-CNN causes a significant drop in recall only for Matplotlib (13.8%). For JDBC and React, ablating char-CNN even causes a significant increase in recall (10.7% and 9.6%, respectively). Different from the impact of ablating char-CNN, ablating word embeddings and Bi-LSTM usually causes the largest drop in recall, ranging from a 9.9% drop for NumPy to a 23.7% drop for Matplotlib. Ablating word embeddings causes a significant drop in precision only for JDBC, and it causes small changes in precision for the other five libraries (three with a small drop and two with a small increase). Ablating Bi-LSTM causes an increase in precision for all six libraries. Especially for OpenGL, JDBC and React, when ablating Bi-LSTM, the model has almost perfect precision (around 99%), but this comes with a significant drop in recall (13.9% for OpenGL, 19.3% for JDBC and 16.7% for React).

All feature extractors are important for high-quality API extraction. Char-CNN is especially useful for filtering out non-API tokens, while word embeddings and Bi-LSTM are especially useful for recalling API tokens.

6.4 Effectiveness of Within-Language Transfer Learning (RQ3)

Motivation: As described in Section 6.1.1, we intentionally choose three different-functionality Python libraries: Pandas, Matplotlib and NumPy. Pandas and NumPy are functionally similar, while the other two pairs are more distant. The three libraries also have different API-naming and API-mention characteristics (see Section 2). We want to investigate the effectiveness of transfer learning for API extraction across different pairs of libraries and with different amounts of target-library training data.

Approach: We use one library as the source library and one of the other two libraries as the target library. We have six transfer-learning pairs for the three libraries. We denote them as source-library-name → target-library-name, such as Pandas → NumPy, which represents transferring a Pandas-trained model to NumPy text. Two of these six pairs (Pandas → NumPy and NumPy → Pandas) are similar-libraries transfer (in terms of library functionalities and API-naming/mention characteristics), while the other four pairs involving Matplotlib are relatively more-distant-libraries transfer.

TABLE 6: NumPy (NP) or Pandas (PD) → Matplotlib (MPL)

      | NP→MPL             | PD→MPL             | MPL
      | Prec  Recall F1    | Prec  Recall F1    | Prec  Recall F1
1/1   | 82.64 89.29  85.84 | 81.02 80.06  80.58 | 87.67 78.27  82.70
1/2   | 81.84 84.52  83.16 | 71.61 83.33  76.96 | 81.38 70.24  75.40
1/4   | 71.83 75.89  73.81 | 67.88 77.98  72.65 | 81.22 55.36  65.84
1/8   | 70.56 75.60  72.98 | 69.66 73.81  71.71 | 75.00 53.57  62.50
1/16  | 73.56 72.02  72.78 | 66.48 72.02  69.16 | 80.70 27.38  40.89
1/32  | 72.56 78.83  73.69 | 71.47 69.35  70.47 | 97.50 11.60  20.74
DU    | 72.54 66.07  69.16 | 76.99 54.76  64.00 |

TABLE 7: Matplotlib (MPL) or Pandas (PD) → NumPy (NP)

      | MPL→NP             | PD→NP              | NP
      | Prec  Recall F1    | Prec  Recall F1    | Prec  Recall F1
1/1   | 86.85 77.08  81.68 | 77.51 67.50  72.16 | 78.24 62.92  69.74
1/2   | 78.80 82.08  80.41 | 70.13 67.51  68.79 | 75.13 60.42  66.97
1/4   | 93.64 76.67  80.00 | 65.88 70.00  67.81 | 77.14 45.00  56.84
1/8   | 73.73 78.33  75.96 | 65.84 66.67  66.25 | 71.03 42.92  53.51
1/16  | 76.19 66.67  71.11 | 58.33 64.17  61.14 | 57.07 48.75  52.58
1/32  | 75.54 57.92  65.56 | 60.27 56.25  58.23 | 72.72 23.33  35.33
DU    | 64.16 65.25  64.71 | 62.96 59.32  61.08 |

We use gradually-reduced target-library training data (1 for all data, 1/2, 1/4, ..., 1/32) to fine-tune the source-library-trained model. We also train a target-library model from scratch (i.e., with randomly initialized model parameters) using the same proportion of the target-library training data for comparison.


TABLE 8: Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

      | MPL→PD             | NP→PD              | PD
      | Prec  Recall F1    | Prec  Recall F1    | Prec  Recall F1
1/1   | 84.97 89.62  87.23 | 86.86 87.61  87.23 | 80.43 85.30  82.80
1/2   | 86.18 84.43  85.30 | 83.97 83.00  83.48 | 88.01 74.06  80.43
1/4   | 81.07 87.61  84.21 | 81.07 82.70  81.88 | 76.50 76.95  76.72
1/8   | 87.54 80.98  84.13 | 85.76 76.37  80.79 | 69.30 83.29  75.65
1/16  | 82.04 78.96  80.47 | 82.45 75.79  78.98 | 84.21 50.72  53.31
1/32  | 81.31 75.22  78.14 | 81.43 72.05  76.45 | 84.25 30.84  45.14
DU    | 71.65 40.06  51.39 | 75.00 35.45  48.14 |

TABLE 9: Averaged Matplotlib (MPL) or NumPy (NP) → Pandas (PD)

      | MPL→PD             | NP→PD
      | Prec  Recall F1    | Prec  Recall F1
1/16  | 83.33 77.81  80.43 | 83.39 75.22  79.09
1/32  | 79.50 73.78  76.53 | 86.15 71.15  78.30

We also use the source-library-trained model directly without any fine-tuning (i.e., 0% target-library data) as a baseline. We also randomly select 1/16 and 1/32 training data for NumPy → Pandas and Matplotlib → Pandas for 10 times and train the model for each time. Then we calculate the averaged precision and recall values. Based on the averaged precision and recall values, we calculate the averaged F1-score.
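Below is a hedged sketch of this fine-tuning protocol and of the averaged-F1 computation, assuming a token-tagging model like the APITagger sketched earlier and a dataset of (word_ids, char_ids, tags) tensor tuples; the optimizer, epochs and learning rate are illustrative assumptions, not the paper's settings.

import copy
import random
import torch
import torch.nn.functional as F

def fine_tune(source_model, target_data, fraction, epochs=10, lr=1e-3):
    """Fine-tune a copy of a source-trained model on a random fraction
    of the target training data (1, 1/2, ..., 1/32 in the experiments)."""
    subset = random.sample(target_data, max(1, int(len(target_data) * fraction)))
    model = copy.deepcopy(source_model)  # start from source-library weights
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for word_ids, char_ids, tags in subset:
            opt.zero_grad()
            logits = model(word_ids, char_ids)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), tags.view(-1))
            loss.backward()
            opt.step()
    return model

def averaged_f1(precisions, recalls):
    """F1 from averaged precision/recall over repeated runs (as in Table 9)."""
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 2 * p * r / (p + r)

The from-scratch baseline uses the same training loop but starts from a freshly initialized model instead of a copy of the source weights.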

Results: Table 6, Table 7 and Table 8 show the experiment results of the six pairs of transfer learning. Table 9 shows the averaged experiment results of NumPy → Pandas and Matplotlib → Pandas with 1/16 and 1/32 training data. Acronyms stand for MPL (Matplotlib), NP (NumPy), PD (Pandas), DU (Direct Use). The last column is the result of training the target-library model from scratch. We can see that:

• Transfer learning can produce a better-performance target-library model than training the target-library model from scratch with the same proportion of training data. This is evident as the F1-score of the target-library model obtained by fine-tuning the source-library-trained model is higher than the F1-score of the target-library model trained from scratch in all the transfer settings but PD → MPL at 1/1. Especially for NumPy, the NumPy model trained from scratch with all NumPy training data has F1-score 69.74%. However, the NumPy model transferred from the Matplotlib model has F1-score 81.68% (a 17.1% improvement). For the Matplotlib → NumPy transfer, even with only 1/16 NumPy training data for fine-tuning, the F1-score (71.11%) of the transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with all NumPy training data. This suggests that transfer learning may boost the model performance even for a difficult dataset. Such performance boost by transfer learning has also been observed in many studies [14], [44], [45]. The reason is that a target-library model can "reuse" much knowledge in the source-library model, rather than having to learn completely new knowledge from scratch.

• Transfer learning can reduce the amount of target-library training data required to obtain a high-quality model. For four out of six transfer-learning pairs (NP → MPL, MPL → NP, NP → PD, MPL → PD), the reduction ranges from 50% (1/2) to 87.5% (7/8), while the resulting target-library model still has a better F1-score than the target-library model trained from scratch using all target-library training data. If we allow a small degradation of F1-score (e.g., 3% of the F1-score for the target-library model trained from scratch using all target-library training data), the reduction can go up to 93.8% (15/16). For example, for Matplotlib → Pandas, using 1/16 Pandas training data (i.e., only about 20 posts) for fine-tuning, the obtained Pandas model still has F1-score 80.47%.

• Transfer learning is very effective in few-shot training settings. Few-shot training refers to training a model with only a very small amount of data [46]. In our experiments, using 1/32 training data, i.e., about 10 posts, for transfer learning, the F1-score of the obtained target-library model is still improved by a few points for NP → MPL, PD → MPL and MPL → NP, and is significantly improved for MPL → PD (52.9%) and NP → PD (58.3%), compared with directly reusing the source-library-trained model without fine-tuning (DU row). Furthermore, the averaged results of NP → PD and MPL → PD for 10 randomly selected 1/16 and 1/32 training data are similar to the results in Table 8, and the variances are all smaller than 0.3, which shows our training data is unbiased. Although the target-library model trained from scratch still has a reasonable F1-score with 1/2 to 1/8 training data, it becomes completely useless with 1/16 or 1/32 training data. In contrast, the target-library models obtained through transfer learning have significantly better F1-scores in such few-shot settings. Furthermore, training a model from scratch using few-shot data may result in an abnormal increase in precision (e.g., MPL at 1/16 and 1/32, NP at 1/32, PD at 1/32) or in recall (e.g., NP at 1/16) compared with training the model from scratch using more data. This abnormal increase in precision (or recall) comes with a sharp decrease in recall (or precision). Our analysis shows that this phenomenon is caused by the biased training data in few-shot settings. In contrast, the target-library models obtained through transfer learning can reuse knowledge in the source model and produce a much more balanced precision and recall (thus a much better F1-score) even in the face of biased few-shot training data.

• The effectiveness of transfer learning does not correlate with the quality of the source-library-trained model. Based on a not-so-high-quality source model (e.g., NumPy), we can still obtain a high-quality target model through transfer learning. For example, NP → PD results in a Pandas model with F1-score > 80% using only 1/8 of the Pandas training data. On the other hand, a high-quality source model (e.g., Pandas) may not boost the performance of the target model. Among all 36 transfer settings, we have one such case, i.e., PD → MPL at 1/1. The Matplotlib model transferred from the Pandas model using all Matplotlib training data is slightly worse than the Matplotlib model trained from scratch using all training data. This can be attributed to the differences in API naming characteristics between the two libraries (see the last bullet).

• The more similar the functionalities and the API-naming/mention characteristics between the two libraries are, the more effective the transfer learning can be. Pandas and NumPy have similar functionalities and API-naming/mention characteristics. For PD → NP and NP → PD, the target-library model is less than 5% worse in F1-score with 1/8 target-library training data, compared with the target-library model trained from scratch using


all training data. In contrast, PD → MPL has a 6.9% drop in F1-score with 1/2 training data, and NP → MPL has a 10.7% drop in F1-score with 1/4 training data, compared with the Matplotlib model trained from scratch using all training data.

• Knowledge about non-polysemous APIs seems easy to expand, while knowledge about polysemous APIs seems difficult to adapt. Matplotlib has the fewest polysemous APIs (16), and Pandas and NumPy have many more polysemous APIs. Both MPL → PD and MPL → NP have high-quality target models even with 1/8 target-library training data. The transfer learning seems to be able to add the new knowledge "common words can be APIs" to the target-library model. In contrast, NP → MPL requires at least 1/2 training data to have a high-quality target model (F1-score 83.16%), and for PD → MPL, the Matplotlib model (F1-score 80.58%) by transfer learning with all training data is even slightly worse than the model trained from scratch (F1-score 82.70%). These results suggest that it can be difficult to adapt the knowledge "common words can be APIs" learned from NumPy and Pandas to Matplotlib, which does not have many common-word APIs.

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation: In RQ3, we investigate the transferability of the API extraction model across three Python libraries. In RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach: For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts in each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model. For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational databases). For JavaScript, we choose React (a library for web graphical user interfaces). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.
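As a concrete illustration of this data preparation, the sketch below (with hypothetical helper and variable names, not from the paper) combines 200 randomly sampled posts per Python library into the source training set and derives the gradually-reduced target-language subsets:

import random

def build_source_and_subsets(py_datasets, target_posts, n_per_lib=200,
                             fractions=(1, 1/2, 1/4, 1/8, 1/16, 1/32)):
    """Combine posts from the three Python libraries into one source
    training set, and derive gradually-reduced target-language subsets."""
    source_train = [p for ds in py_datasets for p in random.sample(ds, n_per_lib)]
    target_subsets = {f: random.sample(target_posts,
                                       max(1, int(len(target_posts) * f)))
                      for f in fractions}
    return source_train, target_subsets

# e.g. with dummy data: three 600-post Python datasets, 600 target posts
py = [list(range(600)) for _ in range(3)]
src, subs = build_source_and_subsets(py, list(range(600)))
print(len(src), sorted(len(s) for s in subs.values()))
# 600 [18, 37, 75, 150, 300, 600]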

Results: Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

• Directly reusing the source-language-trained model on the target-language text produces unacceptable performance.

TABLE 10: Python → Java

      | Python→Java        | Java
      | Prec  Recall F1    | Prec  Recall F1
1/1   | 77.45 66.56  71.60 | 84.69 55.31  66.92
1/2   | 72.20 62.50  67.06 | 77.38 53.44  63.22
1/4   | 69.26 50.00  58.08 | 71.22 47.19  56.77
1/8   | 50.00 64.06  56.16 | 75.15 39.69  51.94
1/16  | 55.71 48.25  52.00 | 75.69 34.06  46.98
1/32  | 56.99 40.00  44.83 | 77.89 23.12  35.66
DU    | 44.44 28.75  34.91 |

TABLE 11: Python → JavaScript

      | Python→JavaScript  | JavaScript
      | Prec  Recall F1    | Prec  Recall F1
1/1   | 77.45 81.75  86.19 | 87.95 78.17  82.77
1/2   | 86.93 76.59  81.43 | 87.56 59.84  77.70
1/4   | 86.84 68.65  74.24 | 83.08 66.26  73.72
1/8   | 81.48 61.11  69.84 | 85.98 55.95  67.88
1/16  | 71.11 63.49  68.08 | 87.38 38.48  53.44
1/32  | 66.67 52.38  58.67 | 65.21 35.71  46.15
DU    | 51.63 25.00  33.69 |

For the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16%), PD → MPL (64.00%), MPL → NP (64.71%). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91%), Python → JavaScript (33.69%), Python → C (35.95%). These results suggest that there is still a certain level of commonality between libraries with similar functionalities in the same programming language, so that the knowledge learned in the source-library model may be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model on the text of another language. In such cases, we have to either train the model for each language or library from scratch, or we may exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning exhibits the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 target-language training data, we can obtain a fine-tuned target-language model whose F1-score can double that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 Java training data boosts the F1-score by 50% compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-


TABLE 12: Python → C

      | Python→C           | C
      | Prec  Recall F1    | Prec  Recall F1
1/1   | 89.87 85.09  87.47 | 85.83 83.74  84.77
1/2   | 85.35 83.73  84.54 | 86.28 76.69  81.21
1/4   | 83.83 75.88  79.65 | 82.19 71.27  76.34
1/8   | 74.32 73.71  74.01 | 78.68 68.02  72.97
1/16  | 75.81 69.45  72.60 | 88.52 57.10  69.42
1/32  | 69.62 65.85  67.79 | 87.89 45.25  59.75
DU    | 56.49 27.91  35.95 |

language training data. For within-language transfer learning, achieving this result may require as little as 1/8 target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is the API labeling errors in the dataset. In order to decrease errors, the first two authors first independently label the same data and then resolve the disagreements by discussion. Furthermore, in our experiments, we have to examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from

Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs, such as camelcase, special characters (e.g., . or ()) and API annotations.

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].
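To illustrate this limitation, consider the following small example (an illustrative pattern, not one taken from the cited works), which keys on distinct orthographic features and therefore recognizes qualified, camel-case and ()-suffixed mentions while inevitably missing polysemous simple names such as apply and series:

import re

# Illustrative orthographic-feature pattern: qualified names, camelCase,
# and ()-suffixed mentions. Not the exact rules used in prior work.
API_PATTERN = re.compile(
    r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+"   # qualified names, e.g. df.apply
    r"|\b[a-z]+[A-Z]\w*\b"               # camelCase, e.g. executeQuery
    r"|\b\w+\(\)"                        # call-like mentions, e.g. apply()
)

text = "You can use apply on a series: try df.apply(f) or executeQuery()."
print(API_PATTERN.findall(text))
# ['df.apply', 'executeQuery'] -- the bare mentions 'apply' and 'series'
# are missed, which is the typical failure mode on informal text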

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] propose a CRF based approach for extracting mentions of software entities, such as programming languages, libraries, computing concepts and APIs, from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features outperform human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture in many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, "Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow."
[2] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, "Learning to extract API mentions from informal natural language discussions," in Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 2016, pp. 389–399.
[3] B. Dagenais and M. P. Robillard, "Recovering traceability links between an API and its learning resources," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 47–57.
[4] S. Subramanian, L. Inozemtseva, and R. Holmes, "Live API documentation," in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 643–652.
[5] C. Chen, Z. Xing, and X. Wang, "Unsupervised software-specific morphological forms inference from informal discussions," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 450–461.
[6] X. Chen, C. Chen, D. Zhang, and Z. Xing, "SEthesaurus: WordNet in software engineering," IEEE Transactions on Software Engineering, 2019.
[7] H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, "Improving API caveats accessibility by mining API caveats knowledge graph," in Software Maintenance and Evolution (ICSME), 2018 IEEE International Conference on. IEEE, 2018.
[8] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, "API method recommendation without worrying about the task-API knowledge gap," in Automated Software Engineering (ASE), 2018 33rd IEEE/ACM International Conference on. IEEE, 2018.
[9] B. Xu, Z. Xing, X. Xia, and D. Lo, "AnswerBot: Automated generation of answer summary to developers' technical questions," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 706–716.
[10] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre, "Software-specific named entity recognition in software engineering social content," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 90–101.
[11] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. Association for Computational Linguistics, 2003, pp. 142–147.
[12] R. Caruana, "Learning many related tasks at the same time with backpropagation," in Advances in Neural Information Processing Systems, 1995, pp. 657–664.
[13] A. Arnold, R. Nallapati, and W. W. Cohen, "Exploiting feature hierarchy for transfer learning in named entity recognition," Proceedings of ACL-08: HLT, pp. 245–253, 2008.
[14] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[15] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, "APIReal: An API recognition and linking approach for online developer forums," Empirical Software Engineering, 2018.
[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78.
[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.
[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744–755.
[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.
[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.
[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions – incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338–348.
[22] C. Chen and Z. Xing, "SimilarTech: Automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834–839.
[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155–1194, 2019.
[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: Distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214–224.
[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177–2185.
[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for Twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555–1565.
[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.
[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: A deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.
[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414–421.
[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1–32, 2019.
[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.
[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.
[39] S. J. Pan, Q. Yang et al., "A survey on transfer learning."
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. ACM, 2010, pp. 375–384.
[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE'09. 16th Working Conference on. IEEE, 2009, pp. 205–214.
[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717–1724.
[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.
[47] A. Marcus, J. Maletic et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125–135.
[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59–68.
[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480–483.
[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307–318.
[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392–403.
[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832–841.
[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13–22.
[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476–479.
[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741–2749.
[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211–1220.
[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.
[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.
[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.
[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Faculty of Information Technology at Monash University. He has research interests in the areas of software engineering, deep learning and human-computer interaction. He is currently focusing on improving the usability and accessibility of mobile applications. He received the BS and MS degrees from Beijing Technology and Business University and the Australian National University in 2016 and 2018, respectively, and he will be a PhD student at Monash University in 2020 under the supervision of Chunyang Chen.

Zhenchang Xing is a Senior Lecturer in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012-2016. Before joining NTU, Dr. Xing was a Lee Kuan Yew Research Fellow in the School of Computing, National University of Singapore, from 2009-2012. Dr. Xing's current research area is in the interdisciplinary areas of software engineering, human-computer interaction and applied data analytics. Dr. Xing has over 100 publications in peer-refereed journals and conference proceedings and has received several distinguished paper awards from top software engineering conferences. Dr. Xing regularly serves on the organization and program committees of the top software engineering conferences, and he will be the program committee co-chair for ICSME 2020.

Chunyang Chen is a lecturer (Assistant Professor) in the Faculty of Information Technology, Monash University, Australia. His research focuses on software engineering, deep learning and human-computer interaction. He has published over 16 papers in refereed journals or conferences, including Empirical Software Engineering, ICSE, ASE, CSCW, ICSME and SANER. He is a member of IEEE and ACM. He received the ACM SIGSOFT distinguished paper award in ASE 2018, the best paper award in SANER 2016, and the best tool demo in ASE 2016. https://chunyang-chen.github.io

Cheng Chen received his Bachelor degree in Software Engineering from Northwest University (China) and received his Master degree of Computing (major in Artificial Intelligence) from the Australian National University. He did some internship projects about Natural Language Processing at CSIRO from 2016 to 2017. He worked as a research assistant supervised by Dr. Zhenchang Xing at ANU in 2018. Cheng currently works at PricewaterhouseCoopers (PwC) as a senior algorithm engineer of Natural Language Processing and Data Analyzing. Cheng is interested in Named Entity Extraction, Relation Extraction, Text Summarization and parallel computing. He is working on knowledge engineering and transaction for NLP tasks.

Lizhen Qu is a research fellow in the Dialogue Research lab in the Faculty of Information Technology at Monash University. He has extensive research experience in the areas of natural language processing, multimodal learning, deep learning and cybersecurity. He is currently focusing on information extraction, semantic parsing and multimodal dialogue systems. Prior to joining Monash University, he worked as a research scientist at Data61/CSIRO, where he led and participated in several research and industrial projects, including Deep Learning for Text and Deep Learning for Cyber.

Guoqiang Li is now an associate professor in the School of Software, Shanghai Jiao Tong University, and a guest associate professor at Kyushu University. He received the BS, MS and PhD degrees from Taiyuan University of Technology, Shanghai Jiao Tong University, and Japan Advanced Institute of Science and Technology in 2001, 2005 and 2008, respectively. He worked as a postdoctoral research fellow in the Graduate School of Information Science, Nagoya University, Japan, during 2008-2009, as an assistant professor in the School of Software, Shanghai Jiao Tong University, during 2009-2013, and as an academic visitor in the Department of Computer Science, University of Oxford, during 2015-2016. His research interests include formal verification, programming language theory, and data analytics and intelligence. He has published more than 40 research papers in international journals and mainstream conferences, including TDSC, SPE, TECS, IJFCS, SCIS, CSCW, ICSE, FORMATS, ATVA, etc.

Page 11: JOURNAL OF LA Easy-to-Deploy API Extraction by Multi-Level ... · sufficient high-quality training data for APIs of some less frequently discussed libraries or frameworks. Another

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 11

TABLE 8 Matplotlib (MPL) or NumPy (NP) rarr Pandas (PD)MPLrarrPD NPrarrPD PD

Prec Recall F1 Prec Recall F1 Prec Recall F111 8497 8962 8723 8686 8761 8723 8043 8530 828012 8618 8443 8530 8397 8300 8348 8801 7406 804314 8107 8761 8421 8107 8270 8188 7650 7695 767218 8754 8098 8413 8576 7637 8079 6930 8329 7565

116 8204 7896 8047 8245 7579 7898 8421 5072 5331132 8131 7522 7814 8143 7205 7645 8425 3084 4514DU 7165 4006 5139 7500 3545 4814

TABLE 9 Averaged Matplotlib (MPL) or NumPy(NP)rarrPandas (PD)

MPLrarrPD NPrarrPDPrec Recall F1 Prec Recall F1

116 8333 7781 8043 8339 7522 7909132 7950 7378 7653 8615 7115 7830

model directly without any fine-tuning (ie 01 target-library data) as a baseline We also randomly select 116 and132 training data for NumPy rarr Pandas and Matplotlib rarrPandas for 10 times and train the model for each time Thenwe calculate the averaged precision and recall values Basedon the averaged precision and recall values we calculate theaveraged F1-scoreResults Table 6 Table 7 and Table 8 show the experimentresults of the six pairs of transfer learning Table 9 showsthe averaged experiment results of NumPy rarr Pandas andMatplotlib rarr Pandas with 116 and 132 training dataAcronyms stand for MPL (Matplotlib) NP (NumPy) PD(Pandas) DU (Direct Use) The last column is the results oftraining the target-library model from scratch We can seethatbull Transfer learning can produce a better-performance target-

library model than training the target-library model fromscratch with the same proportion of training data This is ev-ident as the F1-score of the target-library model obtainedby fine-tuning the source-library-trained model is higherthan the F1-score of the target-library model trained fromscratch in all the transfer settings but PD rarr MPL at11 Especially for NumPy the NumPy model trainedfrom scratch with all Numpy training data has F1-score6974 However the NumPy model transferred from theMatplotlib model has F1-score 8168 (171 improvement)For the Matplotlib rarr Nump transfer even with only 116Numpy training data for fine-tuning the F1-score (7111) ofthe transferred NumPy model is still higher than the F1-score of the NumPy model trained from scratch with allNumpy training data This suggests that transfer learningmay boost the model performance even for the difficultdataset Such performance boost by transfer learning hasalso been observed in many studies [14] [44] [45] Thereason is that a target-library model can ldquoreuserdquo muchknowledge in the source-library model rather than hav-ing to learning completely new knowledge from scratch

bull Transfer learning can reduce the amount of target-library train-ing data required to obtain a high-quality model For four outof six transfer-learning pairs (NP rarr MPL MPL rarr NP NPrarr PD MPL rarr PD) the reduction ranges from 50 (12)to 875 (78) while the resulting target-library model stillhas better F1-score than the target-library model trainedfrom scratch using all target-library training data If we al-low a small degradation of F1-score (eg 3 of the F1-sore

for the target-library model trained from scratch usingall target-library training data) the reduction can go upto 938 (1516) For example for Matplotlib rarr Pandasusing 116 Pandas training data (ie only about 20 posts)for fine-tuning the obtained Pandas model still has F1-score 8047

bull Transfer learning is very effective in few-shot training settingsFew-shot training refers to training a model with onlya very small amount of data [46] In our experimentsusing 132 training data ie about 10 posts for transferlearning the F1-score of the obtained target-library modelis still improved a few points for NP rarr MPL PD rarr MPLand MPL rarr NP and is significantly improved for MPL rarrPD (529) and NP rarr PD (583) compared with directlyreusing source-library-trained model without fine-tuning(DU row) Furthermore the averaged results of NP rarrPD and MPL rarr PD for 10 randomly selected 116 and132 training data are similar to the results in Table 8and the variances are all smaller than 03 which showsour training data is unbiased Although the target-librarymodel trained from scratch still has reasonable F1-scorewith 12 to 18 training data they become completelyuseless with 116 or 132 training data In contrast thetarget-library models obtained through transfer learninghave significant better F1-score in such few shot settingsFurthermore training a model from scratch using few-shot data may result in abnormal increase in precision(eg MPL at 116 and 132 NP at 132 PD at 132) or inrecall (eg NP at 116) compared with training the modelfrom scratch using more data This abnormal increase inprecision (or recall) comes with a sharp decrease in recall(or precision) Our analysis shows that this phenomenonis caused by the biased training data in few-shot settingsIn contrast the target-library models obtained throughtransfer learning can reuse knowledge in the source modeland produce a much more balanced precision and recall(thus much better F1-score) even in the face of biased few-shot training data

bull The effectiveness of transfer learning does not correlate withthe quality of source-library-trained model Based on a not-so-high-quality source model (eg NumPy) we can still ob-tain a high-quality target model through transfer learningFor example NP rarr PD results in a Pandas model with F1-score gt 80 using only 18 of Pandas training data On theother hand a high-quality source model (eg Pandas) maynot boost the performance of the target model Amongall 36 transfer settings we have one such case ie PDrarr MPL at 11 The Matplotlib model transferred fromthe Pandas model using all Matplotlib training data isslightly worse than the Matplotlib model trained fromscratch using all training data This can be attributed tothe differences of API naming characteristics between thetwo libraries (see the last bullet)

bull The more similar the functionalities and the API-namingmention characteristics between the two libraries arethe more effectiveness the transfer learning can be Pan-das and NumPy have similar functionalities and API-namingmention characteristics For PD rarr NP and NPrarr PD the target-library model is less than 5 worse inF1-score with 18 target-library training data comparedwith the target-library model trained from scratch using

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 12

all training data In contrast PD rarr MPL has 69 dropin F1-score with 12 training data and NP rarr MPL has107 drop in F1-score with 14 training data comparedwith the Matplotlib model trained from scratch using alltraining data

bull Knowledge about non-polysemous APIs seems easy to expandwhile knowledge about polysemous APIs seems difficult toadapt Matplotlib has the least polysemous APIs (16)and Pandas and NumPy has much more polysemous APIsBoth MPL rarr PD and MPL rarr NP have high-qualitytarget models even with 18 target-library training dataThe transfer learning seems to be able to add the newknowledge ldquocommon words can be APIsrdquo to the target-library model In contrast NP rarr MPL requires at 12training data to have a high-quality target model (F1-score8316) and for PD rarr MPL the Matplotlib model (F1-score8058) by transfer learning with all training data is evenslightly worse than the model trained from scratch (F1-score 827) These results suggest that it can be difficultto adapt the knowledge ldquocommon words can be APIsrdquolearned from NumPy and Pandas to Matplotlib which doesnot have many common-word APIs

Transfer learning can effectively boost the performance of the target-library model with less demand on target-library training data. Its effectiveness correlates more with the similarity of library functionalities and API-naming/mention characteristics than with the initial quality of the source-library model.
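To make the fine-tuning procedure discussed above concrete, the sketch below shows one plausible way to adapt a source-library-trained model to a target library with only a handful of labeled posts. It is a minimal illustration, not the paper's actual implementation: the BiLSTMTagger class, its dimensions, the checkpoint file name and the dummy few-shot dataset are all hypothetical stand-ins.

    # Minimal sketch of source-to-target fine-tuning (hypothetical names throughout).
    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader, TensorDataset

    class BiLSTMTagger(nn.Module):
        """Stand-in sequence tagger; the paper's multi-level architecture is richer."""
        def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=200, num_tags=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden_dim, num_tags)

        def forward(self, tokens):
            hidden, _ = self.lstm(self.embed(tokens))
            return self.out(hidden)  # per-token API / non-API scores

    model = BiLSTMTagger()
    # 1. Start from the source-library weights instead of a random initialization.
    #    "numpy_source_model.pt" is an assumed checkpoint path.
    model.load_state_dict(torch.load("numpy_source_model.pt"))

    # 2. Continue training on the small target-library split (1/32, about 10 posts).
    #    Random tensors stand in for the real tokenized, labeled sentences.
    tokens = torch.randint(0, 10000, (10, 30))
    tags = torch.randint(0, 2, (10, 30))
    target_loader = DataLoader(TensorDataset(tokens, tags), batch_size=2)

    optimizer = optim.Adam(model.parameters(), lr=1e-4)  # Adam, as in the training setup [41]
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(5):
        for batch_tokens, batch_tags in target_loader:
            optimizer.zero_grad()
            logits = model(batch_tokens)
            loss = loss_fn(logits.view(-1, logits.size(-1)), batch_tags.view(-1))
            loss.backward()
            optimizer.step()

The key design point is that all layers keep their source-trained weights and are merely nudged by a small learning rate, which is what lets the balanced precision/recall of the source model survive biased few-shot target data.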

6.5 Effectiveness of Across-Language Transfer Learning (RQ4)

Motivation: In the RQ3, we investigate the transferability of the API extraction model across three Python libraries. In the RQ4, we want to further investigate the transferability of the API extraction model in a more challenging setting, i.e., across different programming languages and across libraries with very different functionalities.

Approach: For the source language, we choose Python, one of the most popular programming languages. We randomly select 200 posts in each dataset of the three Python libraries and combine these 600 posts as the training data for the Python API extraction model. For the target languages, we choose three other popular programming languages: Java, JavaScript and C. That is, we have three transfer-learning pairs in this RQ: Python → Java, Python → JavaScript and Python → C. For the three target languages, we intentionally choose libraries that support very different functionalities from the three Python libraries. For Java, we choose JDBC (an API for accessing relational databases). For JavaScript, we choose React (a library for web graphical user interfaces). For C, we choose OpenGL (a library for 3D graphics). As described in Section 6.1.2, we label 600 posts for each target-language library for the experiments. As in the RQ3, we use gradually-reduced target-language training data to fine-tune the source-language-trained model.

Results: Table 10, Table 11 and Table 12 show the experiment results for the three across-language transfer-learning pairs. These three tables use the same notation as Table 6, Table 7 and Table 8. We can see that:

• Directly reusing the source-language-trained model on the

target-language text produces unacceptable performance. For

TABLE 10: Python → Java

         Python→Java              Java
         Prec   Recall  F1        Prec   Recall  F1
  1/1    77.45  66.56   71.60     84.69  55.31   66.92
  1/2    72.20  62.50   67.06     77.38  53.44   63.22
  1/4    69.26  50.00   58.08     71.22  47.19   56.77
  1/8    50.00  64.06   56.16     75.15  39.69   51.94
  1/16   55.71  48.25   52.00     75.69  34.06   46.98
  1/32   56.99  40.00   44.83     77.89  23.12   35.66
  DU     44.44  28.75   34.91

TABLE 11: Python → JavaScript

         Python→JavaScript        JavaScript
         Prec   Recall  F1        Prec   Recall  F1
  1/1    77.45  81.75   86.19     87.95  78.17   82.77
  1/2    86.93  76.59   81.43     87.56  59.84   77.70
  1/4    86.84  68.65   74.24     83.08  66.26   73.72
  1/8    81.48  61.11   69.84     85.98  55.95   67.88
  1/16   71.11  63.49   68.08     87.38  38.48   53.44
  1/32   66.67  52.38   58.67     65.21  35.71   46.15
  DU     51.63  25.00   33.69

the three Python libraries, directly reusing a source-library-trained model on the text of a target library may still have reasonable performance, for example, NP → MPL (F1-score 69.16%), PD → MPL (64.0%) and MPL → NP (64.71%). However, for the three across-language transfer-learning pairs, the F1-score of direct model reuse is very low: Python → Java (34.91%), Python → JavaScript (33.69%) and Python → C (35.95%). These results suggest that there is still a certain level of commonality between libraries with similar functionalities in the same programming language, so that the knowledge learned in the source-library model can be largely reused in the target-library model. In contrast, libraries of different functionalities and programming languages have much fewer commonalities, which makes it infeasible to directly deploy a source-language-trained model to the text of another language. In such cases, we have to either train the model for each language or library from scratch, or exploit transfer learning to adapt a source-language-trained model to the target-language text.

• Across-language transfer learning exhibits the same performance characteristics as within-language transfer learning, but it demands more target-language training data to obtain a high-quality model than within-language transfer learning. For Python → JavaScript and Python → C, with as little as 1/16 target-language training data, we can obtain a fine-tuned target-language model whose F1-score doubles that of directly reusing the Python-trained model on JavaScript or C text. For Python → Java, fine-tuning the Python-trained model with 1/16 Java training data boosts the F1-score by 50%, compared with directly applying the Python-trained model to Java text. For all the transfer-learning settings in Table 10, Table 11 and Table 12, the fine-tuned target-language model always outperforms the corresponding target-language model trained from scratch with the same amount of target-language training data. However, it requires at least 1/2 target-language training data to fine-tune a target-language model to achieve the same level of performance as the target-language model trained from scratch with all target-


TABLE 12: Python → C

         Python→C                 C
         Prec   Recall  F1        Prec   Recall  F1
  1/1    89.87  85.09   87.47     85.83  83.74   84.77
  1/2    85.35  83.73   84.54     86.28  76.69   81.21
  1/4    83.83  75.88   79.65     82.19  71.27   76.34
  1/8    74.32  73.71   74.01     78.68  68.02   72.97
  1/16   75.81  69.45   72.60     88.52  57.10   69.42
  1/32   69.62  65.85   67.79     87.89  45.25   59.75
  DU     56.49  27.91   35.95

language training data. For within-language transfer learning, achieving this result may require as little as 1/8 target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.

6.6 Threats to Validity

The major threat to internal validity is the API labeling errors in the dataset. In order to decrease errors, the first two authors first independently label the same data and then resolve the disagreements by discussion. Furthermore, in our experiments we have to examine many API extraction results. We only spot a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API-related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from

Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs, such as camelcase, special characters (e.g., "." or "()") and API annotations.
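To illustrate this style of rule-based extraction, the following is a hedged sketch (not the rules of any specific cited tool) of a regular expression keyed on distinct orthographic features; the pattern and the example sentence are ours, invented for illustration.

    # A toy regex-based API extractor relying on orthographic features only.
    import re

    # Matches camelCase names, dotted qualified names, or calls suffixed with "()".
    API_PATTERN = re.compile(
        r"\b(?:[a-z]+[A-Z]\w*"        # camelCase, e.g., toString
        r"|\w+(?:\.\w+)+(?:\(\))?"    # qualified names, e.g., df.apply()
        r"|\w+\(\))"                  # bare calls, e.g., apply()
    )

    text = "You can use df.apply() here, but apply and series also work."
    print(API_PATTERN.findall(text))
    # ['df.apply()'] -- the simple names "apply" and "series" are missed,
    # which is exactly the failure mode reported for informal texts [2].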

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by "()", such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF-based approach for extracting mentions of software entities such as programming languages, libraries, computing concepts and APIs from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.
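As a concrete reference point for how character-level CNN features and word embeddings feed a bidirectional LSTM tagger in this line of work, the sketch below outlines a generic combination of these building blocks. It is an illustrative skeleton under assumed dimensions, not the exact architecture of this paper or of any cited work.

    # Generic char-CNN + word-embedding + BiLSTM tagger (illustrative dimensions).
    import torch
    from torch import nn

    class CharWordTagger(nn.Module):
        def __init__(self, n_chars=100, n_words=10000, n_tags=2,
                     char_dim=30, word_dim=100, conv_filters=50, hidden=200):
            super().__init__()
            self.char_embed = nn.Embedding(n_chars, char_dim)
            # Character-level CNN: convolve over each word's character sequence,
            # then max-pool to a fixed-size vector per word.
            self.char_conv = nn.Conv1d(char_dim, conv_filters, kernel_size=3, padding=1)
            self.word_embed = nn.Embedding(n_words, word_dim)
            self.lstm = nn.LSTM(word_dim + conv_filters, hidden,
                                bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_tags)

        def forward(self, words, chars):
            # words: (batch, seq)   chars: (batch, seq, max_word_len)
            b, s, w = chars.shape
            c = self.char_embed(chars.view(b * s, w)).transpose(1, 2)  # (b*s, char_dim, w)
            c = torch.relu(self.char_conv(c)).max(dim=2).values        # (b*s, conv_filters)
            char_feats = c.view(b, s, -1)
            x = torch.cat([self.word_embed(words), char_feats], dim=-1)
            hidden, _ = self.lstm(x)
            return self.out(hidden)  # per-token API / non-API scores

    # Shape check with dummy inputs.
    model = CharWordTagger()
    words = torch.randint(0, 10000, (2, 15))
    chars = torch.randint(0, 100, (2, 15, 12))
    print(model(words, chars).shape)  # torch.Size([2, 15, 2])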


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features outperform human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture on many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, "Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow."

[2] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, "Learning to extract API mentions from informal natural language discussions," in Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 2016, pp. 389–399.

[3] B. Dagenais and M. P. Robillard, "Recovering traceability links between an API and its learning resources," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 47–57.

[4] S. Subramanian, L. Inozemtseva, and R. Holmes, "Live API documentation," in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 643–652.

[5] C. Chen, Z. Xing, and X. Wang, "Unsupervised software-specific morphological forms inference from informal discussions," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 450–461.

[6] X. Chen, C. Chen, D. Zhang, and Z. Xing, "SEthesaurus: WordNet in software engineering," IEEE Transactions on Software Engineering, 2019.

[7] H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, "Improving API caveats accessibility by mining API caveats knowledge graph," in Software Maintenance and Evolution (ICSME), 2018 IEEE International Conference on. IEEE, 2018.

[8] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, "API method recommendation without worrying about the task-API knowledge gap," in Automated Software Engineering (ASE), 2018 33rd IEEE/ACM International Conference on. IEEE, 2018.

[9] B. Xu, Z. Xing, X. Xia, and D. Lo, "AnswerBot: Automated generation of answer summary to developers' technical questions," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 706–716.

[10] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre, "Software-specific named entity recognition in software engineering social content," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 90–101.

[11] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 – Volume 4. Association for Computational Linguistics, 2003, pp. 142–147.

[12] R. Caruana, "Learning many related tasks at the same time with backpropagation," in Advances in Neural Information Processing Systems, 1995, pp. 657–664.

[13] A. Arnold, R. Nallapati, and W. W. Cohen, "Exploiting feature hierarchy for transfer learning in named entity recognition," Proceedings of ACL-08: HLT, pp. 245–253, 2008.

[14] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.

[15] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, "APIReal: An API recognition and linking approach for online developer forums," Empirical Software Engineering, 2018.

[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78.

[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.

[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744–755.

[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.

[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.

[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions – incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338–348.

[22] C. Chen and Z. Xing, "SimilarTech: Automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834–839.

[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155–1194, 2019.

[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: Distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214–224.

[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177–2185.

[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for Twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555–1565.

[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.

[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: A deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.

[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414–421.

[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1–32, 2019.

[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.

[39] S. J. Pan, Q. Yang et al., "A survey on transfer learning."

[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering – Volume 1. ACM, 2010, pp. 375–384.

[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE'09. 16th Working Conference on. IEEE, 2009, pp. 205–214.

[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.

[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717–1724.

[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.

[47] A. Marcus, J. Maletic et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125–135.

[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59–68.

[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480–483.

[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307–318.

[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392–403.

[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832–841.

[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13–22.

[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476–479.

[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741–2749.

[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211–1220.

[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.

[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.

[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.

[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.

[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Faculty of Information Technology at Monash University. He has research interests in the areas of software engineering, deep learning and human-computer interaction. He is currently focusing on improving the usability and accessibility of mobile applications. He received the BS and MS degrees from Beijing Technology and Business University and the Australian National University in 2016 and 2018, respectively, and he will be a PhD student at Monash University in 2020 under the supervision of Chunyang Chen.

Zhenchang Xing is a Senior Lecturer in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012 to 2016. Before joining NTU, Dr. Xing was a Lee Kuan Yew Research Fellow in the School of Computing, National University of Singapore, from 2009 to 2012. Dr. Xing's current research area is in the interdisciplinary areas of software engineering, human-computer interaction and applied data analytics. Dr. Xing has over 100 publications in peer-refereed journals and conference proceedings and has received several distinguished paper awards from top software engineering conferences. Dr. Xing regularly serves on the organization and program committees of the top software engineering conferences, and he will be the program committee co-chair for ICSME 2020.

Chunyang Chen is a lecturer (Assistant Professor) in the Faculty of Information Technology, Monash University, Australia. His research focuses on software engineering, deep learning and human-computer interaction. He has published over 16 papers in refereed journals or conferences, including Empirical Software Engineering, ICSE, ASE, CSCW, ICSME and SANER. He is a member of IEEE and ACM. He received an ACM SIGSOFT Distinguished Paper Award in ASE 2018, a best paper award in SANER 2016, and a best tool demo award in ASE 2016. https://chunyang-chen.github.io


Cheng Chen received his Bachelor degree in Software Engineering from Northwest University (China) and received a Master degree of Computing (major in Artificial Intelligence) from the Australian National University. He did internship projects on natural language processing at CSIRO from 2016 to 2017. He worked as a research assistant supervised by Dr. Zhenchang Xing at ANU in 2018. Cheng currently works at PricewaterhouseCoopers (PwC) as a senior algorithm engineer in natural language processing and data analysis. Cheng is interested in named entity extraction, relation extraction, text summarization and parallel computing. He is working on knowledge engineering and transaction for NLP tasks.

Lizhen Qu is a research fellow in the Dialogue Research lab in the Faculty of Information Technology at Monash University. He has extensive research experience in the areas of natural language processing, multimodal learning, deep learning and cybersecurity. He is currently focusing on information extraction, semantic parsing and multimodal dialogue systems. Prior to joining Monash University, he worked as a research scientist at Data61/CSIRO, where he led and participated in several research and industrial projects, including Deep Learning for Text and Deep Learning for Cyber.

Guoqiang Li is now an associate professor in the School of Software, Shanghai Jiao Tong University, and a guest associate professor at Kyushu University. He received the BS, MS and PhD degrees from Taiyuan University of Technology, Shanghai Jiao Tong University, and Japan Advanced Institute of Science and Technology in 2001, 2005 and 2008, respectively. He worked as a postdoctoral research fellow in the Graduate School of Information Science, Nagoya University, Japan, during 2008–2009, as an assistant professor in the School of Software, Shanghai Jiao Tong University, during 2009–2013, and as an academic visitor in the Department of Computer Science, University of Oxford, during 2015–2016. His research interests include formal verification, programming language theory, and data analytics and intelligence. He has published more than 40 research papers in international journals and mainstream conferences, including TDSC, SPE, TECS, IJFCS, SCIS, CSCW, ICSE, FORMATS and ATVA.

Page 12: JOURNAL OF LA Easy-to-Deploy API Extraction by Multi-Level ... · sufficient high-quality training data for APIs of some less frequently discussed libraries or frameworks. Another

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 12

all training data In contrast PD rarr MPL has 69 dropin F1-score with 12 training data and NP rarr MPL has107 drop in F1-score with 14 training data comparedwith the Matplotlib model trained from scratch using alltraining data

bull Knowledge about non-polysemous APIs seems easy to expandwhile knowledge about polysemous APIs seems difficult toadapt Matplotlib has the least polysemous APIs (16)and Pandas and NumPy has much more polysemous APIsBoth MPL rarr PD and MPL rarr NP have high-qualitytarget models even with 18 target-library training dataThe transfer learning seems to be able to add the newknowledge ldquocommon words can be APIsrdquo to the target-library model In contrast NP rarr MPL requires at 12training data to have a high-quality target model (F1-score8316) and for PD rarr MPL the Matplotlib model (F1-score8058) by transfer learning with all training data is evenslightly worse than the model trained from scratch (F1-score 827) These results suggest that it can be difficultto adapt the knowledge ldquocommon words can be APIsrdquolearned from NumPy and Pandas to Matplotlib which doesnot have many common-word APIs

Transfer learning can effectively boost the performance of target-library model with less demand on target-library training dataIts effectiveness correlates more with the similarity of libraryfunctionalities and and API-namingmention characteristicsthan the initial quality of source-library model

65 Effectiveness of Across-Language Transfer Learn-ing (RQ4)Motivation In the RQ3 we investigate the transferabilityof API extraction model across three Python libraries Inthe RQ4 we want to further investigate the transferabilityof API extraction model in a more challenging setting ieacross different programming languages and across librarieswith very different functionalitiesApproach For the source language we choose Python oneof the most popular programming languages We randomlyselect 200 posts in each dataset of the three Python librariesand combine these 600 posts as the training data for PythonAPI extraction model For the target languages we choosethree other popular programming languages Java JavaScriptand C That is we have three transfer-learning pairs in thisRQ Python rarr Java Python rarr JavaScript and Python rarrC For the three target languages we intentionally choosethe libraries that support very different functionalities fromthe three Python libraries For Java we choose JDBC (anAPI for accessing relational database) For JavaScript wechoose React (a library for web graphical user interface)For C we choose OpenGL (a library for 3D graphics) Asdescribed in Section 612 we label 600 posts for each target-language library for the experiments As in the RQ3 weuse gradually-reduced target-language training data to fine-tune the source-language-trained modelResults Table 10 Table 11 and Table 12 show the experimentresults for the three across-language transfer-learning pairsThese three tables use the same notation as Table 6 Table 7and Table 8 We can see thatbull Directly reusing the source-language-trained model on the

target-language text produces unacceptable performance For

TABLE 10 Python rarr Java

PythonrarrJava JavaPrec Recall F1 Prec Recall F1

11 7745 6656 7160 8469 5531 669212 7220 6250 6706 7738 5344 632214 6926 5000 5808 7122 4719 567718 5000 6406 5616 7515 3969 5194

116 5571 4825 5200 7569 3406 4698132 5699 4000 4483 7789 2312 3566DU 4444 2875 3491

TABLE 11 Python rarr JavaScript

PythonrarrJavaScript JavaScriptPrec Recall F1 Prec Recall F1

11 7745 8175 8619 8795 7817 827712 8693 7659 8143 8756 5984 777014 8684 6865 7424 8308 6626 737218 8148 6111 6984 8598 5595 6788

116 7111 6349 6808 8738 3848 5344132 6667 5238 5867 6521 3571 4615DU 5163 2500 3369

the three Python libraries directly reusing a source-library-trained model on the text of a target-library maystill have reasonable performance for example NP rarrMPL (F1-score 6916) PD rarr MPL (640) MPL rarr NP(6471) However for the three across-language transfer-learning pairs the F1-score of direct model reuse is verylow Python rarr Java (3491) Python rarr JavaScript (3369)Python rarr C (3595) These results suggest that there arestill certain level of commonalities between the librarieswith similar functionalities and of the same programminglanguage so that the knowledge learned in the source-library model may be largely reused in the target-librarymodel In contrast the libraries of different functionali-ties and programming languages have much fewer com-monalities which makes it infeasible to directly deploya source-language-trained model to the text of anotherlanguage In such cases we have to either train the modelfor each language or library from scratch or we may ex-ploit transfer learning to adapt a source-language-trainedmodel to the target-language text

bull Across-language transfer learning holds the same performancecharacteristics as within-language transfer learning but itdemands more target-language training data to obtain a high-quality model than within-language transfer learning ForPython rarr JavaScript and Python rarr C with as minimumas 116 target-language training data we can obtaina fine-tuned target-language model whose F1-score candouble that of directly reusing the Python-trained modelto JavaScript or C text For Python rarr Java fine-tuning thePython-trained model with 116 Java training data booststhe F1-score by 50 compared with directly applying thePython-trained model to Java text For all the transfer-learning settings in Table 10 Table 11 and Table 12the fine-tuned target-language model always outperformsthe corresponding target-language model trained fromscratch with the same amount of target-language trainingdata However it requires at least 12 target-languagetraining data to fine-tune a target-language model toachieve the same level of performance as the target-language model trained from scratch with all target-

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 13

TABLE 12 Python rarr C

PythonrarrC CPrec Recall F1 Prec Recall F1

11 8987 8509 8747 8583 8374 847712 8535 8373 8454 8628 7669 812114 8383 7588 7965 8219 7127 763418 7432 7371 7401 7868 6802 7297

116 7581 6945 7260 8852 5710 6942132 6962 6585 6779 8789 4525 5975DU 5649 2791 3595

language training data For the within-language transferlearning achieving this result may require as minimum as18 target-library training data

Software text of different programming languages and of li-braries with very different functionalities increases the difficultyof transfer learning but transfer learning can still effectivelyboost the performance of the target-language model with 12less target-language training data compared with training thetarget-language model from scratch with all training data

66 Threats to Validity

The major threat to internal validity is the API labelingerrors in the dataset In order to decrease errors the first twoauthors first independently label the same data and thenresolve the disagreements by discussion Furthermore inour experiments we have to examine many API extractionresults We only spot a few instances of overlooked orerroneously-labeled API mentions Therefore the quality ofour dataset is trustworthy

The major threat to external validity is the generalizationof our results and findings Although we invest significanttime and effort to prepare datasets conduct experimentsand analyze results our experiments involve only six li-braries of four programming languages The performance ofour neural architecture and especially the findings on trans-fer learning could be different with other programminglanguages and libraries Furthermore the software text inall the experiments comes from a single data source ieStack Overflow Software text from other data sources mayexhibit different API-mention and discussion-context char-acteristics which may affect the model performance In thefuture we will reduce this threat by applying our approachto more languageslibraries and informal software text fromother data sources (eg developer emails)

7 RELATED WORK

APIs are the core resources for software development andAPI related knowledge is widely present in software textssuch as API reference documentation QampA discussionsbug reports Researchers have proposed many API extrac-tion methods to support various software engineering tasksespecially document traceability recovery [42] [47] [48][49] [50] For example RecoDoc [3] extracts Java APIsfrom several resources and then recover traceability acrossdifferent sources Subramanian et al [4] use code contextinformation to filter candidate APIs in a knowledge basefor an API mention in a partial code fragment Christophand Robillard [51] extracts sentences about API usage from

Stack Overflow and use them to augment API referencedocumentation These works focus mainly on the API link-ing task ie linking API mentions in text or code to someAPI entities in a knowledge base In terms of extracting APImentions in software text they rely on rule-based methodsfor example regular expressions of distinct orthographicfeatures of APIs such as camelcase special characters (eg or ()) and API annotations

Several studies such as Bacchelli et al [42] [43] and Yeet al [2] show that regular expressions of distinct ortho-graphic features are not reliable for API extraction tasks ininformal texts such as emails Stack Overflow posts Bothvariations of sentence formats and the wide presence ofmentions of polysemous API simple names pose a greatchallenge for API extraction in informal texts Howeverthis challenge is generally avoided by considering onlyAPI mentions with distinct orthographic features in existingworks [3] [51] [52]

Island parsing provides a more robust solution for ex-tracting API mentions from texts Using an island grammarwe can separate the textual content into constructs of in-terest (island) and the remainder (water) [53] For exampleBacchelli et al [54] uses island parsing to extract code frag-ments from natural language text Rigby and Robillard [52]also use island parser to identify code-like elements that canpotentially be APIs However these island parsers cannoteffectively deal with mentions of API simple names that arenot suffixed by () such as apply and series in Fig 1 whichare very common writing forms of API methods in StackOverflow discussions [2]

Recently Ye et al [10] proposes a CRF based approachfor extracting methods of software entities such as pro-gramming languages libraries computing concepts andAPIs from informal software texts They report that extractAPI mentions is much more challenging than extractingother types of software entities A follow-up work by Yeet al [2] proposes to use a combination of orthographicword-clusters and API gazetteer features to enhance theCRFrsquos performance on API extraction tasks Although theseproposed features are effective they require much manualeffort to develop and tune Furthermore enough trainingdata has to be prepared and manually labeled for applyingtheir approach to the software text of each library to beprocessed These overheads pose a practical limitation todeploying their approach to a larger number of libraries

Our neural architecture is inspired by recent advancesof neural network techniques for NLP tasks For exampleboth RNNs and CNNs have been used to embed character-level features for question answering [55] [56] machinetranslation [57] text classification [58] and part-of-speechtagging [59] Some researchers also use word embeddingsand LSTMs for NER [35] [60] [61] To the best of our knowl-edge our neural architecture is the first machine learningbased API extraction method that combines these proposalsand customize them based on the characteristics of soft-ware texts and API names Furthermore the design of ourneural architecture also takes into account the deploymentoverhead of the API methods for multiple programminglanguages and libraries which has never been explored forAPI extraction tasks

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 14

8 CONCULSION

This paper presents a novel multi-layer neural architecturethat can effectively learn important character- word- andsentence-level features from informal software texts for ex-tracting API mentions in text The learned features has supe-rior performance than human-defined orthographic featuresfor API extraction in informal software texts This makes ourneural architecture easy to deploy because the only input itrequires is the texts to be processed In contrast existingmachine learning based API extraction methods have touse additional hand-crafted features such as word clustersor API gazetteers in order to achieve the performanceclose to that of our neural architecture Furthermore as thefeatures are automatically learned from the input texts ourneural architecture is easy to transfer and fine-tune acrossprogramming languages and libraries We demonstrate itstransferability across three Python libraries and across fourprogramming languages Our neural architecture togetherwith transfer learning makes it easy to train and deploya high-quality API extraction model for multiple program-ming languages and libraries with much less overall effortrequired for preparing training data and effective featuresIn the future we will further investigate the performanceand the transferability of our neural architecture in manyother programming languages and libraries moving to-wards real-world deployment of machine learning basedAPI extraction methods

REFERENCES

[1] C Parnin C Treude L Grammel and M-A Storey ldquoCrowddocumentation Exploring the coverage and the dynamics of apidiscussions on stack overflowrdquo

[2] D Ye Z Xing C Y Foo J Li and N Kapre ldquoLearning to extractapi mentions from informal natural language discussionsrdquo in Soft-ware Maintenance and Evolution (ICSME) 2016 IEEE InternationalConference on IEEE 2016 pp 389ndash399

[3] B Dagenais and M P Robillard ldquoRecovering traceability linksbetween an api and its learning resourcesrdquo in Software Engineering(ICSE) 2012 34th International Conference on IEEE 2012 pp 47ndash57

[4] S Subramanian L Inozemtseva and R Holmes ldquoLive api doc-umentationrdquo in Proceedings of the 36th International Conference onSoftware Engineering ACM 2014 pp 643ndash652

[5] C Chen Z Xing and X Wang ldquoUnsupervised software-specificmorphological forms inference from informal discussionsrdquo in Pro-ceedings of the 39th International Conference on Software EngineeringIEEE Press 2017 pp 450ndash461

[6] X Chen C Chen D Zhang and Z Xing ldquoSethesaurus Wordnetin software engineeringrdquo IEEE Transactions on Software Engineer-ing 2019

[7] H Li S Li J Sun Z Xing X Peng M Liu and X Zhao ldquoIm-proving api caveats accessibility by mining api caveats knowledgegraphrdquo in Software Maintenance and Evolution (ICSME) 2018 IEEEInternational Conference on IEEE 2018

[8] Z X D L X W Qiao Huang Xin Xia ldquoApi method recom-mendation without worrying about the task-api knowledge gaprdquoin Automated Software Engineering (ASE) 2018 33th IEEEACMInternational Conference on IEEE 2018

[9] B Xu Z Xing X Xia and D Lo ldquoAnswerbot automated gen-eration of answer summary to developersz technical questionsrdquoin Proceedings of the 32nd IEEEACM International Conference onAutomated Software Engineering IEEE Press 2017 pp 706ndash716

[10] D Ye Z Xing C Y Foo Z Q Ang J Li and N Kapre ldquoSoftware-specific named entity recognition in software engineering socialcontentrdquo in 2016 IEEE 23rd International Conference on SoftwareAnalysis Evolution and Reengineering (SANER) vol 1 IEEE 2016pp 90ndash101

[11] E F Tjong Kim Sang and F De Meulder ldquoIntroduction to the conll-2003 shared task Language-independent named entity recogni-tionrdquo in Proceedings of the seventh conference on Natural languagelearning at HLT-NAACL 2003-Volume 4 Association for Compu-tational Linguistics 2003 pp 142ndash147

[12] R Caruana ldquoLearning many related tasks at the same time withbackpropagationrdquo in Advances in neural information processing sys-tems 1995 pp 657ndash664

[13] A Arnold R Nallapati and W W Cohen ldquoExploiting featurehierarchy for transfer learning in named entity recognitionrdquo Pro-ceedings of ACL-08 HLT pp 245ndash253 2008

[14] J Yosinski J Clune Y Bengio and H Lipson ldquoHow transferableare features in deep neural networksrdquo in Advances in neuralinformation processing systems 2014 pp 3320ndash3328

[15] Z X Deheng Ye Lingfeng Bao and S-W Lin ldquoApireal An apirecognition and linking approach for online developer forumsrdquoEmpirical Software Engineering 2018 2018

[16] C dos Santos and M Gatti ldquoDeep convolutional neural networksfor sentiment analysis of short textsrdquo in Proceedings of COLING2014 the 25th International Conference on Computational LinguisticsTechnical Papers 2014 pp 69ndash78

[17] C D Santos and B Zadrozny ldquoLearning character-level repre-sentations for part-of-speech taggingrdquo in Proceedings of the 31stInternational Conference on Machine Learning (ICML-14) 2014 pp1818ndash1826

[18] G Chen C Chen Z Xing and B Xu ldquoLearning a dual-languagevector space for domain-specific cross-lingual question retrievalrdquoin 2016 31st IEEEACM International Conference on Automated Soft-ware Engineering (ASE) IEEE 2016 pp 744ndash755

[19] C Chen X Chen J Sun Z Xing and G Li ldquoData-driven proac-tive policy assurance of post quality in community qampa sitesrdquoProceedings of the ACM on human-computer interaction vol 2 noCSCW p 33 2018

[20] I Bazzi ldquoModelling out-of-vocabulary words for robust speechrecognitionrdquo PhD dissertation Massachusetts Institute of Tech-nology 2002

[21] C Chen S Gao and Z Xing ldquoMining analogical libraries in qampadiscussionsndashincorporating relational and categorical knowledgeinto word embeddingrdquo in 2016 IEEE 23rd international conferenceon software analysis evolution and reengineering (SANER) vol 1IEEE 2016 pp 338ndash348

[22] C Chen and Z Xing ldquoSimilartech automatically recommendanalogical libraries across different programming languagesrdquo in2016 31st IEEEACM International Conference on Automated SoftwareEngineering (ASE) IEEE 2016 pp 834ndash839

[23] C Chen Z Xing and Y Liu ldquoWhats spains paris mining analogi-cal libraries from qampa discussionsrdquo Empirical Software Engineeringvol 24 no 3 pp 1155ndash1194 2019

[24] Y Huang C Chen Z Xing T Lin and Y Liu ldquoTell them apartdistilling technology differences from crowd-scale comparisondiscussionsrdquo in ASE 2018 pp 214ndash224

[25] O Levy and Y Goldberg ldquoNeural word embedding as implicitmatrix factorizationrdquo in Advances in neural information processingsystems 2014 pp 2177ndash2185

[26] D Tang F Wei N Yang M Zhou T Liu and B Qin ldquoLearningsentiment-specific word embedding for twitter sentiment classifi-cationrdquo in Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) vol 1 2014pp 1555ndash1565

[27] O Levy Y Goldberg and I Dagan ldquoImproving distributional sim-ilarity with lessons learned from word embeddingsrdquo Transactionsof the Association for Computational Linguistics vol 3 pp 211ndash2252015

[28] J Pennington R Socher and C Manning ldquoGlove Global vectorsfor word representationrdquo in Proceedings of the 2014 conference onempirical methods in natural language processing (EMNLP) 2014 pp1532ndash1543

[29] I Sutskever O Vinyals and Q V Le ldquoSequence to sequencelearning with neural networksrdquo in Advances in neural informationprocessing systems 2014 pp 3104ndash3112

[30] C Chen Z Xing and Y Liu ldquoBy the community amp for the com-munity a deep learning approach to assist collaborative editing inqampa sitesrdquo Proceedings of the ACM on Human-Computer Interactionvol 1 no CSCW p 32 2017

[31] S Gao C Chen Z Xing Y Ma W Song and S-W Lin ldquoA neuralmodel for method name generation from functional descriptionrdquo

JOURNAL OF LATEX CLASS FILES VOL 14 NO 8 AUGUST 2015 15

in 2019 IEEE 26th International Conference on Software AnalysisEvolution and Reengineering (SANER) IEEE 2019 pp 414ndash421

[32] X Wang C Chen and Z Xing ldquoDomain-specific machine trans-lation with recurrent neural network for software localizationrdquoEmpirical Software Engineering pp 1ndash32 2019

[33] C Chen Z Xing Y Liu and K L X Ong ldquoMining likelyanalogical apis across third-party libraries via large-scale unsu-pervised api semantics embeddingrdquo IEEE Transactions on SoftwareEngineering 2019

[34] M Schuster and K K Paliwal ldquoBidirectional recurrent neuralnetworksrdquo IEEE Transactions on Signal Processing vol 45 no 11pp 2673ndash2681 1997

[35] Z Huang W Xu and K Yu ldquoBidirectional lstm-crf models forsequence taggingrdquo arXiv preprint arXiv150801991 2015

[36] S Hochreiter and J Schmidhuber ldquoLong short-term memoryrdquoNeural computation vol 9 no 8 pp 1735ndash1780 1997

[37] H Sak A Senior and F Beaufays ldquoLong short-term memoryrecurrent neural network architectures for large scale acousticmodelingrdquo in Fifteenth annual conference of the international speechcommunication association 2014

[38] Y Bengio ldquoDeep learning of representations for unsupervised andtransfer learningrdquo in Proceedings of ICML Workshop on Unsupervisedand Transfer Learning 2012 pp 17ndash36

[39] S J Pan Q Yang et al ldquoA survey on transfer learningrdquo[40] N Srivastava G Hinton A Krizhevsky I Sutskever and

R Salakhutdinov ldquoDropout A simple way to prevent neural net-works from overfittingrdquo The Journal of Machine Learning Researchvol 15 no 1 pp 1929ndash1958 2014

[41] D P Kingma and J Ba ldquoAdam A method for stochastic optimiza-tionrdquo arXiv preprint arXiv14126980 2014

[42] A Bacchelli M Lanza and R Robbes ldquoLinking e-mails andsource code artifactsrdquo in Proceedings of the 32nd ACMIEEE Inter-national Conference on Software Engineering-Volume 1 ACM 2010pp 375ndash384

[43] A Bacchelli M DrsquoAmbros M Lanza and R Robbes ldquoBench-marking lightweight techniques to link e-mails and source coderdquoin Reverse Engineering 2009 WCRErsquo09 16th Working Conference onIEEE 2009 pp 205ndash214

[44] S J Pan and Q Yang ldquoA survey on transfer learningrdquo IEEETransactions on knowledge and data engineering vol 22 no 10 pp1345ndash1359 2010

[45] M Oquab L Bottou I Laptev and J Sivic ldquoLearning and transfer-ring mid-level image representations using convolutional neuralnetworksrdquo in Computer Vision and Pattern Recognition (CVPR) 2014IEEE Conference on IEEE 2014 pp 1717ndash1724

[46] S Ravi and H Larochelle ldquoOptimization as a model for few-shotlearningrdquo 2016

[47] A Marcus J Maletic et al ldquoRecovering documentation-to-source-code traceability links using latent semantic indexingrdquo in Soft-ware Engineering 2003 Proceedings 25th International Conference onIEEE 2003 pp 125ndash135

[48] H-Y Jiang T N Nguyen X Chen H Jaygarl and C K ChangldquoIncremental latent semantic indexing for automatic traceabil-ity link evolution managementrdquo in Proceedings of the 2008 23rdIEEEACM International Conference on Automated Software Engineer-ing IEEE Computer Society 2008 pp 59ndash68

[49] W Zheng Q Zhang and M Lyu ldquoCross-library api recommen-dation using web search enginesrdquo in Proceedings of the 19th ACMSIGSOFT symposium and the 13th European conference on Foundationsof software engineering ACM 2011 pp 480ndash483

[50] Q Gao H Zhang J Wang Y Xiong L Zhang and H Mei ldquoFixingrecurring crash bugs via analyzing qampa sites (t)rdquo in AutomatedSoftware Engineering (ASE) 2015 30th IEEEACM International Con-ference on IEEE 2015 pp 307ndash318

[51] C Treude and M P Robillard ldquoAugmenting api documentationwith insights from stack overflowrdquo in Software Engineering (ICSE)2016 IEEEACM 38th International Conference on IEEE 2016 pp392ndash403

[52] P C Rigby and M P Robillard ldquoDiscovering essential codeelements in informal documentationrdquo in Proceedings of the 2013International Conference on Software Engineering IEEE Press 2013pp 832ndash841

[53] L Moonen ldquoGenerating robust parsers using island grammarsrdquoin Reverse Engineering 2001 Proceedings Eighth Working Conferenceon IEEE 2001 pp 13ndash22

[54] A Bacchelli A Cleve M Lanza and A Mocci ldquoExtracting struc-tured data from natural language documents with island parsingrdquo

in Automated Software Engineering (ASE) 2011 26th IEEEACMInternational Conference on IEEE 2011 pp 476ndash479

[55] Y Kim Y Jernite D Sontag and A M Rush ldquoCharacter-awareneural language modelsrdquo in AAAI 2016 pp 2741ndash2749


TABLE 12
Python → C

Data   | Python→C             | C
       | Prec   Recall  F1    | Prec   Recall  F1
1/1    | 89.87  85.09   87.47 | 85.83  83.74   84.77
1/2    | 85.35  83.73   84.54 | 86.28  76.69   81.21
1/4    | 83.83  75.88   79.65 | 82.19  71.27   76.34
1/8    | 74.32  73.71   74.01 | 78.68  68.02   72.97
1/16   | 75.81  69.45   72.60 | 88.52  57.10   69.42
1/32   | 69.62  65.85   67.79 | 87.89  45.25   59.75
DU     | 56.49  27.91   35.95 |

language training data. For the within-language transfer learning, achieving this result may require as little as 1/8 of the target-library training data.

Software text of different programming languages and of libraries with very different functionalities increases the difficulty of transfer learning, but transfer learning can still effectively boost the performance of the target-language model with 1/2 less target-language training data, compared with training the target-language model from scratch with all training data.
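
As a rough illustration of the transfer procedure summarized above, the sketch below initializes a target-library model from source-library weights and fine-tunes it on a fraction of the labeled target data. The model class ApiExtractor, its loss method, and the checkpoint path are hypothetical placeholders, not the authors' released artifacts; optimizer and hyperparameters are likewise illustrative.

```python
import torch
from torch.utils.data import DataLoader, Subset

def fine_tune_on_target(source_ckpt, target_dataset, fraction=0.125,
                        epochs=10, lr=1e-4):
    """Initialize from a source-library model and fine-tune on a
    fraction (e.g., 1/8) of the target-library training data."""
    model = ApiExtractor()  # hypothetical: same architecture as the source model
    model.load_state_dict(torch.load(source_ckpt))  # transfer all learned layers

    # Use only a fraction of the labeled target-library data.
    n = int(len(target_dataset) * fraction)
    loader = DataLoader(Subset(target_dataset, range(n)),
                        batch_size=32, shuffle=True)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for tokens, labels in loader:
            optimizer.zero_grad()
            loss = model.loss(tokens, labels)  # hypothetical sequence-labeling loss
            loss.backward()
            optimizer.step()
    return model
```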

6.6 Threats to Validity

The major threat to internal validity is the API labeling errors in the dataset. In order to decrease errors, the first two authors first independently label the same data and then resolve the disagreements by discussion. Furthermore, in our experiments we have to examine many API extraction results, and we spot only a few instances of overlooked or erroneously-labeled API mentions. Therefore, the quality of our dataset is trustworthy.

The major threat to external validity is the generalization of our results and findings. Although we invest significant time and effort to prepare datasets, conduct experiments and analyze results, our experiments involve only six libraries of four programming languages. The performance of our neural architecture, and especially the findings on transfer learning, could be different with other programming languages and libraries. Furthermore, the software text in all the experiments comes from a single data source, i.e., Stack Overflow. Software text from other data sources may exhibit different API-mention and discussion-context characteristics, which may affect the model performance. In the future, we will reduce this threat by applying our approach to more languages/libraries and to informal software text from other data sources (e.g., developer emails).

7 RELATED WORK

APIs are the core resources for software development, and API related knowledge is widely present in software texts such as API reference documentation, Q&A discussions and bug reports. Researchers have proposed many API extraction methods to support various software engineering tasks, especially document traceability recovery [42], [47], [48], [49], [50]. For example, RecoDoc [3] extracts Java APIs from several resources and then recovers traceability across different sources. Subramanian et al. [4] use code context information to filter candidate APIs in a knowledge base for an API mention in a partial code fragment. Treude and Robillard [51] extract sentences about API usage from

Stack Overflow and use them to augment API reference documentation. These works focus mainly on the API linking task, i.e., linking API mentions in text or code to some API entities in a knowledge base. In terms of extracting API mentions in software text, they rely on rule-based methods, for example, regular expressions of distinct orthographic features of APIs, such as camel case, special characters (e.g., "." or "()") and API annotations.
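
For illustration, a minimal regex-based extractor of the kind these works rely on might look as follows; the patterns here are illustrative, not the exact rules used in [3], [51], [52]. It matches camel-case names, qualified names and ()-suffixed calls, but, as discussed next, it cannot recognize bare simple names such as apply.

```python
import re

# Illustrative orthographic patterns, not the exact rules of prior work:
# camelCase identifiers, dotted qualified names, and ()-suffixed calls.
API_PATTERN = re.compile(
    r"\b[a-z]+[A-Z]\w*\b"          # camelCase, e.g., getElementById
    r"|\b\w+(?:\.\w+)+(?:\(\))?"   # qualified names, e.g., df.apply()
    r"|\b\w+\(\)"                  # ()-suffixed calls, e.g., apply()
)

def extract_api_mentions(sentence):
    return API_PATTERN.findall(sentence)

# A polysemous simple name without distinct orthographic features is missed:
print(extract_api_mentions("you can apply a function with df.apply()"))
# -> ['df.apply()']  (the bare 'apply' is indistinguishable from the verb)
```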

Several studies, such as Bacchelli et al. [42], [43] and Ye et al. [2], show that regular expressions of distinct orthographic features are not reliable for API extraction tasks in informal texts such as emails and Stack Overflow posts. Both the variations of sentence formats and the wide presence of mentions of polysemous API simple names pose a great challenge for API extraction in informal texts. However, this challenge is generally avoided by considering only API mentions with distinct orthographic features in existing works [3], [51], [52].

Island parsing provides a more robust solution for extracting API mentions from texts. Using an island grammar, we can separate the textual content into constructs of interest (island) and the remainder (water) [53]. For example, Bacchelli et al. [54] use island parsing to extract code fragments from natural language text. Rigby and Robillard [52] also use an island parser to identify code-like elements that can potentially be APIs. However, these island parsers cannot effectively deal with mentions of API simple names that are not suffixed by (), such as apply and series in Fig. 1, which are very common writing forms of API methods in Stack Overflow discussions [2].

Recently, Ye et al. [10] proposed a CRF based approach for extracting mentions of software entities, such as programming languages, libraries, computing concepts and APIs, from informal software texts. They report that extracting API mentions is much more challenging than extracting other types of software entities. A follow-up work by Ye et al. [2] proposes to use a combination of orthographic, word-cluster and API gazetteer features to enhance the CRF's performance on API extraction tasks. Although these proposed features are effective, they require much manual effort to develop and tune. Furthermore, enough training data has to be prepared and manually labeled for applying their approach to the software text of each library to be processed. These overheads pose a practical limitation to deploying their approach to a larger number of libraries.
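
To make this manual effort concrete, a hand-crafted feature function in the spirit of [2], [10] might look as follows; this is a sketch in the sklearn-crfsuite convention of per-token feature dicts, and the gazetteer and word-cluster inputs are assumed to be pre-built per library, which is precisely the deployment overhead discussed above.

```python
def token_features(sent, i, api_gazetteer, word_clusters):
    """Per-token CRF features (sklearn-crfsuite dict convention).
    api_gazetteer and word_clusters must be built per library in
    advance; they are assumed inputs here, not computed by this code."""
    w = sent[i]
    feats = {
        "lower": w.lower(),
        "is_camel_case": w != w.lower() and w != w.upper(),
        "has_parens": w.endswith("()"),
        "has_dot": "." in w,
        "suffix3": w[-3:],
        "in_gazetteer": w.strip("()").lower() in api_gazetteer,
        "cluster": word_clusters.get(w.lower(), "UNK"),
    }
    # Context features from the neighboring tokens.
    if i > 0:
        feats["prev_lower"] = sent[i - 1].lower()
    if i < len(sent) - 1:
        feats["next_lower"] = sent[i + 1].lower()
    return feats
```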

Our neural architecture is inspired by recent advances of neural network techniques for NLP tasks. For example, both RNNs and CNNs have been used to embed character-level features for question answering [55], [56], machine translation [57], text classification [58] and part-of-speech tagging [59]. Some researchers also use word embeddings and LSTMs for NER [35], [60], [61]. To the best of our knowledge, our neural architecture is the first machine learning based API extraction method that combines these proposals and customizes them based on the characteristics of software texts and API names. Furthermore, the design of our neural architecture also takes into account the deployment overhead of API extraction methods for multiple programming languages and libraries, which has never been explored for API extraction tasks.
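
As a rough sketch of how such components compose, a character-level CNN over each token can be concatenated with a word embedding and fed to a bidirectional LSTM tagger; the layer sizes and hyperparameters below are illustrative placeholders, not the configuration reported in this paper.

```python
import torch
import torch.nn as nn

class CharWordBiLSTMTagger(nn.Module):
    """Illustrative composition of character-, word- and sentence-level
    layers; sizes are placeholders, not this paper's configuration."""
    def __init__(self, n_chars, n_words, n_tags,
                 char_dim=30, char_filters=30, word_dim=100, hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Character-level CNN: convolve over each token's character sequence.
        self.char_cnn = nn.Conv1d(char_dim, char_filters,
                                  kernel_size=3, padding=1)
        self.word_emb = nn.Embedding(n_words, word_dim)
        # Sentence-level BiLSTM over the concatenated token representations.
        self.lstm = nn.LSTM(word_dim + char_filters, hidden,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        # char_ids: (batch, seq_len, max_word_len)
        b, s, c = char_ids.shape
        ch = self.char_emb(char_ids.view(b * s, c)).transpose(1, 2)
        ch = torch.relu(self.char_cnn(ch)).max(dim=2).values  # max-pool
        ch = ch.view(b, s, -1)
        x = torch.cat([self.word_emb(word_ids), ch], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)  # per-token tag scores (e.g., API vs. non-API)
```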


8 CONCLUSION

This paper presents a novel multi-layer neural architecture that can effectively learn important character-, word- and sentence-level features from informal software texts for extracting API mentions in text. The learned features have superior performance over human-defined orthographic features for API extraction in informal software texts. This makes our neural architecture easy to deploy, because the only input it requires is the text to be processed. In contrast, existing machine learning based API extraction methods have to use additional hand-crafted features, such as word clusters or API gazetteers, in order to achieve performance close to that of our neural architecture. Furthermore, as the features are automatically learned from the input texts, our neural architecture is easy to transfer and fine-tune across programming languages and libraries. We demonstrate its transferability across three Python libraries and across four programming languages. Our neural architecture, together with transfer learning, makes it easy to train and deploy a high-quality API extraction model for multiple programming languages and libraries, with much less overall effort required for preparing training data and effective features. In the future, we will further investigate the performance and the transferability of our neural architecture in many other programming languages and libraries, moving towards real-world deployment of machine learning based API extraction methods.

REFERENCES

[1] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, "Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow."

[2] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre, "Learning to extract API mentions from informal natural language discussions," in Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 2016, pp. 389–399.

[3] B. Dagenais and M. P. Robillard, "Recovering traceability links between an API and its learning resources," in Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 2012, pp. 47–57.

[4] S. Subramanian, L. Inozemtseva, and R. Holmes, "Live API documentation," in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 643–652.

[5] C. Chen, Z. Xing, and X. Wang, "Unsupervised software-specific morphological forms inference from informal discussions," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 450–461.

[6] X. Chen, C. Chen, D. Zhang, and Z. Xing, "SEthesaurus: WordNet in software engineering," IEEE Transactions on Software Engineering, 2019.

[7] H. Li, S. Li, J. Sun, Z. Xing, X. Peng, M. Liu, and X. Zhao, "Improving API caveats accessibility by mining API caveats knowledge graph," in Software Maintenance and Evolution (ICSME), 2018 IEEE International Conference on. IEEE, 2018.

[8] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, "API method recommendation without worrying about the task-API knowledge gap," in Automated Software Engineering (ASE), 2018 33rd IEEE/ACM International Conference on. IEEE, 2018.

[9] B. Xu, Z. Xing, X. Xia, and D. Lo, "AnswerBot: automated generation of answer summary to developers' technical questions," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 2017, pp. 706–716.

[10] D. Ye, Z. Xing, C. Y. Foo, Z. Q. Ang, J. Li, and N. Kapre, "Software-specific named entity recognition in software engineering social content," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 90–101.

[11] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4. Association for Computational Linguistics, 2003, pp. 142–147.

[12] R. Caruana, "Learning many related tasks at the same time with backpropagation," in Advances in Neural Information Processing Systems, 1995, pp. 657–664.

[13] A. Arnold, R. Nallapati, and W. W. Cohen, "Exploiting feature hierarchy for transfer learning in named entity recognition," Proceedings of ACL-08: HLT, pp. 245–253, 2008.

[14] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.

[15] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, "APIReal: an API recognition and linking approach for online developer forums," Empirical Software Engineering, 2018.

[16] C. dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 69–78.

[17] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.

[18] G. Chen, C. Chen, Z. Xing, and B. Xu, "Learning a dual-language vector space for domain-specific cross-lingual question retrieval," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 744–755.

[19] C. Chen, X. Chen, J. Sun, Z. Xing, and G. Li, "Data-driven proactive policy assurance of post quality in community Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 33, 2018.

[20] I. Bazzi, "Modelling out-of-vocabulary words for robust speech recognition," Ph.D. dissertation, Massachusetts Institute of Technology, 2002.

[21] C. Chen, S. Gao, and Z. Xing, "Mining analogical libraries in Q&A discussions – incorporating relational and categorical knowledge into word embedding," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 338–348.

[22] C. Chen and Z. Xing, "SimilarTech: automatically recommend analogical libraries across different programming languages," in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2016, pp. 834–839.

[23] C. Chen, Z. Xing, and Y. Liu, "What's Spain's Paris? Mining analogical libraries from Q&A discussions," Empirical Software Engineering, vol. 24, no. 3, pp. 1155–1194, 2019.

[24] Y. Huang, C. Chen, Z. Xing, T. Lin, and Y. Liu, "Tell them apart: distilling technology differences from crowd-scale comparison discussions," in ASE, 2018, pp. 214–224.

[25] O. Levy and Y. Goldberg, "Neural word embedding as implicit matrix factorization," in Advances in Neural Information Processing Systems, 2014, pp. 2177–2185.

[26] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for twitter sentiment classification," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 1555–1565.

[27] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015.

[28] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[29] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[30] C. Chen, Z. Xing, and Y. Liu, "By the community & for the community: a deep learning approach to assist collaborative editing in Q&A sites," Proceedings of the ACM on Human-Computer Interaction, vol. 1, no. CSCW, p. 32, 2017.

[31] S. Gao, C. Chen, Z. Xing, Y. Ma, W. Song, and S.-W. Lin, "A neural model for method name generation from functional description," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2019, pp. 414–421.

[32] X. Wang, C. Chen, and Z. Xing, "Domain-specific machine translation with recurrent neural network for software localization," Empirical Software Engineering, pp. 1–32, 2019.

[33] C. Chen, Z. Xing, Y. Liu, and K. L. X. Ong, "Mining likely analogical APIs across third-party libraries via large-scale unsupervised API semantics embedding," IEEE Transactions on Software Engineering, 2019.

[34] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[35] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.

[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[38] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.

[39] S. J. Pan, Q. Yang et al., "A survey on transfer learning."

[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[42] A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. ACM, 2010, pp. 375–384.

[43] A. Bacchelli, M. D'Ambros, M. Lanza, and R. Robbes, "Benchmarking lightweight techniques to link e-mails and source code," in Reverse Engineering, 2009. WCRE'09. 16th Working Conference on. IEEE, 2009, pp. 205–214.

[44] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.

[45] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1717–1724.

[46] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.

[47] A. Marcus, J. Maletic et al., "Recovering documentation-to-source-code traceability links using latent semantic indexing," in Software Engineering, 2003. Proceedings. 25th International Conference on. IEEE, 2003, pp. 125–135.

[48] H.-Y. Jiang, T. N. Nguyen, X. Chen, H. Jaygarl, and C. K. Chang, "Incremental latent semantic indexing for automatic traceability link evolution management," in Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, 2008, pp. 59–68.

[49] W. Zheng, Q. Zhang, and M. Lyu, "Cross-library API recommendation using web search engines," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 2011, pp. 480–483.

[50] Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 307–318.

[51] C. Treude and M. P. Robillard, "Augmenting API documentation with insights from Stack Overflow," in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 392–403.

[52] P. C. Rigby and M. P. Robillard, "Discovering essential code elements in informal documentation," in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 832–841.

[53] L. Moonen, "Generating robust parsers using island grammars," in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13–22.

[54] A. Bacchelli, A. Cleve, M. Lanza, and A. Mocci, "Extracting structured data from natural language documents with island parsing," in Automated Software Engineering (ASE), 2011 26th IEEE/ACM International Conference on. IEEE, 2011, pp. 476–479.

[55] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in AAAI, 2016, pp. 2741–2749.

[56] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, "Neural network-based question answering over knowledge graphs on word and character level," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 1211–1220.

[57] W. Ling, I. Trancoso, C. Dyer, and A. W. Black, "Character-based neural machine translation," arXiv preprint arXiv:1511.04586, 2015.

[58] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.

[59] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.

[60] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv preprint arXiv:1511.08308, 2015.

[61] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv preprint arXiv:1603.01360, 2016.

Suyu Ma is a research assistant in the Faculty of Information Technology at Monash University. He has research interests in the areas of software engineering, deep learning and human-computer interaction. He is currently focusing on improving the usability and accessibility of mobile applications. He received the BS and MS degrees from Beijing Technology and Business University and the Australian National University in 2016 and 2018, respectively. He will be a PhD student at Monash University in 2020, under the supervision of Chunyang Chen.

Zhenchang Xing is a Senior Lecturer in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012-2016. Before joining NTU, Dr. Xing was a Lee Kuan Yew Research Fellow in the School of Computing, National University of Singapore, from 2009-2012. Dr. Xing's current research area is in the interdisciplinary areas of software engineering, human-computer interaction and applied data analytics. Dr. Xing has over 100 publications in peer-refereed journals and conference proceedings and has received several distinguished paper awards from top software engineering conferences. Dr. Xing regularly serves on the organization and program committees of the top software engineering conferences, and he will be the program committee co-chair for ICSME 2020.

Chunyang Chen is a lecturer (Assistant Professor) in the Faculty of Information Technology, Monash University, Australia. His research focuses on software engineering, deep learning and human-computer interaction. He has published over 16 papers in refereed journals or conferences, including Empirical Software Engineering, ICSE, ASE, CSCW, ICSME and SANER. He is a member of IEEE and ACM. He received the ACM SIGSOFT distinguished paper award in ASE 2018, the best paper award in SANER 2016, and the best tool demo in ASE 2016. https://chunyang-chen.github.io


Cheng Chen received his Bachelor degree in Software Engineering from Northwest University (China) and his Master degree of Computing (major in Artificial Intelligence) from the Australian National University. He did some internship projects about Natural Language Processing at CSIRO from 2016 to 2017. He worked as a research assistant supervised by Dr. Zhenchang Xing at ANU in 2018. Cheng currently works at PricewaterhouseCoopers (PwC) as a senior algorithm engineer of Natural Language Processing and Data Analyzing. Cheng is interested in Named Entity Extraction, Relation Extraction, Text Summarization and parallel computing. He is working on knowledge engineering and transaction for NLP tasks.

Lizhen Qu is a research fellow in the Dialogue Research lab in the Faculty of Information Technology at Monash University. He has extensive research experience in the areas of natural language processing, multimodal learning, deep learning and cybersecurity. He is currently focusing on information extraction, semantic parsing and multimodal dialogue systems. Prior to joining Monash University, he worked as a research scientist at Data61/CSIRO, where he led and participated in several research and industrial projects, including Deep Learning for Text and Deep Learning for Cyber.

Guoqiang Li is now an associate professor in the School of Software, Shanghai Jiao Tong University, and a guest associate professor at Kyushu University. He received the BS, MS and PhD degrees from Taiyuan University of Technology, Shanghai Jiao Tong University and Japan Advanced Institute of Science and Technology in 2001, 2005 and 2008, respectively. He worked as a postdoctoral research fellow in the Graduate School of Information Science, Nagoya University, Japan, during 2008-2009, as an assistant professor in the School of Software, Shanghai Jiao Tong University, during 2009-2013, and as an academic visitor in the Department of Computer Science, University of Oxford, during 2015-2016. His research interests include formal verification, programming language theory, and data analytics and intelligence. He has published more than 40 research papers in international journals and mainstream conferences, including TDSC, SPE, TECS, IJFCS, SCIS, CSCW, ICSE, FORMATS, ATVA, etc.
