LINSEN: AN EFFICIENT APPROACH TO SPLIT IDENTIFIERS AND EXPAND ABBREVIATIONS
Anna Corazza, Sergio Di Martino, Valerio Maggio
Università di Napoli “Federico II”
26th Sept. 2012, ICSM 2012 @ Riva del Garda (Trento), Italy
MOTIVATIONS IR FOR SE
IR FOR NATURAL LANGUAGE
1. Tokenization
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
2. Normalization
   1. Change to Lower case
   draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...
   2. Remove StopWords
   3. Apply Stemming
   draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...
   4. ...
Implicit assumption: The “same” words are used whenever a particular concept is described
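The pipeline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, and `crude_stem` is a toy suffix stripper standing in for a real stemmer (for example Porter's), not the one used in the talk.

```python
import re

# Hypothetical, tiny stop-word list; real IR systems use much larger ones.
STOP_WORDS = {"the", "are", "a", "an", "of", "is"}


def tokenize(text):
    """Step 1: split raw text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)


def crude_stem(word):
    """Toy suffix stripper (assumption: stands in for a real stemmer)."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(tokens):
    """Step 2: lower-case, drop stop words, then stem."""
    lowered = [t.lower() for t in tokens]
    kept = [t for t in lowered if t not in STOP_WORDS]
    return [crude_stem(t) for t in kept]


tokens = tokenize("Draws the NullHandle box. Rectangle displayBox")
terms = normalize(tokens)
print(terms)
```

Note that compound identifiers such as `NullHandle` and `displayBox` survive as single terms (`nullhandl`, `displaybox`), which is exactly the gap that identifier splitting is meant to close.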
IR FOR SOURCE CODE
1. Tokenization
1.5 Identifier Splitting
2. Normalization
• snake_case Splitter: r'(?<=\w)_'
• display_box ==> display | box
• camelCase/PascalCase Splitter: r'(?<!^)([A-Z][a-z]+)'
• displayBox ==> display | Box
draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
• camelCase Splitter: r'(?<!^)([A-Z][a-z]+)'
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
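The convention-based splitters and their failure cases can be reproduced with Python's `re` module. The helper name `split_identifier` is hypothetical; the two patterns are the ones quoted on the slides.

```python
import re

# Patterns quoted on the slides.
SNAKE = re.compile(r"(?<=\w)_")
CAMEL = re.compile(r"(?<!^)([A-Z][a-z]+)")


def split_identifier(identifier):
    """Split on underscores first, then on camelCase/PascalCase humps."""
    parts = []
    for chunk in SNAKE.split(identifier):
        # Put a space before every capitalised hump, then split on spaces.
        parts.extend(CAMEL.sub(r" \1", chunk).split())
    return parts


for name in ("display_box", "displayBox", "drawXORRect", "drawxorrect"):
    print(name, "==>", " | ".join(split_identifier(name)))
```

The last two identifiers show the limitation: `drawXORRect` keeps `drawXOR` glued together, and `drawxorrect` carries no convention to split on at all.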
Splitting algorithms based on naming conventions are not robust enough
• Heavy use of Abbreviations in the source code
• rect for Rectangle
• r for Rectangle
IDENTIFIER MAPPING
1. Tokenization
1.5 Identifier Mapping
2. Normalization
• SAMURAI (Enslen et al., 2011)
• TIDIER (Guerrouj et al., 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• AMAP (Hill and Pollock, 2008)
• ...
• LINSEN
draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
LINSEN ALGORITHM
CONTRIBUTION
• Novel technique for Identifier Mapping
• Based on an efficient String Matching technique: the Baeza-Yates & Perleberg (BYP) algorithm
• Applied on a Graph-based model
• Able to both split identifiers and expand any abbreviations occurring in them
THE AMBIGUITY PROBLEM
There could be multiple and equally correct splitting or expansion solutions
• r for Rectangle OR red
• nsISupports ==> ns | IS | up | ports OR
              ==> ns | I | Supports
DICTIONARIES
Application-aware Dictionaries
(22,940 Entries)
(108,315 Entries)
(588 Entries)
GRAPH MODEL
Model: Weighted Directed Graph
Example: drawXORRect identifier
• NODES correspond to characters of the current identifier
• ARCS correspond to matchings between identifier substrings and dictionary words
   • Application of the String Matching Algorithm (BYP)
   • Padding Arcs to ensure the Graph is always connected
• Every Arc is labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
• Cost function [c(“word”)] favors longer words and words coming from the application-aware dictionaries
• The final Mapping Solution corresponds to the sequence of labels in the path with the minimum cost (Dijkstra's algorithm)
[Figure: graph for drawXORRect with nodes 0 to 11. Padding arcs are labelled with the single characters “d”, “r”, “a”, “w”, “X”, “O”, “R”, “R”, “e”, “c”, “t”, each with cost C-MAX. Dictionary arcs: “draw”, c(“draw”); “raw”, c(“raw”); “xor”, c(“xor”); “or”, c(“or”); “rectangle”, c(“rectangle”).]
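The graph model can be sketched as follows. This is a minimal illustration, not the LINSEN implementation: the value of `C_MAX` and the cost function `cost` (which only rewards word length and ignores the dictionary of origin) are assumptions, and the dictionary arcs are supplied by hand rather than produced by BYP.

```python
import heapq

C_MAX = 10.0  # assumed cost of a single-character padding arc


def cost(word):
    """Hypothetical cost function: longer words are cheaper."""
    return C_MAX / len(word)


def build_graph(identifier, matches):
    """Nodes 0..len(identifier); arcs are (target, label, weight) tuples."""
    arcs = {i: [] for i in range(len(identifier) + 1)}
    # Padding arcs keep the graph connected even without any dictionary match.
    for i, ch in enumerate(identifier):
        arcs[i].append((i + 1, ch, C_MAX))
    # Dictionary arcs: (start, end, word) spans over the identifier.
    for start, end, word in matches:
        arcs[start].append((end, word, cost(word)))
    return arcs


def best_mapping(identifier, matches):
    """Labels along the minimum-cost path from node 0 to the last node (Dijkstra)."""
    arcs = build_graph(identifier, matches)
    target = len(identifier)
    dist, back, heap = {0: 0.0}, {}, [(0.0, 0)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, label, weight in arcs[node]:
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                back[nxt] = (node, label)
                heapq.heappush(heap, (nd, nxt))
    labels, node = [], target
    while node != 0:
        node, label = back[node]
        labels.append(label)
    return labels[::-1]


matches = [(0, 4, "draw"), (1, 4, "raw"), (4, 7, "xor"), (5, 7, "or"), (7, 11, "rectangle")]
print(best_mapping("drawXORRect", matches))
```

With these assumed weights the cheapest path uses the arcs “draw”, “xor” and “rectangle”; with no dictionary arcs at all, the padding arcs still yield a path made of single characters.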
STRING MATCHING
• Application of the Baeza-Yates and Perleberg (BYP) Algorithm
• Signature: BYP(identifier, word, φ(word))
• identifier: target string
• word: string to match
• φ(·): Tolerance (Error) function
• Bounds the length of acceptable matchings
Advantage: The same algorithm serves both the splitting and the expansion steps, with a different Tolerance function as input
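To make the role of the tolerance concrete, here is a naive sliding-window matcher that accepts a window when it differs from the word in at most `tolerance` positions. It is a stand-in with the same call shape as the signature above, not the BYP algorithm itself, and `phi_one_error` is a hypothetical tolerance function chosen only for the example.

```python
def match_positions(identifier, word, tolerance):
    """Return (start, end, mismatches) for every window of `identifier`
    that differs from `word` in at most `tolerance` positions."""
    text, pattern = identifier.lower(), word.lower()
    m = len(pattern)
    if m == 0:
        return []
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= tolerance:
            hits.append((start, start + m, mismatches))
    return hits


def phi_split(word):
    """Exact matching: no errors allowed."""
    return 0


def phi_one_error(word):
    """Hypothetical tolerance: allow a single mismatch."""
    return 1


print(match_positions("drawXORRect", "xor", phi_split("xor")))
print(match_positions("drawXORRect", "red", phi_one_error("red")))
```

Swapping the tolerance function is the only change between the two calls: with zero tolerance “xor” is found exactly, while one tolerated error lets “red” match the substring “Rec”.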
BYP FOR SPLITTING
• BYP(identifier, word, φSplit(word))
φSplit: Exact Matching (i.e., No Errors allowed)
Identifier: drawXORRect
DFile: ... draw, the, are, null, handle, box, red, rectangle, ...
DCOMPUTER-SCIENCE: ... echo, testing, threading, xpm, xor, ...
DEnglish: ... abort absolute abstract ... or ... raw ...
[Figure: graph for drawXORRect with nodes 0 to 11 and single-character padding arcs of cost C-MAX. Scanning the dictionaries in turn adds the arcs “draw”, c(“draw”) from DFile; “xor”, c(“xor”) from DCOMPUTER-SCIENCE; “raw”, c(“raw”) and “or”, c(“or”) from DEnglish.]
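The dictionary scan for the splitting step can be sketched as below. The word lists are only the excerpts visible on the slides, the name `DComputerScience` and the helper `exact_arcs` are hypothetical, and plain `str.find` is used in place of BYP with zero tolerance.

```python
# Dictionary excerpts as shown on the slides, scanned in this order.
DICTIONARIES = [
    ("DFile", ["draw", "the", "are", "null", "handle", "box", "red", "rectangle"]),
    ("DComputerScience", ["echo", "testing", "threading", "xpm", "xor"]),
    ("DEnglish", ["abort", "absolute", "abstract", "or", "raw"]),
]


def exact_arcs(identifier):
    """Collect (start, end, word, dictionary) for every exact occurrence."""
    text = identifier.lower()
    arcs = []
    for name, words in DICTIONARIES:
        for word in words:
            start = text.find(word)
            while start != -1:
                arcs.append((start, start + len(word), word, name))
                start = text.find(word, start + 1)
    return arcs


for arc in exact_arcs("drawXORRect"):
    print(arc)
```

Keeping the dictionary name on each arc leaves room for a cost function that favors application-aware dictionaries, as the graph model requires.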
BYP FOR EXPANSION
• BYP(identifier, word, φExp(word))
• φExp: Approximate Matching
Identifier: drawXORRect
DFile: ... draw, the, are, null, handle, box, red, rectangle, ...
DCOMPUTER-SCIENCE: ... echo, testing, threading, xpm, xor, ...
DEnglish: ... abort absolute abstract ... or ... raw ...
[Figure: the same graph with the arcs “draw”, c(“draw”) and “xor”, c(“xor”) found during splitting. Approximate matching against DFile adds the candidate arcs “red”, c(“red”) and “rectangle”, c(“rectangle”).]
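The slides do not spell out φExp, so the following is only a loose illustration of why approximate matching produces several expansion candidates. It assumes a simple rule that is not taken from LINSEN: a dictionary word is a candidate when it is longer than the abbreviation, starts with the same letter, and contains the abbreviation's letters in order.

```python
def is_subsequence(abbrev, word):
    """True if the letters of `abbrev` appear in `word` in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbrev)


def expansion_candidates(abbrev, dictionary):
    """Hypothetical candidate rule: same first letter, letters kept in order."""
    a = abbrev.lower()
    if not a:
        return []
    return [
        w for w in dictionary
        if len(w) > len(a) and w[0] == a[0] and is_subsequence(a, w)
    ]


d_file = ["draw", "the", "are", "null", "handle", "box", "red", "rectangle"]
print(expansion_candidates("r", d_file))
print(expansion_candidates("Rect", d_file))
```

The single letter `r` yields both “red” and “rectangle”, which is the ambiguity the cost function and the shortest-path search have to resolve.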
EMPIRICAL EVALUATION
RESEARCH QUESTIONS
• RQ1: How does LINSEN compare with state-of-the-art approaches in splitting identifiers?
• RQ2: How does LINSEN compare with state-of-the-art approaches in mapping identifiers to dictionary words?
• RQ3: How well does LINSEN deal with different types of abbreviations?
CASE STUDIES
• DTW (Madani et al., 2010): RQ1 only
• GenTest+Normalize (Lawrie and Binkley, 2011): RQ1 and RQ2
• AMAP (Hill and Pollock, 2008): RQ3 only
LUDISO Dataset (2012): 15 out of 750 software systems, covering 58% of total identifiers
EVALUATION METRICS
Comparability of Results: Accuracy rate
• Identifier Level evaluation: Each mapping result must be completely correct
• Soft-word Level evaluation: “Partial credit” given to each word correctly mapped
For the comparison with GenTest+Normalize (Lawrie and Binkley, 2011)
Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj et al., 2011]
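The difference between the two accuracy levels can be illustrated as follows. The exact partial-credit rule is not given on the slides; this sketch assumes that each oracle word found in the predicted mapping earns one point, and the sample data are made up for the example.

```python
def identifier_level_accuracy(predicted, gold):
    """Fraction of identifiers whose whole mapping equals the oracle."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def soft_word_accuracy(predicted, gold):
    """Assumed partial credit: one point per oracle word also predicted."""
    credit = 0
    total = 0
    for p, g in zip(predicted, gold):
        total += len(g)
        remaining = list(p)
        for word in g:
            if word in remaining:
                remaining.remove(word)  # each predicted word counts once
                credit += 1
    return credit / total


# Made-up oracle and predictions for two identifiers.
gold = [["draw", "xor", "rectangle"], ["display", "box"]]
predicted = [["draw", "xor", "red"], ["display", "box"]]

print(identifier_level_accuracy(predicted, gold))
print(soft_word_accuracy(predicted, gold))
```

One wrong expansion costs the first identifier all of its credit at identifier level, but only one word out of five at soft-word level.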
RESULTS RQ1: SPLITTING
[Figure: Accuracy rates for the comparison with DTW (Madani et al., 2010) on JHotDraw 5.1 and Lynx 2.8.5; bars for DTW and LINSEN.]
[Figure: Accuracy rates for the comparison with GenTest (Lawrie and Binkley, 2011) on which 2.20 and a2ps 4.14; panels: Identifier Level and Soft-word Level; bars for GenTest and LINSEN.]
RESULTS RQ2: MAPPING
[Figure: Accuracy rates for the comparison with Normalize (Lawrie and Binkley, 2011) on which 2.20 and a2ps 4.14; panels: Identifier Level and Soft-word Level; bars for Normalize and LINSEN.]
RESULTS RQ3: EXPANSION
[Figure: Accuracy rates for the comparison with AMAP (Hill and Pollock, 2008) by abbreviation type (CW, DL, OO, AC, PR, SL); bars for AMAP and LINSEN.]
CW: Combination Words; DL: Dropped Letters; OO: Others; AC: Acronyms; PR: Prefix; SL: Single Letters
CONCLUSIONS
FUTURE WORKS
• Evaluation of the impact of each adopted dictionary on the performance
• Improve, change, or add dictionaries
• Improve the implementation of the prototype to speed up the computation
• Make use of parallel computation to process each identifier in isolation
THANK YOU
Valerio Maggio
Ph.D. Student, University of Naples “Federico II”