Date post: | 21-Jan-2018 |
Category: | Technology |
Upload: | masudur-rahman |
Finding Similar Projects in GitHub using Word2Vec and WMD
MD MASUDUR RAHMAN
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF VIRGINIA
Introduction
Given project details (description and source code), the aim is to find functionally similar projects.
Finding functionally similar projects is important for:
Application/project recommendation
Code re-use, rapid prototyping
Discovering code plagiarism
CS@UVa
How do developers search for similar projects?
General-purpose search (Google)
Query: android browser
Tries to find applications relevant to the query
Not intended for searching source code
GitHub Search: android browser
Mostly keyword-based search on textual content: project name, description, etc.
Open and analyze jar, class, apk, etc.
Might rank irrelevant projects at the top
Less textual content
Use source code content: augment the content with method, class, and API names
Model Workflow
GitHub Projects
Data Preprocessing, per feature (Tokenization, Normalization, Stemming, Stopword Removal, TF-IDF score based word filtering)
Feature Extraction (Description, Readme, Method & Class Name, API Package Name, API Class Name)
Document Generation (all features combined)
Search Interface
Candidate Project Documents
Query Project Documents
Document Similarity Computation (Word2Vec, WMD)
Search Result (ranked list of similar projects)
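The preprocessing and document generation steps of the workflow can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the tiny stopword list, and the sample project are all hypothetical, and stemming and TF-IDF based filtering are left out to keep the sketch dependency-free.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "for", "of", "to", "and", "in", "is", "on", "with"}


def split_identifier(name):
    """Split camelCase and punctuation-separated identifiers into separate words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return re.split(r"[^A-Za-z0-9]+", spaced)


def preprocess(text):
    """Tokenize, lowercase, and drop stopwords and one-character tokens."""
    tokens = []
    for raw in text.split():
        for part in split_identifier(raw):
            word = part.lower()
            if len(word) > 1 and word not in STOPWORDS:
                tokens.append(word)
    return tokens


def build_document(features):
    """Concatenate the preprocessed tokens of every feature into one document."""
    document = []
    for text in features.values():
        document.extend(preprocess(text))
    return document


# Hypothetical project: one text value per extracted feature.
project = {
    "description": "Image gallery app for Lollipop",
    "method_class_names": "loadImage GalleryActivity",
    "api_class_names": "android.graphics.Bitmap",
}
print(build_document(project))
```

Splitting identifiers such as `loadImage` into `load` and `image` is what lets method, class, and API names contribute ordinary words to the project document.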
How to measure document similarity?
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
Keyword-based cosine similarity with Bag of Words (BOW)
No common keyword! Cosine similarity = 0
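The zero score for these two documents can be reproduced with a small bag-of-words cosine similarity. The sketch below assumes plain whitespace tokenization and raw term counts; the function name `bow_cosine` is made up for illustration.

```python
import math
from collections import Counter


def bow_cosine(doc_a, doc_b):
    """Cosine similarity between raw term-count vectors of two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)  # Counter returns 0 for missing words
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d1 = "image gallery app for Lollipop"
d2 = "android photo viewer"
print(bow_cosine(d1, d2))  # 0.0, because the documents share no keyword
```

The dot product only grows when the same word appears in both documents, so related words such as "image" and "photo" contribute nothing.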
How to measure document similarity?
Document 1: image gallery app for Lollipop
Document 2: android photo viewer
Word Embedding
[Figure: word vectors w1, w2, w3, w4]
Word Embedding
“You shall know a word by the company it keeps” – J. R. Firth, 1957
Open source upgrade path for Odoo/OpenERP
Plugin to check for obvious upgrade points on the path to 3.0
Codes related to upgrade project
Demo app to demonstrate how to upgrade from Angular 1 to Angular 2
Learn the word vector for "upgrade" from its surrounding words: Word2Vec
upgrade: [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
Word2Vec
Input: Text corpus
Training: Word2Vec Model
Output: Word vectors (Word Embedding)
upgrade: [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]
Word2Vec Model
Document: image gallery app for android
Skip-gram over the words: image, gallery, app, for, android
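To make the skip-gram idea concrete, the sketch below lists the (center word, context word) training pairs that a skip-gram model would see for this document. The window size of 1 and the function name `skipgram_pairs` are assumptions for illustration; real Word2Vec implementations add further details such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "image gallery app for android".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
```

Each pair asks the model to predict a neighboring word from the center word, which is how words that occur in similar contexts end up with similar vectors.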
Example Word Embedding
In the embedded space, words with similar meanings are clustered together:
image, photo, picture, figure
sample, example, demo, illustration
upgrade, update, modify, change
install, setup, launch, change
dimension, size, height, length, range
Embedding for each word
How to get document/sentence level similarity? Word Mover’s Distance (WMD)
Word Mover’s Distance (WMD)
D1: image, gallery, app, Lollipop
D2: android, photo, viewer
[Figure, built up over several slides: arrows from the words of D1 to the words of D2, labeled with word-to-word distances; the final slide keeps the distances 0.15, 0.2, 0.1, and 0.1]
Similarity Score(D1, D2) = 0.1 + 0.2 + 0.15 + 0.1 = 0.55
Smaller score means more similar
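The matching idea can be sketched with a relaxed variant of WMD, in which every word simply moves all of its weight to the nearest word of the other document. This is a lower bound on the exact WMD, which requires solving an optimal transport problem, and it is not necessarily how the score on the slide was computed: the slide adds raw distances, while the sketch weights each word by its share of the document. The 2-D vectors below are invented for illustration and do not come from a trained Word2Vec model.

```python
import math

# Hypothetical 2-D embeddings, invented for illustration only.
TOY_VECTORS = {
    "image": (0.90, 0.10),
    "photo": (0.85, 0.15),
    "gallery": (0.70, 0.30),
    "viewer": (0.65, 0.35),
    "app": (0.30, 0.70),
    "android": (0.20, 0.80),
    "lollipop": (0.15, 0.90),
}


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst, vectors):
    """Each word in src sends its weight (1/len(src)) to its nearest word in dst."""
    weight = 1.0 / len(src)
    return sum(
        weight * min(euclidean(vectors[s], vectors[t]) for t in dst)
        for s in src
    )


def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD: the larger of the two one-sided nearest-word costs."""
    return max(
        one_sided_cost(doc_a, doc_b, vectors),
        one_sided_cost(doc_b, doc_a, vectors),
    )


d1 = ["image", "gallery", "app", "lollipop"]
d2 = ["android", "photo", "viewer"]
print(round(relaxed_wmd(d1, d2, TOY_VECTORS), 3))
```

As on the slide, a smaller value means the documents are more similar, and two documents with no word in common can still receive a small distance when their words lie close together in the embedding space.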
Preliminary Results
Query/Rank | Project Name | Description | Project Type
Query | android_browser | Customize android webclient (source code with readme file) | Lightning based android browser
1 | Myfacebook | MyFacebook source code | Lightning based android browser
2 | Speed-Browser-4G-Plus | Speed Browser 4G Plus is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. | Lightning based android browser
3 | Web-browser | Web browser is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0. | Lightning based android browser
4 | JumpGo | JumpGo Web Browser for Android | JumpGo Android Browser
5 | VChrome | Build an test browser for Viettel in job interview | Android Browser
Summary
We proposed a model for finding functionally similar projects in GitHub
Used textual and source code content to construct documents
Measured similarity between documents using Word Mover’s Distance
Leveraged Word2Vec word embeddings
Reference
Word2Vec: Gensim Python library. https://radimrehurek.com/gensim/models/word2vec.html
WMD. https://github.com/mkusner/wmd
Wikipedia Dump. https://dumps.wikimedia.org/enwiki/
GitHub Projects Data: The GHTorrent project. http://ghtorrent.org/
Questions?