Finding Similar Projects in GitHub using Word2Vec and WMD

Date post: 21-Jan-2018
Upload: masudur-rahman
Transcript
Page 1: Finding Similar Projects in GitHub using Word2Vec and WMD

Finding Similar Projects in GitHub using Word2Vec and WMD

MD MASUDUR RAHMAN

DEPARTMENT OF COMPUTER SCIENCE

UNIVERSITY OF VIRGINIA

Page 2: Finding Similar Projects in GitHub using Word2Vec and WMD

Introduction

Given project details (description and source code), the aim is to find functionally similar projects.

Finding functionally similar projects is important for:

Application/project recommendation

Code re-use, rapid prototyping

Discovering code plagiarism

How do developers search for similar projects?

CS@UVa 2

Page 3: Finding Similar Projects in GitHub using Word2Vec and WMD

General-Purpose Search (Google)

Query: android browser

Tries to find applications relevant to the query

Not intended for searching source code

Page 4: Finding Similar Projects in GitHub using Word2Vec and WMD

GitHub Search

Query: android browser

Mostly keyword-based search on textual content (project name, description, etc.)

Open and analyze jar, class, apk, etc.

Might rank irrelevant projects at the top

Little textual content

Use source code content: augment the content with method, class, and API names
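
As a rough illustration of augmenting a project's text with source code names, the sketch below pulls class names, method names, and imported API package/class names out of a Java file with regular expressions, then splits identifiers into words. The Java snippet, the regular expressions, and the helper names (`extract_features`, `split_identifier`) are assumptions made for this example; the slides do not say how the names were actually extracted, and a real pipeline would more likely use a Java parser.

```python
import re

# Hypothetical Java snippet standing in for one source file of a project.
JAVA_SOURCE = """
import android.webkit.WebView;
import android.graphics.Bitmap;

public class BrowserActivity {
    private void loadHomePage(String url) { }
    public Bitmap captureScreenshot() { return null; }
}
"""

IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)\s*;", re.MULTILINE)
CLASS_RE = re.compile(r"\bclass\s+(\w+)")
# Simplified pattern: access modifier, return type, method name, "(".
METHOD_RE = re.compile(r"\b(?:public|private|protected)\s+[\w<>\[\]]+\s+(\w+)\s*\(")


def split_identifier(name):
    """Split a camelCase or PascalCase identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def extract_features(source):
    """Collect class, method, and imported API names from Java source text."""
    imports = IMPORT_RE.findall(source)
    return {
        "class_names": CLASS_RE.findall(source),
        "method_names": METHOD_RE.findall(source),
        "api_packages": [name.rsplit(".", 1)[0] for name in imports],
        "api_classes": [name.rsplit(".", 1)[-1] for name in imports],
    }


features = extract_features(JAVA_SOURCE)
# Flatten every extracted name into plain words that can join the description text.
tokens = [w for names in features.values() for n in names for w in split_identifier(n)]
print(features)
print(tokens)
```

Splitting identifiers matters because `loadHomePage` on its own matches nothing in a description, while `load`, `home`, and `page` are ordinary words.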

Page 5: Finding Similar Projects in GitHub using Word2Vec and WMD

Model Workflow

GitHub Projects

Feature Extraction (Description, Readme, Method & Class Name, API Package Name, API Class Name)

Data Preprocessing, per feature (Tokenization, Normalization, Stemming, Stopword Removal, TF-IDF score based word filtering)

Document Generation (all features combined)

Candidate Project Documents

Search Interface: Query Project Documents

Document Similarity Computation (Word2Vec, WMD)

Search Result (ranked list of similar projects)
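
The preprocessing steps named in the workflow can be sketched in a few lines of standard-library Python. Everything specific here is an assumption for illustration: the tiny stopword list, the crude `naive_stem` function (a stand-in for a real stemmer such as Porter's), the TF-IDF formula, and the choice to keep the highest-scoring words per document. The slides do not state which variants were used.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"a", "an", "the", "for", "is", "of", "to", "and", "on", "in"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, lowercase, drop stopwords, and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def tfidf_filter(docs, keep_top=3):
    """Keep the keep_top highest TF-IDF words of each tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    filtered = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Ties are broken alphabetically so the result is deterministic.
        top = sorted(scores, key=lambda w: (-scores[w], w))[:keep_top]
        filtered.append([w for w in doc if w in top])
    return filtered


raw = [
    "Image gallery app for Lollipop",
    "Android photo viewer",
    "A gallery of Android apps",
]
docs = [preprocess(text) for text in raw]
print(docs)
print(tfidf_filter(docs, keep_top=2))
```

Words that appear in most documents get a low score and are filtered out first, which is the usual reason for TF-IDF filtering.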

Page 7: Finding Similar Projects in GitHub using Word2Vec and WMD

How to measure document similarity?

Keyword-based cosine similarity: Bag of Words (BOW)

Document 1: image gallery app for Lollipop

Document 2: android photo viewer

No common keyword! Cosine similarity = 0
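
The zero score is easy to reproduce. The sketch below is a minimal bag-of-words cosine similarity using only the standard library; the function name `cosine_bow` and the whitespace tokenization are choices made for this example.

```python
import math
from collections import Counter


def cosine_bow(doc_a, doc_b):
    """Cosine similarity between two documents under a bag-of-words model."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


d1 = "image gallery app for Lollipop"
d2 = "android photo viewer"
print(cosine_bow(d1, d2))  # 0.0, because the two documents share no word
```

The two descriptions refer to the same kind of app, yet the score cannot tell them apart from two unrelated documents.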

Page 8: Finding Similar Projects in GitHub using Word2Vec and WMD

How to measure document similarity?

Word Embedding

Document 1: image gallery app for Lollipop

Document 2: android photo viewer

[Figure: word vectors w1, w2, w3, w4]

Page 9: Finding Similar Projects in GitHub using Word2Vec and WMD

Word Embedding

“You shall know a word by the company it keeps” (J. R. Firth, 1957)

Example project descriptions containing the word "upgrade":

Open source upgrade path for Odoo/OpenERP

Plugin to check for obvious upgrade points on the path to 3.0

Codes related to upgrade project

Demo app to demonstrate how to upgrade from Angular 1 to Angular 2

Word2Vec learns the word vector for "upgrade" from its surrounding words:

upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]

Page 10: Finding Similar Projects in GitHub using Word2Vec and WMD

Word2Vec

Input: Text corpus

Training: Word2Vec Model

Output: Word vectors (word embedding)

upgrade = [0.286, 0.792, -0.171, -0.105, 0.544, 0.351, -0.653, 0.274]

Page 11: Finding Similar Projects in GitHub using Word2Vec and WMD

Word2Vec Model

Skip-gram

Document: image gallery app for android

[Figure: skip-gram model over the words image, gallery, app, for, android]
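
A skip-gram model is trained to predict the context words around each center word. The sketch below only generates those (center, context) training pairs for the example document; it does not train a model. The window size is an assumption chosen for illustration, since the slide does not give one.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


doc = "image gallery app for android".split()
for center, context in skipgram_pairs(doc, window=1):
    print(center, "->", context)
```

With `window=1`, "gallery" is paired with "image" and "app"; a larger window also pairs it with "for".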

Page 12: Finding Similar Projects in GitHub using Word2Vec and WMD

Example Word Embedding

In the embedded space, words with similar meaning cluster together.

[Figure: clusters of words in the embedded space]

image, photo, picture, figure

sample, example, demo, illustration

upgrade, update, modify, change

install, setup, launch, change

dimension, size, height, length, range

Embedding for each word
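
Clustering in the embedded space means that a word's nearest neighbours, by cosine similarity, are words of similar meaning. The sketch below shows the lookup on made-up 2-D vectors; the numbers are invented for illustration and are not taken from the trained model, whose vectors would have far more dimensions.

```python
import math

# Made-up 2-D vectors for illustration only.
EMBEDDINGS = {
    "image": (0.9, 0.1),
    "photo": (0.85, 0.2),
    "picture": (0.8, 0.15),
    "upgrade": (0.1, 0.9),
    "update": (0.15, 0.85),
    "install": (-0.7, 0.6),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def most_similar(word, k=2):
    """Return the k words whose vectors are closest to `word` by cosine similarity."""
    query = EMBEDDINGS[word]
    others = [(w, cosine(query, vec)) for w, vec in EMBEDDINGS.items() if w != word]
    return sorted(others, key=lambda item: -item[1])[:k]


print(most_similar("image"))
print(most_similar("upgrade", k=1))
```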

How to get document/sentence-level similarity? Word Mover’s Distance (WMD)

Page 13: Finding Similar Projects in GitHub using Word2Vec and WMD

Word Mover’s Distance (WMD)

D1: image, gallery, app, Lollipop

D2: android, photo, viewer

[Figure: word-to-word distances between D1 and D2: 0.1, 0.5, 0.7]

Page 15: Finding Similar Projects in GitHub using Word2Vec and WMD

Word Mover’s Distance

D1: image, gallery, app, Lollipop

D2: android, photo, viewer

[Figure: word-to-word distances between D1 and D2: 0.35, 0.2, 0.6]

Page 16: Finding Similar Projects in GitHub using Word2Vec and WMD

Word Mover’s Distance

D1: image, gallery, app, Lollipop

D2: android, photo, viewer

[Figure: word-to-word distances between D1 and D2: 0.35, 0.15, 0.2]

Page 17: Finding Similar Projects in GitHub using Word2Vec and WMD

Word Mover’s Distance

D1: image, gallery, app, Lollipop

D2: android, photo, viewer

[Figure: word-to-word distances between D1 and D2: 0.4, 0.3, 0.1]

Page 18: Finding Similar Projects in GitHub using Word2Vec and WMD

Word Mover’s Distance

D1: image, gallery, app, Lollipop

D2: android, photo, viewer

[Figure: selected word-to-word distances between D1 and D2: 0.15, 0.2, 0.1, 0.1]

Similarity Score(D1, D2) = 0.1 + 0.2 + 0.15 + 0.1 = 0.55

A smaller score means the documents are more similar.
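
The arithmetic behind the score can be sketched as follows. This reads the numbers on the preceding slides as the distances from each word of D1 to the three words of D2, with the smallest one kept per word. That reading is an assumption, and the transcript does not preserve which word each number belongs to, so the rows are left unlabeled. Note also that this nearest-word sum only mirrors the slide's illustration: the full Word Mover's Distance solves a transportation problem over normalized word weights, and matching each word to its closest counterpart corresponds to a relaxed form of it.

```python
# Assumed reading of the slides: one row per word of D1, one column per word
# of D2. Which word each row or column stands for is not recoverable here.
DISTANCES = [
    [0.1, 0.5, 0.7],
    [0.35, 0.2, 0.6],
    [0.35, 0.15, 0.2],
    [0.4, 0.3, 0.1],
]


def nearest_word_score(distances):
    """Sum, over the words of D1, of the distance to the closest word of D2."""
    return sum(min(row) for row in distances)


score = nearest_word_score(DISTANCES)
print(round(score, 2))  # 0.55
```

To rank candidate projects, the same score would be computed between the query document and every candidate document, and the candidates sorted in ascending order.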

Page 19: Finding Similar Projects in GitHub using Word2Vec and WMD

Preliminary Results

Query/Rank | Project Name | Description | Project Type
Query | android_browser | Customize android webclient (source code with readme file) | Lightning based android browser
1 | Myfacebook | MyFacebook source code | Lightning based android browser
2 | Speed-Browser-4G-Plus | Speed Browser 4G Plus is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0.. | Lightning based android browser
3 | Web-browser | Web browser is based on Lightning Browser, and licensed under the Mozilla Public License, v. 2.0.. | Lightning based android browser
4 | JumpGo | JumpGo Web Browser for Android | JumpGo Android Browser
5 | VChrome | Build an test browser for Viettel in job interview | Android Browser

Page 20: Finding Similar Projects in GitHub using Word2Vec and WMD

Summary

We proposed a model for finding functionally similar projects in GitHub.

Used textual and source code content to construct a document for each project

Measured similarity between documents using Word Mover’s Distance, leveraging Word2Vec word embeddings

Page 21: Finding Similar Projects in GitHub using Word2Vec and WMD

References

Word2Vec: Gensim Python library. https://radimrehurek.com/gensim/models/word2vec.html

WMD. https://github.com/mkusner/wmd

Wikipedia dump. https://dumps.wikimedia.org/enwiki/

GitHub projects data: the GHTorrent project. http://ghtorrent.org/

Page 22: Finding Similar Projects in GitHub using Word2Vec and WMD

Questions?
