KDD Cup 2013 Author Paper Identification Final Report


Ben Deng – M10112006

Outline

Problem Description

Database Analysis

Research Issue

Proposed Ideas

Results

Problem Description

The research community comprises more than 50 million publications and 19 million authors.

However, every journal, letter, and conference has its own format, including its format for author names. These differing formats can lead to author-name ambiguity: for instance, abbreviations, identical names, name misspellings, and pseudonyms.

All of these problems can result in publications being assigned to the wrong authors, which is a major problem when searching for a specific author. The main goal is to recognize each author and correctly assign publications to them.

Database Analysis

Author.csv

Affiliation (missing data, noise)

PaperAuthor.csv

PaperID

AuthorID

Affiliation (missing data, noise)

Paper.csv

ConferenceId

JournalId

Keywords (missing data)

Journal.csv

ShortName

FullName

HomePage

Conference.csv

ShortName

FullName

HomePage

Research Issue

A lot of data is missing

Noise in the affiliation column (especially with foreign affiliations)

Name ambiguity (especially names of Chinese origin)

Authors have different abbreviations in different journals and/or conferences
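The abbreviation and ambiguity issues listed above can be illustrated with a small sketch. This is not the code used for the submission: the helper name_key is hypothetical, and reducing a name to "first initial plus last token" is only one assumed way of comparing abbreviated and full names.

```python
import unicodedata


def name_key(full_name):
    """Reduce an author name to a coarse key: first initial plus last token.

    Hypothetical helper for illustration; not part of the original work.
    """
    # Strip accents so that accented and unaccented spellings compare equal.
    decomposed = unicodedata.normalize("NFKD", full_name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat periods as separators, lowercase, and split into tokens.
    tokens = plain.replace(".", " ").lower().split()
    if not tokens:
        return ""
    if len(tokens) == 1:
        return tokens[0]
    return tokens[0][0] + " " + tokens[-1]
```

With this key, "Ben Deng" and "B. Deng" collapse to the same value, which helps with abbreviations. Two different people such as "Wei Wang" and "Wen Wang" also collapse to the same value, which is exactly the ambiguity described above.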

Proposed Ideas

Filling missing data.

Counting how many different affiliations the same author has.

Using keywords, how many times the same keyword was used.

Class weight is fixed to be auto.
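The two counting ideas above could be sketched as follows. This is a minimal illustration, not the code used for the submission. It assumes the rows have already been read into dictionaries (for example with csv.DictReader), that the column names match the ones listed in the Database Analysis slide, and that keywords are separated by whitespace; all three are assumptions.

```python
from collections import Counter, defaultdict


def affiliation_counts(paper_author_rows):
    """Map each AuthorID to its number of distinct non-empty affiliations."""
    seen = defaultdict(set)
    for row in paper_author_rows:
        # Looking up the author creates an empty set on first sight,
        # so authors with only missing affiliations get a count of 0.
        affiliations = seen[row["AuthorID"]]
        affiliation = (row.get("Affiliation") or "").strip().lower()
        if affiliation:
            affiliations.add(affiliation)
    return {author: len(affs) for author, affs in seen.items()}


def keyword_counts(paper_rows):
    """Count how many times each keyword appears across all papers."""
    counts = Counter()
    for row in paper_rows:
        # Assumption: keywords are whitespace-separated; missing values are skipped.
        counts.update((row.get("Keywords") or "").lower().split())
    return counts
```

Lowercasing and stripping before counting is a deliberate choice here, so that trivial spelling differences in the noisy affiliation column do not inflate the counts.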

Filling missing data

Missing values are filled in order to normalize the tables, so that a one-to-one join can be created between them that maps each column1 to a single column2, provided there should indeed be exactly one column2 per column1.

SQL Code

UPDATE table t

SET column2 = c.column2 FROM (SELECT column1, MAX(column2) AS column2 FROM table WHERE column2 IS NOT NULL GROUP BY column1) c

WHERE t.column2 IS NULL AND t.column1 = c.column1;
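For readers who prefer Python, the same fill rule as the SQL statement above can be sketched on rows held in memory. This is an illustration under assumptions, not the code used for the submission: the function name fill_missing is hypothetical, rows are assumed to be dictionaries, and empty strings are treated as missing in addition to None (the SQL version only treats NULL as missing).

```python
def fill_missing(rows, key, column):
    """Fill missing `column` values with the largest known value for the same `key`.

    Hypothetical helper: mirrors MAX(column2) ... GROUP BY column1 from the SQL.
    """
    # First pass: find the largest non-missing value per key.
    best = {}
    for row in rows:
        value = row.get(column)
        if value:  # skips None and empty strings
            k = row[key]
            if k not in best or value > best[k]:
                best[k] = value

    # Second pass: copy each row and fill the gaps where a value is known.
    filled = []
    for row in rows:
        new_row = dict(row)
        if not new_row.get(column) and new_row[key] in best:
            new_row[column] = best[new_row[key]]
        filled.append(new_row)
    return filled
```

Rows whose key has no known value anywhere are left unchanged, and the input rows are not modified.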

Simulation and Results

Random Forest (Classifier)

Gradient Boosting (Classifier)

Decision Tree (Classifier)

K Nearest (Classifier)
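The slides do not state how the scores reported for these classifiers are computed. Assuming they are mean average precision (MAP) over each author's ranked list of candidate papers, which is an assumption here and not something the slides state, the metric could be sketched as follows. The function names and the dictionary-based inputs are hypothetical.

```python
def average_precision(ranked_papers, confirmed):
    """Average precision of one ranked list against the set of confirmed papers."""
    confirmed = set(confirmed)
    if not confirmed:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen = set()
    for rank, paper in enumerate(ranked_papers, start=1):
        # Count each confirmed paper once, at the first rank where it appears.
        if paper in confirmed and paper not in seen:
            hits += 1
            precision_sum += hits / rank
        seen.add(paper)
    return precision_sum / len(confirmed)


def mean_average_precision(rankings, truth):
    """Mean of the per-author average precision.

    Assumed inputs: `rankings` maps author -> ranked paper list,
    `truth` maps author -> confirmed papers.
    """
    if not rankings:
        return 0.0
    total = sum(
        average_precision(ranked, truth.get(author, ()))
        for author, ranked in rankings.items()
    )
    return total / len(rankings)
```

Under this metric a classifier is rewarded for ranking confirmed papers ahead of rejected ones, so the predicted probabilities matter more than the hard class labels.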

Random Forest

The result is 0.51341; however, I was expecting 0.80217.

Using the same code from GitHub (same parameters)

Random Forest

Result is 0.52469

Parameters of Python Code

RandomForestClassifier(n_estimators=200, criterion='gini', max_depth=None, min_samples_split=15, min_samples_leaf=1, min_density=0.10000000000000001, max_features='auto', bootstrap=True, compute_importances=False, oob_score=False, n_jobs=2, random_state=None, verbose=0)

Decision Tree

Result is 0.47386

DecisionTreeClassifier(criterion='gini', max_depth=None, min_samples_split=15, min_samples_leaf=1, min_density=0.10000000000000001, max_features='auto', compute_importances=False, random_state=1)

Gradient Boosting

Result is 0.53506

GradientBoostingClassifier(loss='deviance', learning_rate=0.00001, n_estimators=250, subsample=0.5, min_samples_split=2, min_samples_leaf=1, max_depth=10, init=None, random_state=None, max_features=None, verbose=0)

K Nearest

Result is 0.48297

KNeighborsClassifier(n_neighbors=50, weights='distance', algorithm='auto', leaf_size=30, p=2)

SVM SVC, Nu-SVC, LinearSVC

Support Vector Machine classifiers (SVC, Nu-SVC and LinearSVC) were tested.

However, training had been running for more than 3 days and the classifiers were still not trained. So I was not able to finish the training and submit results using SVM.

Thank You
