Project IDI PPT

Date post: 23-Feb-2017
Upload: david-i-widjaja
Transcript
Page 1: Project IDI PPT

Project IDI
David I Widjaja

Page 2: Project IDI PPT

Steps

Data Extraction
Tagging
Correlation
Web Scraping
Comparison
Documentation

Page 3: Project IDI PPT

Data Extraction

How to get the data?
  Input from database
  Input manually

Data type: topics, each made up of strings

Page 4: Project IDI PPT

Tagging

Prerequisites:
  Topic Sentences (Subject)
  Dictionary (Tags)

Page 5: Project IDI PPT

Dictionary

How to create tags:

1. Get all topic sentences and split them on white space
2. Convert all words to lower case
3. Delete all numeric and duplicate values
4. Sort the words alphabetically
5. Delete unnecessary words (e.g. is, the, and, etc.)
6. Search for synonyms and cluster them into a single tag
7. Translate words if necessary
8. Insert the tags into the main spreadsheet
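Steps 1 to 5 above can be sketched in Python. This is only an illustration: the slides describe doing the work in a spreadsheet, and the function name `build_tags`, the stop-word list, and the punctuation trimming are assumptions that do not come from the deck. Steps 6 to 8 (synonyms, translation, spreadsheet insertion) are left as manual work.

```python
import re

# Hypothetical stop-word list; the slides only name "is, the, and, etc."
STOP_WORDS = {"is", "the", "and", "a", "an", "of", "to", "in"}

def build_tags(topic_sentences, stop_words=STOP_WORDS):
    """Return an alphabetically sorted list of candidate tags."""
    words = set()
    for sentence in topic_sentences:
        for token in sentence.split():            # step 1: split on white space
            token = token.lower()                 # step 2: lower case
            token = token.strip(".,;:!?()\"'")    # assumption: trim punctuation
            if not token or re.fullmatch(r"[\d.,]+", token):
                continue                          # step 3: drop numeric values
            words.add(token)                      # step 3: a set drops duplicates
    # step 4: sort alphabetically, step 5: delete unnecessary words
    return sorted(words - set(stop_words))

tags = build_tags(["The price of rice in 2016", "Rice and corn prices"])
print(tags)
```

Note that "price" and "prices" remain separate here; merging such variants belongs to the synonym clustering in step 6.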

Page 6: Project IDI PPT

Correlation

A weighted graph map is used:
  The more words associated with a tag, the bigger its bubble.
  Lines get thicker as the number of relationships between topics increases.
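One possible way to compute the two weights is sketched below. The slides do not say how the weights are calculated, so this is an interpretation: bubble size is taken as the number of topic sentences that mention a tag, and line thickness as the number of sentences in which two tags appear together. The name `graph_weights` and the sample topics are hypothetical.

```python
from collections import Counter
from itertools import combinations

def graph_weights(topics, tags):
    """Count tag occurrences (bubble size) and tag co-occurrences (line thickness)."""
    tag_set = set(tags)
    node_weight = Counter()
    edge_weight = Counter()
    for sentence in topics:
        # Unique tags found in this sentence, sorted so each pair has one fixed order
        found = sorted({w for w in sentence.lower().split() if w in tag_set})
        node_weight.update(found)
        edge_weight.update(combinations(found, 2))
    return node_weight, edge_weight

topics = ["rice price rises", "corn price falls", "rice and corn harvest"]
nodes, edges = graph_weights(topics, ["rice", "corn", "price"])
print(nodes)
print(edges)
```

The resulting counts could then be passed to any graph-drawing tool to set node and edge sizes.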

Page 7: Project IDI PPT

Web Scraping

Scrape other, similar websites.
Take the topic sentences that will go in the subject column. Examples:
  Article titles
  Comments
  Etc.
Copy them to the previous spreadsheet (the one with the previous tags).
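The deck performs this step with browser tools, but extracting titles from saved HTML can also be sketched with Python's standard `html.parser` module. The assumption that article titles sit in `<h2>` elements, the class name `TitleExtractor`, and the sample HTML are all hypothetical; a real site needs its own pattern, found for example with Inspect Element.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text of every element with the given tag (assumed: <h2> titles)."""

    def __init__(self, tag="h2"):
        super().__init__()
        self.tag = tag
        self.in_title = False
        self.titles = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.in_title = True
            self._buffer = []

    def handle_data(self, data):
        if self.in_title:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self.in_title:
            # Join the text fragments and collapse runs of white space
            self.titles.append(" ".join("".join(self._buffer).split()))
            self.in_title = False

html = ("<html><body><h2>Rice prices rise</h2><p>text</p>"
        "<h2> Corn <b>harvest</b> falls </h2></body></html>")
parser = TitleExtractor()
parser.feed(html)
parser.close()
print(parser.titles)
```

Each extracted title would then be added to the subject column alongside the earlier topic sentences.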

Page 8: Project IDI PPT

Correlation

Repeat the earlier correlation process to build a weighted graph map from the scraped topics.

Page 9: Project IDI PPT

Comparison

Compare the two weighted graph maps.
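The slides do not specify how the two maps are compared. One simple option, shown here as an assumption, is to compare each tag's share of the total weight in both maps, so that sources of different sizes stay comparable. The function name `compare_weights` and the sample numbers are hypothetical.

```python
def compare_weights(own, scraped):
    """Return {tag: (own share, scraped share, difference)} for tags in either map."""
    own_total = sum(own.values()) or 1          # avoid dividing by zero on empty maps
    scraped_total = sum(scraped.values()) or 1
    result = {}
    for tag in sorted(set(own) | set(scraped)):
        a = own.get(tag, 0) / own_total
        b = scraped.get(tag, 0) / scraped_total
        result[tag] = (a, b, b - a)
    return result

own = {"rice": 3, "corn": 1}                    # hypothetical bubble weights
scraped = {"rice": 1, "corn": 1, "wheat": 2}
report = compare_weights(own, scraped)
for tag, (a, b, diff) in report.items():
    print(f"{tag}: own={a:.2f} scraped={b:.2f} diff={diff:+.2f}")
```

A positive difference marks a tag that is more prominent on the scraped sites; a tag with an own share of zero appears only in the scraped data.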

Page 10: Project IDI PPT

Word Cloud

Generate a word cloud using Python or online tools.
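As a rough sketch of the Python route, the core of a word cloud is a mapping from word frequency to font size. The example below computes only that mapping with the standard library; actual rendering would need a drawing or word-cloud package, which is not shown. The function name `font_sizes` and the size limits are assumptions.

```python
from collections import Counter

def font_sizes(words, min_size=10, max_size=48):
    """Scale each word's frequency linearly into a font size."""
    counts = Counter(words)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1                    # all words equally frequent
    return {
        word: round(min_size + (count - low) / span * (max_size - min_size))
        for word, count in counts.items()
    }

sizes = font_sizes(["rice", "rice", "rice", "corn", "corn", "wheat"])
print(sizes)
```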

Page 11: Project IDI PPT

Tools

Microsoft Excel 2013 (Spreadsheet)
Mozilla Firefox (Browser)
  Inspect Element (Search Patterns)
  DownThemAll (Download HTMLs)
Total Commander (Merge HTMLs)
Notepad++ (Cleanse Data)

