AFTERCOLLEGE SELF-SERVICE SCRAPE CONFIGURATION AND POSTING UTILITY
Kai Hu
Haiyan Wu
March 17, 2009 @ Cowell 416
Midterm Presentation
PRESENTATION OUTLINE
Background and Motivation
Goals
Design
Challenges
Timeline and Milestones
Current Progress
04/18/23 AfterCollege Scrape Utility
AFTERCOLLEGE BACKGROUND
Customized career network for colleges and professional organizations across the country
Goal: create a better way for job-seeking students and alumni to connect with the right employers
WHAT’S ALREADY THERE?
AfterCollege staff manually create configuration files
A simple crawler runs periodically
The crawler's output is posted on AfterCollege’s website
LIMITATIONS
Scalability
Unable to handle POST requests
Unable to handle dynamic websites
Expensive to maintain
Requires technical knowledge
DESIGN OVERVIEW
A new GUI tool guides staff through the configuration process
A web proxy captures user activities
The crawler uses pattern matching based on the new configuration file
GOALS: GUI TOOL
Guide users through the configuration process
Deal with dynamic websites
GOALS: WEB PROXY
Capture user activities
Generate configuration files
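The step from captured activity to a configuration file could look roughly like the sketch below. It is not the project's implementation: the shape of the captured records, the `<scrape>`/`<request>`/`<param>` element names, and the example URLs are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl

# Hypothetical record of what the proxy saw while a staff member browsed
# to a job list. The field names are assumptions made for this sketch.
captured = [
    {"method": "GET", "url": "http://jobs.example.com/search", "body": ""},
    {"method": "POST", "url": "http://jobs.example.com/results",
     "body": "category=engineering&page=1"},
]

def build_config(requests):
    """Turn captured requests into an XML configuration (assumed schema)."""
    root = ET.Element("scrape")
    for step, req in enumerate(requests, start=1):
        node = ET.SubElement(root, "request", step=str(step),
                             method=req["method"], url=req["url"])
        # Form fields are stored one by one so a crawler could replay the POST.
        for name, value in parse_qsl(req["body"]):
            ET.SubElement(node, "param", name=name, value=value)
    return ET.tostring(root, encoding="unicode")

config_xml = build_config(captured)
print(config_xml)
```

Recording the form fields of a POST alongside its URL is what would let a crawler reach job lists that sit behind a search form.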
GOALS: CRAWLER
Scrape job posts
Check result integrity
Crawl Job List page
Get Configuration file
Pattern-Match Application
Generate Job List Result
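The four steps on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the crawler itself: the configuration schema, the regular expressions, and the sample page are invented, and the page is an inline string rather than a real HTTP fetch. The count comparison is one possible reading of "check result integrity".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical configuration file: element and attribute names are assumptions.
CONFIG_XML = """
<scrape>
  <pattern field="title"><![CDATA[<a class="job" href="[^"]*">([^<]+)</a>]]></pattern>
  <pattern field="link"><![CDATA[<a class="job" href="([^"]*)">]]></pattern>
</scrape>
"""

# Stand-in for a crawled job list page (no network access in this sketch).
PAGE_HTML = """
<ul>
  <li><a class="job" href="/jobs/1">Software Engineer</a></li>
  <li><a class="job" href="/jobs/2">QA Analyst</a></li>
</ul>
"""

def load_patterns(config_xml):
    """Read one compiled regular expression per field from the configuration."""
    root = ET.fromstring(config_xml)
    return {p.get("field"): re.compile(p.text.strip())
            for p in root.findall("pattern")}

def match_jobs(html, patterns):
    """Apply every pattern to the page and zip the matches into job records."""
    columns = {field: rx.findall(html) for field, rx in patterns.items()}
    # Simple integrity check: every field must match the same number of times.
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("fields matched a different number of times")
    fields = list(columns)
    return [dict(zip(fields, row)) for row in zip(*(columns[f] for f in fields))]

jobs = match_jobs(PAGE_HTML, load_patterns(CONFIG_XML))
print(jobs)
```

Keeping the patterns in the configuration file rather than in code means a new site needs a new file, not a new crawler.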
DESIGN ISSUES
Firefox Plugin vs. Web Proxy
- Integration with back-end
- Ability to add functionalities
Dojo vs. YUI
- Fade-in/out, drag & drop
- Deals with different browsers
XML vs. JSON
- Simplicity & efficiency on parsing
- Availability of wrapper methods in YUI
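To make the XML vs. JSON parsing criterion concrete, the sketch below round-trips one record through both formats using only the Python standard library. The record and its field names are hypothetical, and the sketch does not indicate which format the project chose.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical configuration entry used only for this comparison.
record = {"field": "title", "selector": "a.job"}

# JSON: one call each way, and the result is already a native dict.
as_json = json.dumps(record)
from_json = json.loads(as_json)

# XML: the same data as attributes on an element, read back via .attrib.
as_xml = ET.tostring(ET.Element("pattern", record), encoding="unicode")
from_xml = dict(ET.fromstring(as_xml).attrib)

print(as_json)
print(as_xml)
```

Both forms carry the same data; the difference is how much navigation code the consumer has to write to get it back out.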
DESIGN OVERVIEW
[Architecture diagram. Labels: Browser, Rendered HTML page, Injected YUI JavaScript, Web Proxy, Apache, HTTP Client, Tomcat Web/App Server, HTML Parser, Job List Sites, Crawler, Loader/Scheduler, Parser, HTTP Client, Config.xml, JobFeed.xml, Feed Generator]
CHALLENGES
Analyzing DOM objects at runtime for websites that use AJAX to generate DOM objects dynamically on the client side
Dealing with tricky JavaScript
Embedded HTML pages
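For the embedded-page case, one plausible first step is to detect that a page wraps its job list in a frame and resolve where the real content lives. The sketch assumes "embedded HTML pages" means `iframe`/`frame` elements; the class name and URLs are hypothetical. It does not address content generated by JavaScript at runtime.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class FrameFinder(HTMLParser):
    """Collect absolute URLs of pages embedded through iframe/frame tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "frame"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page that embeds them.
                self.frames.append(urljoin(self.base_url, src))

page = ('<html><body><h1>Careers</h1>'
        '<iframe src="/ats/listing?id=7"></iframe></body></html>')
finder = FrameFinder("http://jobs.example.com/careers")
finder.feed(page)
print(finder.frames)
```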
MILESTONES
GUI Tool (March 20)
- Workflow support
- Capture job information
Web Proxy (March 20)
- Render HTML pages
- Capture HTTP communications
Web Crawler (April 13)
- Pattern-matching ability given a configuration file
- Integrity check
Integration Test (April 20)
Testing (April 27)
CURRENT FOCUS
Web Proxy
- Ability to deal with JavaScript
- Session/cookie support
GUI Tool
- Embedded web pages
- Allow user modifications
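Session/cookie support in the proxy comes down to remembering what a site sets and sending it back on later requests. A minimal sketch with Python's standard `http.cookies` module follows; the function name and the sample `Set-Cookie` values are hypothetical, and a real proxy would also need to respect domain, path, and expiry rules.

```python
from http.cookies import SimpleCookie

def cookie_header(set_cookie_values):
    """Build a Cookie request header from previously seen Set-Cookie values."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # attributes such as Path are parsed but not replayed
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

# Hypothetical Set-Cookie headers captured from a job site's responses.
seen = ["JSESSIONID=abc123; Path=/", "lang=en; Path=/"]
header = cookie_header(seen)
print(header)
```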
CURRENT PROGRESS
Demo
RESOURCES
Course Instructor
- Dr. Jeff Buckwalter
Sponsor
- Steve Girolami, Perry Lee, & Saan Saeteurn
Source Code Control System
- Dasidae SVN from Perry
Wiki Site
- Knowledge share, work log, resource portal
Google Group
- Discussion and information exchange medium
Questions?
Thank You