Post on 16-Jul-2015
Boehringer Ingelheim Pharma GmbH & Co. KG
Scientific Information Center – S.I.C.
Web Crawling / Internet Research: Emancipation from Public Search
Aleksandar Kapisoda & Klaus Kater (black swan)
Content
1. Intro: Why we need our own web crawler and search engine
2. Focus on competitive technology and startups: Building proprietary SEARCHCORPORA to
• Find new technology, e.g. university spin-offs / licenses (search)
• Monitor activities of known competitors (alerting)
3. Scientific Information Center - Workflow
4. What S.I.C. Can Now Offer to the Customers
• Targeted SEARCHCORPORA
• Automatic alerting
5. Outlook: What we want to achieve in the next steps
• Ontology mapping
The Sea of Information
• Personal Web Observation (Browser with Google)
• News Feeds (RSS, Email-Alerts, Newsletters)
• Social Media
• Internet of Things (Patient Health Sensor Data)
• Internal Information (Corporate Databases, Intranet)
Image sources: http://www.teleskop-austria.at/information/bino-coin-tl/Coin100-1.jpg, http://www.easymarmaris.com/uploaded_tour_files/1397573325jet_Ski_7.jpg
Our Lack of Information
Personal Web Observation
Social Media
What we actually find using public search (Google)
All other information is Deep Web information that cannot be searched with Public Search.
The Lack of Information
Diagram labels: WWW; Array of Googlebots; Google repository; Google Rating Magic; Google Ads; Surf behavior; User profile; google .com, .de, …; max 1000 results
Public search does not allow access to Deep Web information, and also:
• Number of results artificially limited
• Search hit filter logic is not revealed
• Single document content index
Focus on Competitive Technology and Startups: Building Proprietary SEARCHCORPORA
Find new technology, e.g. university spin-offs / licenses (PULL)
• Provide custom SEARCHCORPUS
• Start from technology transfer organizations / universities (spin-offs in 1st step)
1. Crawl information about spin-off companies (address, website)
2. Extract technology categories
3. Crawl and index websites
4. Build SEARCHCORPUS
• Customize SEARCHCORPUS Viewer 1)
• Publish SEARCHCORPUS Viewer in corporate intranet
1) In addition to common search queries we support fuzzy search, proximity search and phrases.
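The slides do not say how the Viewer implements these query types. As a rough illustration of what fuzzy, proximity and phrase matching mean, here is a minimal sketch using only the Python standard library. The function names, the 0.8 similarity cutoff and the sample sentence are assumptions made for this example, not part of the S.I.C. system.

```python
import difflib
import re


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fuzzy_match(term, text, cutoff=0.8):
    """True if some token in the text is similar to the term (tolerates typos)."""
    candidates = difflib.get_close_matches(term.lower(), tokenize(text), n=1, cutoff=cutoff)
    return bool(candidates)


def proximity_match(term_a, term_b, text, max_distance=3):
    """True if both terms occur within max_distance tokens of each other."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)


def phrase_match(phrase, text):
    """True if the tokens of the phrase occur consecutively in the text."""
    words, tokens = tokenize(phrase), tokenize(text)
    return any(tokens[i:i + len(words)] == words
               for i in range(len(tokens) - len(words) + 1))


# Made-up sample document for illustration only.
doc = "University spin-off licenses novel antibody technology to a biotech startup"
print(fuzzy_match("antibodie", doc))                        # misspelled query term
print(proximity_match("antibody", "startup", doc, max_distance=5))
print(phrase_match("antibody technology", doc))
```

A production search engine would evaluate such queries against an index rather than scanning each text, but the matching semantics are the same.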
Side Note: Annotating target documents with topic specific content to build searchable contexts
Surface Web
Deep Web
Corporate Resources
We can find documents using search terms that appear in the context but not necessarily in the document’s content.
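One way to picture this is an index that covers both a document's own content and the context it was annotated with. The sketch below is an assumption-laden illustration: the field names `content` and `context`, the sample documents and the AND semantics of `search` are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents):
    """Index each document under the tokens of its content AND its context."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in tokenize(doc["content"]) | tokenize(doc["context"]):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that match all query terms."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


# Hypothetical documents: "context" holds the topic-specific annotation.
documents = {
    "doc1": {"content": "We develop microfluidic chips for cell sorting.",
             "context": "university spin-off, diagnostics, technology transfer"},
    "doc2": {"content": "Quarterly report on cell sorting instruments.",
             "context": "public company, investor relations"},
}
index = build_index(documents)
print(search(index, "diagnostics cell"))
```

Here "diagnostics" never appears in the content of `doc1`, yet the query still finds it because the term is part of its annotated context.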
Focus on Competitive Technology and Startups: Building Proprietary SEARCHCORPORA - PULL
Find new technology, e.g. university spin-offs / licenses (PULL)
Seed URL: http://www.example_url.com
crawl → names and data of targets
extract (expressions to scrape data from pages of published targets)
crawl → Target SEARCHCORPUS
SEARCHCORPUS Viewer
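The crawl, extract, crawl sequence of the PULL chain can be sketched as follows. This is only an illustration under stated assumptions: the listing page, the `.example` URLs, the regular expression and the `FAKE_WEB` stand-in for a real crawler are all hypothetical, and the slides do not reveal which expression language the actual crawler uses.

```python
import re

# Hypothetical listing page of a technology transfer office (inline so the sketch runs offline).
LISTING_HTML = """
<ul>
  <li class="spinoff"><a href="http://www.spinoff-a.example">Spinoff A GmbH</a> <span>Diagnostics</span></li>
  <li class="spinoff"><a href="http://www.spinoff-b.example">Spinoff B AG</a> <span>Drug Delivery</span></li>
</ul>
"""

# Stand-in for a real crawler: maps a URL to the text of the page behind it.
FAKE_WEB = {
    "http://www.spinoff-a.example": "Point-of-care diagnostics based on microfluidics.",
    "http://www.spinoff-b.example": "Nanoparticle carriers for targeted drug delivery.",
}

# Assumed scraping expression: one match per target, capturing website, name and category.
TARGET_PATTERN = re.compile(
    r'<li class="spinoff"><a href="(?P<website>[^"]+)">(?P<name>[^<]+)</a>\s*'
    r'<span>(?P<category>[^<]+)</span>'
)


def extract_targets(html):
    """Step 1 and 2: scrape names, websites and technology categories of targets."""
    return [m.groupdict() for m in TARGET_PATTERN.finditer(html)]


def build_corpus(targets, fetch):
    """Step 3 and 4: fetch each target website and store its text with the target data."""
    return [{**target, "text": fetch(target["website"])} for target in targets]


targets = extract_targets(LISTING_HTML)
corpus = build_corpus(targets, FAKE_WEB.get)
for entry in corpus:
    print(entry["name"], "|", entry["category"], "|", entry["text"])
```

Regular expressions are brittle against real-world HTML; an HTML parser with selectors is usually the more robust choice for the extraction step.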
Focus on Competitive Technology and Startups: Building Proprietary SEARCHCORPORA - PUSH
Monitor activities of known competitors (PUSH)
• Weekly alerts
• Currently concentrating on public companies (3 different websites as sources)
1. Crawl and extract ticker symbols (>15,000 public companies)
2. Crawl and scrape company information (address, website, industry, sector)
3. Crawl and index company news
• For each topic of interest, we create targets as search queries 1)
e.g. “oncology AND acquisition” to find out who acquired oncology companies
• Alerts are automatically sent by email
1) In addition to common search queries we support fuzzy search, proximity search and phrases.
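To make the alerting idea concrete, the sketch below evaluates a simple "a AND b" query against a handful of news items and builds an email for the hits. It is a minimal illustration, not the S.I.C. implementation: the news items, the addresses and the helper names `matches` and `build_alert` are invented for this example, and only plain AND queries are handled.

```python
import re
from email.message import EmailMessage


def tokenize(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query, text):
    """Evaluate a query of the form 'a AND b AND c' against a text."""
    terms = [t.strip().lower() for t in query.split(" AND ")]
    tokens = tokenize(text)
    return all(term in tokens for term in terms)


def build_alert(recipient, query, news_items):
    """Return an email listing all matching news items, or None if nothing matched."""
    hits = [n for n in news_items if matches(query, n["headline"] + " " + n["body"])]
    if not hits:
        return None
    msg = EmailMessage()
    msg["To"] = recipient
    msg["From"] = "alerts@sic.example"  # hypothetical sender address
    msg["Subject"] = f"Weekly alert: {query} ({len(hits)} hit(s))"
    msg.set_content("\n".join(f"- {n['company']}: {n['headline']}" for n in hits))
    return msg


# Made-up news items for illustration only.
news = [
    {"company": "Company A",
     "headline": "Company A announces acquisition of oncology startup",
     "body": "The deal is expected to close next quarter."},
    {"company": "Company B",
     "headline": "Company B opens new oncology research site",
     "body": "The site will employ 200 people."},
]
alert = build_alert("researcher@customer.example", "oncology AND acquisition", news)
print(alert)
```

In a scheduled job the resulting message could be handed to `smtplib.SMTP.send_message`; sending is left out here so the sketch runs without a mail server.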
Focus on Competitive Technology and Startups: Building Proprietary SEARCHCORPORA - PUSH
Monitor activities of known competitors (PUSH)
Seed URLs: http://www.example1.com…, http://www.finance.example.com
crawl → extract (expressions to extract stock market ticker symbols) → crawl → Company data, industry, sector, description, …
newspage.com seed URLs → crawl → newspage.com company news pages → crawl → newspage.com company news → link → Company news corpus
User profile → match → Matching news → send alerts → Email alert
Schedules: on a monthly schedule; on a weekly schedule
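The ticker extraction and the "link" step of this chain might look roughly like the sketch below. Everything concrete in it is assumed for illustration: the listing format, the ticker symbols, the company names and the pattern are hypothetical, since the slides do not show the real source pages or expressions.

```python
import re

# Hypothetical plain-text listing of public companies (ticker, name, sector).
LISTING_TEXT = """
ABCD  Example Pharma Inc.      Healthcare
WXYZ  Sample Biotech Corp.     Healthcare
"""

# Assumed expression: columns are separated by two or more spaces or tabs.
TICKER_PATTERN = re.compile(
    r"^(?P<ticker>[A-Z]{1,5})[ \t]{2,}(?P<name>.+?)[ \t]{2,}(?P<sector>\S.*)$",
    re.MULTILINE,
)


def extract_companies(text):
    """Return company records keyed by ticker symbol."""
    return {m["ticker"]: m.groupdict() for m in TICKER_PATTERN.finditer(text)}


def link_news(news_items, companies):
    """Attach company data to each news item whose ticker is known."""
    return [{**item, **companies[item["ticker"]]}
            for item in news_items if item["ticker"] in companies]


companies = extract_companies(LISTING_TEXT)
news_items = [
    {"ticker": "ABCD", "headline": "Example Pharma reports trial results"},
    {"ticker": "QQQQ", "headline": "Unknown company in the news"},
]
news_corpus = link_news(news_items, companies)
for entry in news_corpus:
    print(entry["ticker"], "|", entry["name"], "|", entry["sector"], "|", entry["headline"])
```

News items without a known ticker are dropped in this sketch; a real chain might instead queue them for a later company-data crawl.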
Scientific Information Center Workflow
Implementing a Business Process to Offer SEARCH as a Service
Scientific Information Center Workflow
Project Inquiry
Specify Scope
Setup Chains
Review
Research Department
Customer
Crawler
Crawling
Analyzing
Daily use
Scheduled Updates
possibly in iterations
Information Scientist
Information Scientist Engineer
Scientific Information Center Workflow
Viewer
…
SEARCHCORPUS Designer
Scheduler / Engine
Container
Tools
Report / XLS
…
Research Department
Scientific Information Center Workflow
Chart: Cost and Pay off over time (t)
Phases: Development, Test, Actual Usage, Ongoing Optimization (no predetermined end of life time)
The value of a SEARCHCORPUS increases over time.
What S.I.C. Can Now Offer to the Customers
Automatic alerting (Push)
• Project
• Alert Profile (Search Terms)
• Scheduled Updates
• Scheduled Alerts
• Email Client
Targeted SEARCHCORPORA (Pull)
• Project
• SEARCH Profile (Targets)
• Scheduled Updates
• Faceted SEARCH
• SEARCHCORPUS Viewer (blended into BI Intranet Solution)
Crawler: SIC Crawler
Outlook: What We Want to Achieve in the Next Steps
Technology
User Perspective
GUI for defining Alert Profiles
• Broader project scopes
• Larger SEARCHCORPORA
• More sources
Ontology Mapping
• Map SEARCHCORPUS entries to Ontologies
• Faceting over Ontologies
• Ontology Management: Import and build ontologies
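Since ontology mapping is presented as a future step, the slides give no design for it. Purely as an illustration of the idea, the sketch below maps corpus entries to concepts of a tiny hand-made ontology and counts facets per concept and per parent concept. The ontology, its synonyms and the sample entries are hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-ontology: concept -> parent concept and synonyms.
ONTOLOGY = {
    "oncology": {"parent": "therapeutic area", "synonyms": {"oncology", "cancer", "tumor"}},
    "diabetes": {"parent": "therapeutic area", "synonyms": {"diabetes", "insulin"}},
    "antibody": {"parent": "technology", "synonyms": {"antibody", "antibodies", "mab"}},
}


def map_to_concepts(text):
    """Return all ontology concepts with at least one synonym in the text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return {concept for concept, node in ONTOLOGY.items() if tokens & node["synonyms"]}


def facet_counts(corpus):
    """Count corpus entries per concept and per parent concept."""
    by_concept, by_parent = Counter(), Counter()
    for entry in corpus:
        concepts = map_to_concepts(entry["text"])
        by_concept.update(concepts)
        by_parent.update({ONTOLOGY[c]["parent"] for c in concepts})
    return by_concept, by_parent


# Made-up corpus entries for illustration only.
corpus = [
    {"name": "Entry 1", "text": "Novel antibodies against tumor antigens"},
    {"name": "Entry 2", "text": "Insulin pump with smart dosing"},
    {"name": "Entry 3", "text": "Cancer screening service"},
]
by_concept, by_parent = facet_counts(corpus)
print(dict(by_concept))
print(dict(by_parent))
```

Faceting over the parent level lets a user narrow a SEARCHCORPUS by broad category first and then drill down to individual concepts.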