Search engine and services
Course: Location Aware Machine Intelligence
Presented by: Celestine Mkama Kalendero
25.02.2014
Outline
1. Search Engine Results Ranking based on Location
2. Review of Personalized Mobile Search Engine
3. Extraction of Address Data from Unstructured Text
Search Engine Results Ranking based on Location
Carolyn Watters and Ghada Amoudi
Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
E-mail: [email protected]
Year: 2003
Result Ranking in Search Engines
(as of the year 2002)
Search engines build their indexes based on
a) Keyword occurrence
b) Frequency of query negotiation
Pros
+ Robust, fast
Cons
- Users must sort through pages of results when queries relate to physical distance and location
- 44% of users are frustrated by search engines (Realname, 2000)
Geosearcher
- Location-based ranking system
- Translates the search reference point into coordinates (Long, Lat)
- Ranks search results in ascending order of distance
Geosearcher architecture
Geosearcher architecture-Query
- Presented by end-system users, e.g. skiing resort, District of Columbia
- Query: skiing resort
- Reference point: District of Columbia
- Sample of random URLs available (used for evaluation)
Geosearcher architecture-Geocoding
Process of assigning latitude and longitude coordinates to the host for each site.
- Preliminary work (performed by researchers)
a) Determine location
b) Create lookup table
Geosearcher architecture-Geocoding
a) Determining location
- From host URLs: DNS, country codes, Whois database
- Map location into coordinates, e.g. use the Getty Thesaurus (GS) to map a location into coordinates
  + Contains state and area code for US, Canada
  + Other countries
b) Lookup table
- Country codes with coordinates
www.about.com
www.dartmouth.ca
mathresource.com
Lookup Table
Country Code   State Code   Area Code   Coordinates (Lat, Long)
US             AL           256         34.9200, 87.2703
US             CA           530         38.8951, 77.0367
CA             NS           902         45.0000, 63.0000
FI             Helsinki                 60.1708, 24.9375
NO             Oslo                     59.9500, 10.7500
Example: Location Information
Getty thesaurus
Whois Database
Geosearcher architecture-Geocoding
The Process
a) Check for coordinates in the host table
b) If not found, send the domain to Whois
- On a match, return the country code (CC) and area code
- If the CC is ca or us and an area code is present, look it up in the lookup table to get the state or province name
c) If not found, strip the domain down by one level (i.e. data.about.com to about.com)
d) Unmatched names are checked in IPtoLL (host to latitude/longitude conversion)
- IPtoLL uses the administrative contact
Store the results in the host table
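The lookup cascade above can be sketched as follows. This is a minimal illustration, not the Geosearcher implementation: the WHOIS dictionary stands in for a real Whois query, the ip_to_ll argument stands in for IPtoLL, and the function names are hypothetical. The table values are taken from the slides.

```python
# Hypothetical in-memory stand-ins; a real system would query Whois and IPtoLL.
HOST_TABLE = {"www.dcski.com": (38.8951, 77.0367)}
WHOIS = {"about.com": ("us", "256")}            # domain -> (country code, area code), assumed record
LOOKUP = {("us", "256"): (34.9200, 87.2703)}    # (country code, area code) -> coordinates


def strip_level(domain):
    """Drop the leftmost label, e.g. data.about.com -> about.com; None if nothing is left to strip."""
    parts = domain.split(".")
    return ".".join(parts[1:]) if len(parts) > 2 else None


def geocode(host, ip_to_ll=lambda h: None):
    """Return (lat, long) for a host, or None if every step fails."""
    if host in HOST_TABLE:                       # a) host table
        return HOST_TABLE[host]
    coords = None
    domain = host
    while domain is not None and coords is None:
        record = WHOIS.get(domain)               # b) Whois: country code and area code
        if record is not None:
            coords = LOOKUP.get(record)
        if coords is None:
            domain = strip_level(domain)         # c) strip one level and retry
    if coords is None:
        coords = ip_to_ll(host)                  # d) fallback conversion
    if coords is not None:
        HOST_TABLE[host] = coords                # store the result in the host table
    return coords
```

Calling geocode("data.about.com") misses the host table, fails on the full name, succeeds on about.com, and caches the coordinates for the original host.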
Host Table
Host                 Coordinates (Lat, Long)
www.skibluemt.com    34.9200, 87.2703
www.dcski.com        38.8951, 77.0367
Distance and Ranking
- For ranking, the distance of each URL in the host table from the reference location is calculated using the haversine distance
- Distances are stored in the session host table
- Results are ranked by distance (insertion sort)
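A minimal sketch of this step, assuming a mean Earth radius of 6371 km and coordinates in decimal degrees. The function names are illustrative, and the host coordinates are copied from the host table as shown on the slides (unsigned).

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rank_by_distance(host_table, ref_lat, ref_lon):
    """Insertion sort of (host, distance) pairs, ascending by distance."""
    ranked = []
    for host, (lat, lon) in host_table.items():
        entry = (host, haversine_km(ref_lat, ref_lon, lat, lon))
        i = len(ranked)
        # Walk left past every entry that is farther away than the new one.
        while i > 0 and ranked[i - 1][1] > entry[1]:
            i -= 1
        ranked.insert(i, entry)
    return ranked


host_table = {
    "www.skibluemt.com": (34.9200, 87.2703),
    "www.dcski.com": (38.8951, 77.0367),
}
ranked = rank_by_distance(host_table, 38.8951, 77.0367)
```

With the reference point set to the coordinates of www.dcski.com, that host has distance zero and is ranked first.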
Results
Unranked results: AltaVista
Using Geosearcher
Results (contd.)
Validation of accuracy
- Examined 100 results manually for location information
- 90 websites assigned correctly
- 78% of 83 URLs were accurately identified
Results (contd.)
Algorithm effectiveness
- Tested with 10 sets of 100 URLs using the Yahoo Random Link generator
Personalized Mobile Search Engine Using Location and Content Concepts
Namrata G Kharate ME-Computer-II
MCOERC, Nasik-India
Prof. S. A. Bhavsar
Assistant Prof., Computer Dept.
MCOERC, Nasik-India
Publication: November, 2013
Search - Mobile Devices
- Search queries on mobile devices: shorter, ambiguous
- Search results: less accurate
Solution
- We need a system that captures user preferences to return a personalized result ranking
- Personalized Mobile Search Engine (PMSE)
PMSE- System Architecture
RSVM - Ranking Support Vector Machine
PMSE
Client
- Receive user requests
- Store clickthrough data (location, content)
- Submit request to server
- Display results
- Profile preferences in an ontology-based user profile
Server
- Forward request to commercial search engine
- RSVM training
- Search result reranking
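The reranking step on the server can be illustrated with a small sketch. It assumes a linear ranking SVM, whose training output is a weight vector, so that each result is scored by the dot product of its features with the weights. The feature names (content, location) and the numbers are hypothetical, and the training itself is not shown.

```python
def rerank(results, weights):
    """Sort (url, features) pairs by the linear score w . x, highest first."""
    def score(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sorted(results, key=lambda item: score(item[1]), reverse=True)


# Hypothetical feature values: how well each result matches the user's
# content and location preferences stored in the profile.
results = [
    ("a.example", {"content": 0.2, "location": 0.1}),
    ("b.example", {"content": 0.1, "location": 0.9}),
]
weights = {"content": 1.0, "location": 2.0}  # assumed output of RSVM training
reranked = rerank(results, weights)
```

Here b.example moves to the top because the assumed weights favour the location feature.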
Extraction of Address Data from Unstructured Text using Free Knowledge Resources
Sebastian [email protected]
Simon [email protected]
Publication: November, 2013
Ralf [email protected]
Christoph [email protected]
Multimedia Communications Lab
Technische Universität Darmstadt, Germany
Extraction of Address Data
Is of interest in various domains
o Location-based services
o Address repositories (automatically created)
- Automatic harvesting of web addresses is not possible
Solution
- Identify business address data with a hybrid approach
- Combine patterns and gazetteers
Address Structure-Germany
- Company name: no special pattern
- Street: varies, e.g. Burgermeister-Jung, Bgm.-Jung
- Street number: digit sequence, e.g. 45a, 45-47
- Postal code: exactly 5 digits, reserved
- Cities: Frankfurt, Ffm, Frankfurt/Main
Address Data Identification - Workflow
Address Data Identification - Preprocessing
- Strip HTML markup, e.g. using the Beautiful Soup library
- Cleaning: removing non-Unicode characters and white space between numbers
- Line splitting and tokenizing: using the Apache OpenNLP toolkit
- Part-of-speech tagging: using TreeTagger
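A rough sketch of the preprocessing steps, using only the Python standard library in place of the tools named above: html.parser instead of Beautiful Soup, and whitespace splitting instead of Apache OpenNLP. Part-of-speech tagging is omitted. Treating "non-Unicode characters" as non-printable characters is an assumption, as are all names in the block.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and starts a new line at common block-level tags."""
    BLOCK_TAGS = {"br", "p", "div", "li", "tr"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def preprocess(html):
    """Return a list of token lists, one per non-empty text line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Cleaning: keep newlines and printable characters only (assumed interpretation).
    text = "".join(ch for ch in text if ch == "\n" or ch.isprintable())
    # Cleaning: remove spaces between digits, e.g. "64 283" -> "64283".
    text = re.sub(r"(?<=\d) +(?=\d)", "", text)
    # Line splitting and whitespace tokenizing.
    lines = (line.strip() for line in text.split("\n"))
    return [line.split() for line in lines if line]


tokens = preprocess("<p>Example GmbH<br>Musterstr. 45a<br>64 283 Darmstadt</p>")
```

Note that joining digits this way would also merge a street number and a postal code written on the same line, so a real pipeline needs a more careful rule.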
Address Data Identification - Line Splitting and Tokenizing, using the Apache OpenNLP toolkit
Address Data Identification
1. Postal codes
- Token regular expression [0-9]{5}
2. Cities
- Generated list based on OpenStreetMap, accessed via the Overpass API (28,087 entries)
o Known city found in the list
o Preceded directly by postal code
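The two rules can be sketched together as follows. The regular expression is the one from the slide; the three-entry city set is a hypothetical stand-in for the OpenStreetMap list, and requiring both city conditions at once is an assumption.

```python
import re

POSTAL_CODE = re.compile(r"[0-9]{5}")
KNOWN_CITIES = {"Darmstadt", "Frankfurt", "Ffm"}  # hypothetical stand-in for the 28,087-entry list


def find_postal_city(tokens):
    """Return (postal code, city) pairs where a known city directly follows a 5-digit token."""
    pairs = []
    for prev, cur in zip(tokens, tokens[1:]):
        if POSTAL_CODE.fullmatch(prev) and cur in KNOWN_CITIES:
            pairs.append((prev, cur))
    return pairs
```

For example, find_postal_city(["64283", "Darmstadt"]) yields one pair, while a four-digit token or a city missing from the list yields none.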
Address Data Identification
3. Street numbers
- Use the regular expression ([0-9]{1,3})([a-zA-Z][0-9]?)?(([+|-])([0-9]{1,3})([a-zA-Z][0-9]?)?)?
4. Street names
- Generated list based on OpenStreetMap, accessed via the Overpass API (300,000 entries)
o Use street name endings, e.g. str
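A short sketch of both checks. The street number pattern is copied from the slide and applied to whole tokens. The list of endings is an assumption that extends the single example "str", and the known_streets argument is a hypothetical stand-in for the OpenStreetMap list.

```python
import re

STREET_NUMBER = re.compile(
    r"([0-9]{1,3})([a-zA-Z][0-9]?)?(([+|-])([0-9]{1,3})([a-zA-Z][0-9]?)?)?"
)
# Assumed endings; the slide only names "str" as an example.
STREET_ENDINGS = ("str", "str.", "strasse", "straße")


def is_street_number(token):
    """True if the whole token matches the street number pattern, e.g. 45a or 45-47."""
    return STREET_NUMBER.fullmatch(token) is not None


def looks_like_street(token, known_streets=frozenset()):
    """True if the token is a listed street name or ends in a typical street ending."""
    return token in known_streets or token.lower().endswith(STREET_ENDINGS)
```

Tokens such as 45a and 45-47 pass the number check, while 1234 fails because the pattern allows at most three leading digits.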
Address Data Identification
5. Company name
- Search for identical terms (Wikipedia): 29 terms, e.g. GmbH (private), AG (public)
- Exploit standard address structure
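The term search can be sketched as below. Only the two terms named on the slide are included; the full 29-term list is not reproduced. The function name is hypothetical, and the use of the standard address structure (for instance the position of the company line relative to the street line) is not modelled.

```python
# Two of the terms named on the slide; the remaining terms are not listed here.
LEGAL_FORMS = {"GmbH", "AG"}


def find_company_name(lines):
    """Return the first line that contains a legal-form token, or None."""
    for line in lines:
        if any(token.strip(",.") in LEGAL_FORMS for token in line.split()):
            return line
    return None
```

Given the lines "Impressum", "Example GmbH" and "Musterstr. 45a", the function returns "Example GmbH".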
Evaluation & Methodology
- Sites with a legal note (1,576 websites)
- Fraction of full addresses identified correctly
- Rcorrect address: 0.946, Rcompany: 0.82
[Figure: precision and recall (axis from 0.5 to 1) for complete address w/o company name, complete address with company name, company name, street, city]
Conclusion
Search engine ranking
- Evaluation: the algorithm was accurate and effective
- Efficiency: impacted by reliance on external databases
Recommendation
- Have a database of special resources: increases efficiency
- Adaptation to other languages: address extraction
Thank You!
(Q&A)