WebWatch
Ian Peacock
UKOLN
University of Bath
Bath BA2 7AY
UK
+44 1225 323570
Email: [email protected]
WebWatching the UK: Robot software for analysing UK web resources
UKOLN is funded by the British Library Research and Innovation Centre, the Joint Information Systems Committee of the Higher Education Funding Councils, as well as by project funding from the JISC’s Electronic Libraries Programme and the European Union.
UKOLN also receives support from the University of Bath where it is based.
Robot software
• WebWatch.
• WebWatch experiences.
• General robot issues.
• The need for robots.
• Bad press.
• Awareness.
The WebWatch project
• A one-year post funded by RIC.
• “…to develop a set of tools to audit and monitor design practice and use of technologies on the web…”
• Communities. UK web communities.
• Information to benefit institutions/communities.
The WebWatch project
Information on the project can be found at <URL:http://www.ukoln.ac.uk/web-focus/webwatch/>.
WebWatch aims
• Evaluation of robot technologies.
• Making recommendations on appropriate technologies.
• Working within UK web communities.
• Analysis of the results of web crawling, and liaising with various communities in interpreting the results.
WebWatch aims
• Working with the web robot community.
• Analysing other related resources, such as web logs.
WebWatch robot
• Experimentation.
• Harvest.
• Perl based robot.
WebWatch analyses
• Production of a report.
• SOIF records.
• CSV.
• Excel, SPSS,…
• Current developments.
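The export step above (per-page records out to CSV for Excel or SPSS) can be sketched as follows. The field names and sample records are hypothetical, not WebWatch's actual schema:

```python
import csv
import io

# Hypothetical per-page summary rows, as a robot might emit after a crawl.
records = [
    {"url": "http://www.ukoln.ac.uk/", "type": "HTML", "p-count": 3},
    {"url": "http://www.ukoln.ac.uk/services/", "type": "HTML", "p-count": 7},
]

def records_to_csv(records):
    """Flatten per-page records into CSV text for import into Excel or SPSS."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "type", "p-count"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```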
WebWatch benefits
Benefits
• Communities.
• Web managers and designers.
• Knowledge base.
WebWatch robot
• History
– Harvest
– Experiences with Perl
– ?
• Features
• Future plans
WebWatch robot
Examples of robot output:
Type{4}: HTML
Type-recognition by{4}: MIME
Linked from{23}: http://www.ukoln.ac.uk/
Context{4}: Link
Element-referrer{5}: LINKS
HTML element information:
p-count{1}: 3
a-21-attrib{55}: href=http://www.ukoln.ac.uk/services/elib/papers/other/
img-9-attrib{110}: width=87|src=http://www.ukoln.ac.uk/resources/images/ukoln-logo/logo|height=101|alt=UKOLN|align=right|border=0
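The attribute lines above follow a SOIF-style "name{length}: value" layout. A minimal parser sketch, assuming one attribute per line (in full SOIF the {n} is a byte count and values may span lines):

```python
import re

# Sample attribute lines in the style of the robot output shown above.
SAMPLE = """Type{4}: HTML
Type-recognition by{4}: MIME
p-count{1}: 3"""

# Matches: attribute name, declared value length in braces, then the value.
ATTR_RE = re.compile(r"^(?P<name>[^{]+)\{(?P<size>\d+)\}:\s?(?P<value>.*)$")

def parse_soif_attributes(text):
    """Parse SOIF-style 'name{length}: value' lines into a dict of strings."""
    attrs = {}
    for line in text.splitlines():
        m = ATTR_RE.match(line)
        if m:
            attrs[m.group("name")] = m.group("value")
    return attrs
```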
Robot issues
• Definition of a (web) robot.
• The need for robots
Robot issues
The need for robots?
• Web expansion and increasing non-linearity.
• Understanding the nature of the web to help solve problems.
• Maintenance.
• Construction of index-space.
• Navigable document-space.
Increasing non-linearity
[Diagram: hyperlinks between URL A, URL B, URL C and URL D]
Benefits of robots
• End-user satisfaction.
• Reduced network traffic in document space.
• Populating caches, archiving, mirroring.
• Monitoring changes relevant to users.
• ‘Schooling’ network traffic into localised neighbourhoods.
Benefits of robots
• A user view (as opposed to a file-system view).
• Non-fatiguing.
• Next generation.
• Do these properties offer a feasible solution to web problems?
Robot design
• Is it necessary?
• Traversal algorithm (depth vs breadth first).
• Black holes and correct implementations (e.g. redirects).
• Bounds on activity.
• Multiple requests.
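The traversal choice (depth- vs breadth-first) and bounds on activity can be sketched over an in-memory link graph; the graph, depth limit and page limit below are purely illustrative:

```python
from collections import deque

def crawl_order(start, links, max_depth=2, max_pages=100, breadth_first=True):
    """Return the visit order over a link graph, bounded by depth and page count.
    `links` maps URL -> outgoing URLs (a stand-in for fetching and parsing)."""
    frontier = deque([(start, 0)])
    seen = {start}
    order = []
    while frontier and len(order) < max_pages:
        # popleft gives breadth-first (a queue); pop gives depth-first (a stack).
        url, depth = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        if depth >= max_depth:
            continue  # bound on activity: do not follow links any deeper
        for nxt in links.get(url, []):
            if nxt not in seen:  # avoid multiple requests for the same URL
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return order

# Tiny illustrative link graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
```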
Example of a ‘black-hole’
Client requests:
http://www.foo.bar/generate_report?date=02021998&time=1250
Server returns a document containing this link:
<A HREF="http://www.foo.bar/generate_report?date=02021998&time=old_time+5">
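One simple guard against such black holes: a script that keeps generating fresh links usually varies only its query string, so count visits per scheme/host/path and refuse further requests past a limit. The threshold here is an assumed tuning value:

```python
from urllib.parse import urlsplit

def is_black_hole(url, path_counts, limit=10):
    """Heuristic black-hole guard: count requests per (scheme, host, path),
    ignoring the ever-changing query string, and flag once `limit` is exceeded."""
    parts = urlsplit(url)
    key = (parts.scheme, parts.netloc, parts.path)
    path_counts[key] = path_counts.get(key, 0) + 1
    return path_counts[key] > limit
```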
Robot design (continued)
• Caching directives
Ethical robots
• Reuse of robot code.
• Appropriate identification.
• Thorough testing (locally!).
• Speed/frequency bounding.
• Selective retrieval.
• Performance monitoring.
• Dissemination of results.
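Speed/frequency bounding from the list above can be sketched as a per-host throttle. The interval value and class name are illustrative, not WebWatch's implementation:

```python
class RequestThrottle:
    """Speed bounding: enforce a minimum interval between hits on any one host."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = {}  # host -> time of the (scheduled) last request

    def delay_for(self, host, now):
        """Seconds the robot should wait before requesting from `host` at `now`."""
        last = self._last.get(host)
        wait = 0.0 if last is None else max(0.0, self.min_interval - (now - last))
        self._last[host] = now + wait  # record when the request will actually go out
        return wait
```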
Ethical web crawling
• Advantages vs disadvantages.
• Guidelines.
Robot Exclusion
Robot exclusion refers to the means available to users and server administrators to control robot navigation through a particular server.
Advantages and disadvantages.
There are currently two kinds of Robot Exclusion Protocol (REP).
Robot exclusion protocols
• Server-wide method (/robots.txt)
– Directives for the whole server must be placed in the top-level /robots.txt file.
• META element method (per page)
– Directives are inserted per page with the META element. Directives allow for indexing (or not) and parsing for links (or not).
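The server-wide method can be exercised with Python's standard urllib.robotparser; the robots.txt content and agent names below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical /robots.txt: one entry for a named robot, a catch-all for the rest.
ROBOTS_TXT = """\
User-agent: WebWatch
Disallow: /cgi-bin/
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

rp.can_fetch("WebWatch", "http://www.ukoln.ac.uk/services/")       # allowed
rp.can_fetch("WebWatch", "http://www.ukoln.ac.uk/cgi-bin/report")  # disallowed
rp.can_fetch("OtherBot", "http://www.ukoln.ac.uk/services/")       # disallowed
```

The per-page method is a tag such as `<META NAME="robots" CONTENT="noindex,nofollow">`, which the robot must honour itself after parsing the page.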
Other methods of robot control
• Blocking at the server configuration level (e.g. Apache’s allow from, deny from).
• Blocking at the TCP level (TCP wrappers?)
• Page design?
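Blocking at the server configuration level might look like this Apache 1.x-era fragment; the paths and hostname are illustrative, and the exact directives should be checked against the server's own documentation:

```apache
# Allow everyone, then deny a specific misbehaving robot's host.
<Directory /var/www/htdocs>
    order allow,deny
    allow from all
    deny from badrobot.example.com
</Directory>
```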
Network performance
• Bandwidth issues.
• Comparison with a human user.
• Bottlenecks.
• New developments in robots: good or bad? Decentralisation.
Server concerns
• Rapid fire requests (TCP, HTTP).
• Skewing of server logs.
• Identification of robots.
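Identification of robots in server logs (and filtering them out before analysis, to avoid skewed statistics) often relies on the User-Agent field or on requests for /robots.txt. A heuristic sketch, where the hint list and log line are assumptions:

```python
import re

# Illustrative combined-log-format line.
LOG_LINE = ('127.0.0.1 - - [02/Feb/1998:12:50:00 +0000] '
            '"GET /robots.txt HTTP/1.0" 200 123 "-" "WebWatch/1.0"')

# Assumed substrings that suggest a robot's User-Agent.
ROBOT_HINTS = ("robot", "crawler", "spider", "webwatch")

def looks_like_robot(log_line):
    """Flag a combined-log-format line as robot traffic if it fetched
    /robots.txt or its User-Agent (last quoted field) matches a known hint."""
    if '"GET /robots.txt' in log_line:
        return True
    quoted = re.findall(r'"([^"]*)"', log_line)
    agent = quoted[-1].lower() if quoted else ""
    return any(hint in agent for hint in ROBOT_HINTS)
```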
The future of web robots
• Intelligent agents.
• Metadata standards (XML, RDF, CDF, embedded metadata).
• Robots becoming part of the web.
WebWatch findings: Analysis of URLs
Domains for public library web sites
WebWatch findings: Server software
Servers used to serve eLib project pages
WebWatch findings: File size analyses
HTML file sizes for UK University entry-points
WebWatch findings: HTML analyses
Top ten tags used within the eLib community
WebWatch findings: Hyperlink profiles
Top ten external domains linked to from all eLib pages
WebWatch findings: Analysis of other document content
Use of metadata in UK university homepages