Date post: | 26-Dec-2015 |
Category: |
Documents |
Upload: | arron-harper |
View: | 221 times |
Download: | 3 times |
Replication, Concludedand New Trends Part I
Zachary G. IvesUniversity of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 24, 2008
2
Reminders
Project demos Friday May 9 Demo must be on mod cluster, at least 4 machines Email to schedule early next week Please aim to have “code complete” by May 1 or so –
integration testing is important!
Project report and final code due Monday May 12 Includes experiments showing scalability of querying,
crawling, or indexing
Final exam May 12, 6-8PM, here in Towne 315
3
Recall: Central Issues in Replication
What to replicate Where to replicate How to maintain consistency
(and how fresh data needs to be) How to route requests to replicas
4
Choosing Which Replica to Use
Round-robin Simply allocate each request to the next server,
incrementing next at each point Does this make sense over the entire Internet?
Load balancing Requires some sort of load-monitoring code at each server
(e.g., how many threads running, queues available, etc.) Server feeds this information into the coordinator What are the dangers of doing this?
Topologically aware Tries to allocate requests to the “nearest” server, according
to network performance This is what content distribution networks like Akamai aim
for
5
Physically Mapping the Requests
Internal routing (transparent to client) One machine masquerades as the server It forwards requests to a particular machine This is commonly done by search engines, CNN,
etc.
Domain Name Server tricks When a client looks up the server, it gets a
different DNS address than other machines do – thus it transparently talks to someone else
Typically, content distribution networks do this
Let’s look at Akamai as an example…
6
Akamai: A Real-WorldContent Distribution Network
Goal is to produce replicas of data from “big” web sites e.g., images from CNN; QuickTime movie trailers from Apple Similar services from Cisco, Digital Island, Exodus, etc.
Basic model: Providers pay Akamai to distribute their content Akamai asks ISPs if they can install boxes in their networks
18,000 servers in 1,000 networks in 69 countries HTTP responses include a “base” container, plus references to
embedded content from “nearby” Akamai boxes Idea: HTML changes frequently; images, videos do not
Newer service, EdgeComputing, also creates JSP pages in the end nodes
7
Bird’s Eye View of AkamaiSupplying QuickTime Content
Client
QT Server1:
GE
T /i
ndex
.htm
2: O
K in
dex.
htm
href
= q
t.aka
mai
.tech
.net
Akamai DNS
3: QUERYqt.akamai.tech.net
4: IP 123.45.67.89(“nearby” Akamai box)
Replica123.45.67.895: GET /my.qt
6: OK my.qt
Akamaiinfrastructure
Occasionalupdates
Occas
iona
l
upda
tes
Network to
pology
informati
on
8
What Akamai’s Secret Algorithms Do
Replica maintenance Not every data item needs to be replicated everywhere
– determine what items should be placed at which replicas
This is based on ideas from peer-to-peer that we’ll discuss later in the course
Estimating the closest replica This is incredibly hard – how to figure out which replica
is the fastest to access – unless a replica at every ISP Studies have shown that Akamai and other CDNs aren’t
perfect [Johnson et al.] But it’s still typically “good enough”
Today Akamai also tries to deliver some kinds of dynamic content, using database-like technology
9
Replication, Summarized
Replication is generally controlled by the content producer – the server
Considerations about where to place replicas for max effect, how to maintain correct semantics of the application We’ve seen Akamai as a real example
Can also have something similar, but client-side-driven, potentially with support from intermediate points in the network…
10
Caching
Caching is, in essence, a lazy, request-driven version of replication
Generally needs to be fully transparent Implemented over standard HTTP Replication can often use proprietary protocols
However: servers want control over what gets cached Why is this important?
Caching can be done at either endpoint, or somewhere in between…
11
Why Caching Works on the Web
Web tends to have a Zipfian distribution of requests Flat in log-log scale A few items are very frequent, many are somewhat
frequent, a huge number are infrequent (“heavy tail”)
Graphs: www.useit.com/alertbox/zipf.html
12
Where Caching is Done on the Web
Typically, the browser does its own local caching Netscape, IE, Mozilla, etc. maintain a large (10s of
MB) cache of documents and images
Some organizations do their own caching using a proxy server Large ISPs, AOL, many companies
At the server-side, certain objects are frequently cached in memory to speed up responses
13
Proxy Servers
At an organizational level, may route all requests through a gateway Sometimes this is a firewall, other times not Sometimes it’s application-transparent, other times
it must be specified
“Proxy server” is a middleman Takes requests from client – makes requests of
server Reads response from server – forwards to client May perform shared caching of requests Very common for large ISPs, businesses, etc.
14
Cache Hits and Misses
Misses are due to several possible factors: First-time request – compulsory miss Uncacheable item
(e.g., HTTPS, item marked as uncacheable) Item expired Insufficient space in cache – capacity miss
(Structure of cache means item will be evicted – conflict miss)
15
Cache Replacement Considerations
Cost of fetching Cost of storage How often it’s been used Probability it will be accessed again When last modified When likely to expire
16
Cache Item Replacement Algorithms
Algorithms look similar to those in other disciplines (OS virtual memory, microprocessor caching): Least Recently Used Least Frequently Used Others
Size – remove largest item in cache Hybrid of LFU/LRU and size etc.
This problem is easier than in other disciplines – why?
17
Summary
Replication and caching are very common in today’s web Improve performance Replication also provides greater availability
Replication is generally done with server knowledge; caching is done (roughly) transparently
Both rely on frequent requests
18
The Future
Let’s take a brief look at current trends – what might be a few years down the pike…
Let’s step back and look at the fundamental assumptions we’ve been making in our distributed architectures What basic assumptions do we make about our nodes
in any of our basic architectures (client-server, P2P, etc.)?
What basic assumptions do we make about the data? What basic assumptions do we generally make about
distributing computation?
19
Smaller Systems
What lies in the future? RF tags, cameras,
camera phones, temperature sensors, etc.
… all interconnected by some form of networking
Sensor networks: the latest rage in distributed systems research
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
20
What Can We Do with Sensor Networks?
Environmental monitoring: temperature in different parts of a building air quality monitoring equipment and staff in hospitals etc.
Law enforcement: Video feeds and anomalous behavior (J. Shi)
Research studies: Study ocean temperature, currents Monitor status of eggs in endangered birds’ nests
Fun: Record sporting events or performances from every
angle (video & audio) New immersive environments
21
Why They’re Hard
Many, many devices Power and resource constraints
Most of these devices are wireless, tiny, battery-powered Can only transmit data every so often!!
High rate of failure and error Use redundancy to overcome this
Very limited intelligence Many sensors can’t run sophisticated code Can we use the large amount of parallelism to
compensate?
Very local knowledge Know about a few nodes within proximity
22
The Problems of Focus
Languages: how do we express what we want to do with sensor networks Surprisingly effective: subset of SQL for monitoring data
from relatively simple sensors Why would this be?
Robustness: need to combine info from many sensors to account for individual errors
Routing: need to aggregate data in a power-efficient way
Streams: data is an infinitely long sequence – how do we deal with that? Summarization data structures (data is roughly according
to this distribution) Operations over “sliding windows” Again, SQL is the basis of a lot of work!
23
Will Sensor Networks Make Itto the Real World?
A definite “yes”… … because they’re already deployed: RFIDs, cameras,
traffic monitors But they aren’t yet general-purpose Implications on privacy?
We don’t yet understand how to write applications for sensor networks Want to insulate the programmer from low-level
considerations Want to satisfy performance and resource constraints What is the right level of describing an application? Web
services? Something else? Can database-style languages get us most of the way?
Sensor Net Research at Penn(Ives, Guha, Lee, Loo; Mihaylov, Liu, Jacob)
The Internet is now heavily based on “streaming” data, remote devices that do sensing Sometimes it’s “motes”, other times it’s routers, monitoring
software on servers, etc. It’s very complicated to program for all of these devices
Can we build apps that let us integrate and monitor relevant data, without worrying about device specifics?
The key idea: use query languages (think XQuery or SQL) as the basic way of requesting sensor data Extend with ideas from data integration, to support
heterogeneous sensors, combining sensor data with databases, etc.
24
ASPEN: Sensors as Distributed Data
Goal: extensible monitoring of streaming extensible monitoring of streaming data sourcesdata sources Programmer specifies computation, and the
processing and data “flow” to where it’s most “flow” to where it’s most effectiveeffective A smart optimizer knows about devices and
connectivity
Programming is data-centricdata-centric, not device-centric Everything is abstracted as tables
View of the system continuously refreshedcontinuously refreshed
Basic Approach 1/5Hide physical connectivity and location details from programmer – group data sources into abstract relations
Mic(lat, long,time,sample)
Video(lat, long,time,frame)
Basic Approach 2/5
Mic(lat, long,time,sample)
Video(lat, long,time,frame)
Represent each sensor as the source of a stream of time-varying tuples
(385301,770201,1,)
(385302,770201,1,)
(385303,770201,1,)
(385301,770202,1,)
(385301,770202,1, )
(385302,770202,1,)
(385300,770200,1,―)
(385302,770200,1,―)
(385300,770201,1,―)
(385301,770202,1, ┘)
(385300,770200,1,―)
,(385300,770200,2, ―)
,(385302,770200,2, ┘)
,(385300,770201,2, ┘)
,(385301,770202,2, ┐)
,(385300,770200,2, ―)
, (385301,770201,2,)
, (385302,770201,2,)
, (385303,770201,2, )
, (385301,770202,2,)
, (385301,770202,2,)
, (385302,770202,2,)
,…
,…
,…
,…
,…
, …
, …
, …
, …
, …
, …
Basic Approach 3/5
“Show me all of the video frames between [38°53.01’,77°02.01’] and [38°53.03’,77°02.01’] with a ”
“How many video frames with a are also near a microphone sample with sound?”
… Can also combine with lookups in tables to do data integration
e.g., “Show me video frames with a that fall within the coordinates of the conference room inRoomTable?”
e.g., “Find the ssn of Bob Smith, use this to look up histransponder ID, and show me video near him”
Support queries based on properties of the data, independent of the devices
Basic Approach 4/5Support logical views – “abstract sensors” integratingdata from different types of lower-level sensors
(385301,770201,1,), (385301,770201,2,)
(385302,770201,1,), (385302,770201,2,)
(385303,770201,1,), (385303,770201,2, )
(385301,770202,1,), (385301,770202,2,)
(385301,770202,1, ), (385301,770202,2,)
(385302,770202,1,), (385302,770202,2,)
(385300,770200,1,―), (385300,770200,2, ―),…
(385302,770200,1,―), (385302,770200,2, ┘) ,…
(385300,770201,1,―), (385300,770201,2, ┘) ,…
(385301,770202,1, ┘), (385301,770202,2, ┐) ,…
(385300,770200,1,―), (385300,770200,2, ―) ,…
AVObservations(lat, long,time,frame,sample) :- video(lat,long,time,frame), mic(lat2,long2,time,sample)
where dist(lat,long,lat2,long2) < 5m and sample > ― and frame >
Basic Approach 5/5
(385303,770201,1,), (385303,770201,2, )
(385301,770202,1, ),
(385302,770200,2, ┘) ,…
(385300,770201,2, ┘) ,…
(385301,770202,1, ┘), (385301,770202,2 ┐) ,…
(385303,770201,1,, ┘), (385301,770202,1,, ┘), (385303,770201,2,, ┘), (385303,770201,2,, ┘), (385303,770201,2, , ┐),
…
Support logical views – “abstract sensors” integratingdata from different types of lower-level sensors
AVObservations(lat, long,time,frame,sample) :- video(lat,long,time,frame), mic(lat2,long2,time,sample)
where dist(lat,long,lat2,long2) < 5m and sample > ― and frame >
Challenges We Are Addressing
Data integration has been based on static Data integration has been based on static datadata AdaptAdapt mappings, queries to stream data, including
timing, synchronization, link properties, …
Optimization of queries is hard in the Optimization of queries is hard in the simplest case, and here we need to do it simplest case, and here we need to do it in distributed fashion with limited in distributed fashion with limited knowledgeknowledge Distribute computation Distribute computation to the network, and to the
devices with the “right” position and “right” capabilities
31