Future Directions and Course Wrap-up
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 29, 2008
Today’s Plan
  Reminders:
    Project demos the afternoon of May 9
    Project report due May 12 before the final, INCLUDING experimental evaluation
  A brief discussion on experimental methodology and some suggestions
  Where the Internet and Web might be heading
  A few minutes for any pressing questions … and course evaluations
Experiments – Show It’s So!
  The general goal: to help demonstrate and show why a real-world artifact provides a benefit
    Versus some benchmark or naïve strategy
    We also want to understand why there’s a benefit
  Some common kinds of experiments:
    Usability: some sort of user tests, versus a benchmark
    Performance: as we increase the workload, what happens?
    Scalability: as we increase the data, devices, nodes, what happens?
    Complexity: especially for things like code, what happens as we make the task harder or bigger?
Experimentation
  In general, experiments should follow the scientific method:
    Hypothesis (e.g., our method will do better than XYZ on workloads like QWV, which are representative of domain ABC)
    Experiment (examine this – may need many trials, random workloads, etc.)
    Conclusion (show, with statistically significant measurements, that the hypothesis is true)
  Often, the hypothesis almost goes unsaid in computer science – it’s implicit in the choice of the problem – but it is there!
  Note that many attributes, e.g., elegance, style, are not very amenable to experiments
  Others, like expressiveness, generally need to be proven rather than run
Experimental Workloads
  There are generally three kinds of systems experiments:
    Synthetic microbenchmark: experimental runs are done over inputs that are generated to stress a specific factor, but are not particularly realistic
      Examples: a hard disk random access test; a web server’s maximum throughput
      Really shows the factor of interest; can be tweaked, scaled, etc.
    Synthetic based on real behavior: experimental runs are done over inputs that are modeled after real data, but perhaps generated randomly
      Examples: SPEC benchmarks; TPC-W web transaction benchmark
      Enables us to generate more inputs, testing scalability, etc.
    Real-world: traces are collected of real system behavior over real data
      Disadvantage: hard to quantify or control the different factors
Experimental Methodology
  Consider the important factors that you wish to examine (and demonstrate)
    Scalability – can typically be in terms of running time, size of the problem, space consumed, etc.
    Here: performance is what matters
  Break it down into individual parameters
    Crawl & index time; time to answer a query; etc.
  Consider a workload that helps measure the parameter
    Crawl 1000 documents; run 50 queries 10 times apiece; etc.
  Vary one parameter at a time, study effects
    Number of machines; number of threads per machine; etc.
  Run experiment multiple times; average and show 95% confidence intervals in line (continuous) or bar (discrete) chart
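The "average and show 95% confidence intervals" step can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `mean_with_ci95` and the latency values are hypothetical, and the 1.96 multiplier is the normal approximation, which understates the interval when there are only a handful of trials (a Student's t critical value would be wider).

```python
import math
import statistics

def mean_with_ci95(samples):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two trials to estimate variance")
    mean = statistics.mean(samples)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    # Assumption: normal approximation (1.96); use a t value for small n.
    return mean, 1.96 * sem

# Hypothetical latencies (ms) for one query run 10 times at one setting.
trials_ms = [102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 101.0, 99.0]
mean, half = mean_with_ci95(trials_ms)
print(f"{mean:.1f} ms +/- {half:.1f} ms")
```

Repeating this for each value of the one parameter being varied gives the points and error bars for the chart.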
The Future: Where Is the Internet and the Web Headed?
  Technology trends:
    Larger numbers of compute nodes (clusters, embedded devices, multicore, etc.)
    Bandwidth goes up, latency doesn’t
    Wireless and mobile devices
    Heterogeneous devices on the same network
  General goals:
    Provide higher-level programming abstractions, more automatic configuration/inference, especially as complexity goes up
    Scalability, reliability, availability, …
Trend 1: Wireless Sensor Devices
  Useful for environmental monitoring
  Interesting connection between digital & physical world
  Challenges:
    Many, many devices (redundancy)
    Limited power, CPU, bandwidth
    High rate of failure and error
    Very local knowledge – only proximity
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
The Problems of Focus
  Hardware: more efficient, more powerful nodes
  Robustness: need to combine info from many sensors to account for individual errors
  Routing: need to aggregate data in a power-efficient way
  Streams: data is an infinitely long sequence – how do we deal with that?
    Summarization data structures (data is roughly according to this distribution)
    Operations over “sliding windows”
  Programming: how do we express what we want to do with sensor networks?
    Surprisingly effective: XQuery/SQL-like languages for monitoring data (e.g., TinyDB [Madden+03])
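One way to picture an operation over a "sliding window" is a running average that only ever holds the most recent readings, so it works on a stream of unbounded length. The sketch below is illustrative only; `sliding_average` and the readings are hypothetical and are not taken from TinyDB or any sensor platform.

```python
from collections import deque

def sliding_average(stream, window_size):
    """Yield the mean of the most recent `window_size` readings."""
    window = deque(maxlen=window_size)
    total = 0.0
    for reading in stream:
        if len(window) == window_size:
            # The oldest reading is about to be evicted by append().
            total -= window[0]
        window.append(reading)
        total += reading
        yield total / len(window)

# Hypothetical temperature readings arriving one at a time.
readings = [20.0, 22.0, 21.0, 30.0, 19.0]
averages = list(sliding_average(iter(readings), window_size=3))
print(averages)
```

Memory use is bounded by the window size rather than by the length of the stream, which is the point of windowed operators on constrained devices.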
Example: Sensor Net Research at Penn
(Ives, Guha, Lee, Loo; Mihaylov, Liu, Jacob)
  The Internet has many “streaming” data sources & devices
    Motes, routers, monitoring software on servers, etc.
  Can we build apps that let us integrate and monitor relevant data, without worrying about device specifics?
  The key idea: use query languages (think XQuery or SQL) as the basic way of requesting sensor data
    Extend with ideas from data integration, to support heterogeneous sensors, combining sensor data with databases, etc.
    Figure out how to optimize these queries
    Supplement the query language with Java, etc.
Why Query Languages?
  They make programming data-centric, not device-centric
    Everything abstracted as tables / XML documents
    Request all data values with a particular property, etc.
  They allow for simple composition (views)
  They are amenable to optimization
    Idea: place computation at “the right” nodes in the network
Basic Approach 1/5
  Hide physical connectivity and location details from programmer – group data sources into abstract relations
    Mic(lat, long, time, sample)
    Video(lat, long, time, frame)
Basic Approach 2/5
  Represent each sensor as the source of a stream of time-varying tuples
    Mic(lat, long, time, sample)
    Video(lat, long, time, frame)
  [Figure: each Video sensor emits a stream of tuples such as (385301,770201,1,[frame]), (385301,770201,2,[frame]), … and each Mic sensor emits a stream such as (385300,770200,1,―), (385300,770200,2,―), …; the frame and sample values were drawn as icons on the slide]
Basic Approach 3/5
  Support queries based on properties of the data, independent of the devices
    “Show me all of the video frames between [38°53.01’,77°02.01’] and [38°53.03’,77°02.01’] with a [icon]”
    “How many video frames with a [icon] are also near a microphone sample with sound?”
    …
  Can also combine with lookups in tables to do data integration
    e.g., “Show me video frames with a [icon] that fall within the coordinates of the conference room in RoomTable?”
    e.g., “Find the ssn of Bob Smith, use this to look up his transponder ID, and show me video near him”
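The first query above (frames between two coordinates) can be read as a filter over the abstract Video relation, with no mention of which camera produced each tuple. The sketch below is a rough Python rendering, not the project's query language: `VideoTuple`, `frames_between`, and the frame payloads are hypothetical, and it assumes the integer encoding seen in the tuples (385301 standing for 38°53.01’).

```python
from typing import NamedTuple

class VideoTuple(NamedTuple):
    lat: int    # assumed encoding: 385301 stands for 38 deg 53.01'
    long: int   # assumed encoding: 770201 stands for 77 deg 02.01'
    time: int
    frame: str  # placeholder for the frame payload

# Hypothetical contents of the Video relation at time 1.
video = [
    VideoTuple(385301, 770201, 1, "frame-a"),
    VideoTuple(385302, 770201, 1, "frame-b"),
    VideoTuple(385303, 770201, 1, "frame-c"),
    VideoTuple(385301, 770202, 1, "frame-d"),
]

def frames_between(stream, lat_lo, lat_hi, long_lo, long_hi):
    """Select tuples by a property of the data (location), not by device."""
    return [t for t in stream
            if lat_lo <= t.lat <= lat_hi and long_lo <= t.long <= long_hi]

hits = frames_between(video, 385301, 385303, 770201, 770201)
print([t.frame for t in hits])
```

The content-based part of the query ("with a ...") would be one more predicate on `frame`; it is left out here because the slide's icon did not survive.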
Basic Approach 4/5
  Support logical views – “abstract sensors” integrating data from different types of lower-level sensors
  [Figure: the Video tuple streams, e.g. (385301,770201,1,[frame]), (385301,770201,2,[frame]), …, and the Mic tuple streams, e.g. (385300,770200,1,―), (385300,770200,2,―), …, feed the view definition; the frame and sample values were drawn as icons on the slide]
  AVObservations(lat, long, time, frame, sample) :- video(lat, long, time, frame), mic(lat2, long2, time, sample)
    where dist(lat, long, lat2, long2) < 5m and sample > ― and frame > [icon]
Basic Approach 5/5
  Support logical views – “abstract sensors” integrating data from different types of lower-level sensors
  [Figure: a subset of the Video and Mic tuples is combined into AVObservations tuples such as (385303,770201,1,[frame],┘), (385301,770202,1,[frame],┘), (385303,770201,2,[frame],┐), …; the frame and sample values were drawn as icons on the slide]
  AVObservations(lat, long, time, frame, sample) :- video(lat, long, time, frame), mic(lat2, long2, time, sample)
    where dist(lat, long, lat2, long2) < 5m and sample > ― and frame > [icon]
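Read operationally, the AVObservations view is a join of the video and mic streams on time, restricted by distance and by predicates on the sample and frame. The following is only a sketch of that reading: the data, the planar `dist` in encoded coordinate units (not metres), the numeric sample values, and the `frame_ok` predicate are all hypothetical stand-ins for details the slide shows as icons.

```python
import math

# Hypothetical tuples: (lat, long, time, frame) and (lat, long, time, sample).
video = [(385301, 770201, 1, "frame-a"), (385303, 770201, 1, "frame-b")]
mic = [(385300, 770200, 1, 0.9), (385302, 770200, 1, 0.1)]

def dist(lat, long, lat2, long2):
    # Assumption: plain Euclidean distance in encoded units, for illustration.
    return math.hypot(lat - lat2, long - long2)

def av_observations(video, mic, max_dist, min_sample, frame_ok):
    """Nested-loop join of video and mic tuples that share a timestamp."""
    out = []
    for lat, long, time, frame in video:
        for lat2, long2, time2, sample in mic:
            if (time == time2
                    and dist(lat, long, lat2, long2) < max_dist
                    and sample > min_sample
                    and frame_ok(frame)):
                out.append((lat, long, time, frame, sample))
    return out

view = av_observations(video, mic, max_dist=2.0, min_sample=0.5,
                       frame_ok=lambda frame: True)
print(view)
```

A nested loop is the simplest evaluation strategy; deciding where in the network to run such a join is the optimization question raised on the next slide.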
Challenges We Are Addressing
  Data integration has been based on static data
    Adapt mappings, queries to stream data, including timing, synchronization, link properties, …
  Optimization of queries is hard in the simplest case, and here we need to do it in distributed fashion with limited knowledge
    Distribute computation to the network, and to the devices with the “right” position and “right” capabilities
From Small to Big Devices: Cloud Computing
  Four years ago, “grid computing” – mostly intended for science – tried to make large supercomputers available to run batch jobs
    “Grid” as in “electric power grid”
    Very difficult problems: allocating jobs to nodes, locating resources, scheduling, etc.
    Many felt this didn’t succeed
  Today’s buzzword: “cloud computing”
    Actually captures many different compute models
    Basics: someone else with cluster expertise maintains large numbers of machines; they run your jobs for you
Cloud Computing Capabilities
  Google App Engine: hosts Web apps
    Python-based programming environment, get/put storage interface, connections to Google accounts, URL fetching
    Automatic scaling & load balancing
  Amazon: hosts a variety of compute, storage jobs
    Simple Storage Service (S3) – get/put access via REST/SOAP
    Elastic Compute Cloud (EC2) – runs virtual machines
    SimpleDB – table-oriented storage / query interface
  Can you name a few challenges?
The Semantic Web
  Tim Berners-Lee, creator of the web:
    Let’s re-imagine the Web as a means of interlinking meaning rather than just providing hyperlinks
    All information will be annotated with its semantics, and it will be easy to map between different interpretations
    Google, ca. 2010, might actually be able to give you answers instead of web pages
  A nice dream – is it realizable?
The Content of the Semantic Web
  Resource Description Framework (RDF)
    Triples, describing objects (with IDs), properties, and values (which also may reference other objects)
      (NetworkBook, hasAuthor, Rexford)
      (Rexford, memberOf, Person)
  RDF triples describe a graph, not a tree
  RDF has (several) XML representations with some built-in concepts like identity
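To see why triples form a graph, it can help to treat each one as a labelled edge from subject to object. The sketch below stores the two triples from the slide as plain Python tuples; the helper names `objects` and `reachable` are hypothetical, and this is not an RDF library or an RDF serialization.

```python
# The two example triples: (subject, property, value).
triples = [
    ("NetworkBook", "hasAuthor", "Rexford"),
    ("Rexford", "memberOf", "Person"),
]

def objects(triples, subject, predicate):
    """All values v such that (subject, predicate, v) is a triple."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]

def reachable(triples, start):
    """Every node reachable from `start` by following edges of any label."""
    seen = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for s, _p, o in triples:
            if s == node and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen

print(objects(triples, "NetworkBook", "hasAuthor"))
print(sorted(reachable(triples, "NetworkBook")))
```

Because a value such as Rexford can itself be the subject of other triples, and several subjects can point at the same value, the structure is a general graph rather than a tree.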
RDF Example, Visualized
Semantics through Ontologies
  Schemas describe simple relationships between concepts
  An ontology is like a very sophisticated class hierarchy over which queries may make inferences:
    Expresses basic concepts and their relationships
    In the Semantic Web, express constraints on concepts in a language called OWL:
      PetOwner(x) <=> Person(x) and CardinalityOf(Pet(x,y)) > 0
      DogOwner(x) <=> PetOwner(x) And Exists Pet(x,y), y isa Dog
      CatOwner(x) <=> PetOwner(x) And Exists Pet(x,y), y isa Cat
  Can ask: what classes does a person with a dog and a cat belong to? Is Person(EricMiller) a DogOwner?
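The kind of inference being asked for can be imitated by applying the three definitions directly to a small set of facts. This is a hand-rolled sketch, not OWL and not a reasoner: the facts about EricMiller, Alice, Rex, and Tom are invented for illustration, and `classes_of` is a hypothetical helper.

```python
# Hypothetical facts, invented for illustration only.
persons = {"EricMiller", "Alice"}
pet = [("EricMiller", "Rex"), ("EricMiller", "Tom")]  # Pet(x, y): x has pet y
isa = {"Rex": "Dog", "Tom": "Cat"}

def classes_of(x):
    """Apply the PetOwner / DogOwner / CatOwner definitions to individual x."""
    classes = set()
    owned = [y for (owner, y) in pet if owner == x]
    if x in persons:
        classes.add("Person")
    # PetOwner(x) <=> Person(x) and CardinalityOf(Pet(x,y)) > 0
    if "Person" in classes and len(owned) > 0:
        classes.add("PetOwner")
    # DogOwner / CatOwner: a PetOwner with at least one pet of that kind.
    for kind in ("Dog", "Cat"):
        if "PetOwner" in classes and any(isa.get(y) == kind for y in owned):
            classes.add(kind + "Owner")
    return classes

print(sorted(classes_of("EricMiller")))
print("DogOwner" in classes_of("Alice"))
```

Under these made-up facts, a person with both a dog and a cat falls into every class defined on the slide, while a person with no pets is only a Person.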
Ontologies and the Semantic Web
  The goal: start categorizing things into ontologies – map meanings of various entities
    Different ontologies can be defined, with different “namespaces”
    Can build many different topic-specific Semantic Webs
  The Semantic Web technologies have been fairly stable for a while…
    So why aren’t we seeing more implementations?
A Possible Pitfall of Today’s SW
  Data integration teaches us that there’s a huge problem in mapping between different representations
    In database-land, these are schemas; in the Semantic Web, ontologies
    The Semantic Web doesn’t have good technologies for mapping – simple conversions, e.g., dollars to Euros, aren’t expressible
  A middle ground:
    Can we extend ideas from data integration, like mappings between XML schemas, to get most of the Semantic Web’s benefits?
    Something we and others are pursuing – e.g., Hyperion @ Toronto, Orchestra @ Penn, …
Recap
  Distributed, Web-scale systems are here to stay
    They create many issues that are not totally resolved, and for which there is no one answer:
      Heterogeneity
      Timing
      Partitioning and replication
      Consistency and integrity
      Etc.
  This course tried to give you a sense of the issues and state-of-the-art – as well as the skills to go out and work in this domain
    I hope the amount of work we all sank into the material (and the homeworks) will pay off for you!
    And stay tuned – there’s lots more to come!