
Future Directions and Course Wrap-up

Zachary G. Ives

University of Pennsylvania

CIS 455 / 555 – Internet and Web Systems

April 29, 2008

Today’s Plan

Reminders:

  Project demos the afternoon of May 9

  Project report due May 12 before the final, INCLUDING experimental evaluation

A brief discussion on experimental methodology and some suggestions

Where the Internet and Web might be heading

A few minutes for any pressing questions … and course evaluations

Experiments – Show It’s So!

The general goal: to help demonstrate and show why a real-world artifact provides a benefit

  Versus some benchmark or naïve strategy

  We also want to understand why there’s a benefit

Some common kinds of experiments:

  Usability: some sort of user tests, versus a benchmark

  Performance: as we increase the workload, what happens?

  Scalability: as we increase the data, devices, nodes, what happens?

  Complexity: especially for things like code, what happens as we make the task harder or bigger?

Experimentation

In general, experiments should follow the scientific method:

  Hypothesis (e.g., our method will do better than XYZ on workloads like QWV, which are representative of domain ABC)

  Experiment (examine this – may need many trials, random workloads, etc.)

  Conclusion (show, with statistically significant measurements, that the hypothesis is true)

Often, the hypothesis almost goes unsaid in computer science – it’s implicit in the choice of the problem – but it is there!

Note that many attributes, e.g., elegance, style, are not very amenable to experiments

Others, like expressiveness, generally need to be proven rather than run

Experimental Workloads

There are generally three kinds of systems experiments:

Synthetic microbenchmark: experimental runs are done over inputs that are generated to stress a specific factor, but are not particularly realistic

  Examples: a hard disk random access test; a web server’s maximum throughput

  Really shows the factor of interest; can be tweaked, scaled, etc.

Synthetic based on real behavior: experimental runs are done over inputs that are modeled after real data, but perhaps generated randomly

  Examples: SPEC benchmarks; TPC-W web transaction benchmark

  Enables us to generate more inputs, testing scalability, etc.

Real-world: traces are collected of real system behavior over real data

  Disadvantage: hard to quantify or control the different factors
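As a rough illustration of the "synthetic based on real behavior" category, the sketch below generates a reproducible query workload in which a few terms are much more popular than the rest. The function name `zipf_workload`, the 1/rank weighting, and the sample vocabulary are assumptions made for this example only; a real workload model would be fitted to observed traces.

```python
import random

def zipf_workload(vocabulary, num_queries, seed=0):
    """Draw query terms with popularity skewed toward the front of the list.

    The 1/rank weighting is an illustrative assumption, not a model
    fitted to any real query log.
    """
    rng = random.Random(seed)  # fixed seed makes the run repeatable
    weights = [1.0 / rank for rank in range(1, len(vocabulary) + 1)]
    return rng.choices(vocabulary, weights=weights, k=num_queries)

# Hypothetical vocabulary, most popular term first.
terms = ["web", "crawler", "index", "query", "cache"]
workload = zipf_workload(terms, num_queries=50, seed=42)
print(workload[:10])
```

Fixing the seed keeps the workload identical across runs, so differences in measured time come from the system under test rather than from the inputs.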

Experimental Methodology

Consider the important factors that you wish to examine (and demonstrate)

  Scalability – can typically be in terms of running time, size of the problem, space consumed, etc.

  Here: performance is what matters

Break it down into individual parameters

  Crawl & index time; time to answer a query; etc.

Consider a workload that helps measure the parameter

  Crawl 1000 documents; run 50 queries 10 times apiece; etc.

Vary one parameter at a time, study effects

  Number of machines; number of threads per machine; etc.

Run experiment multiple times; average and show 95% confidence intervals in line (continuous) or bar (discrete) chart
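The last step (average repeated runs and report a 95% confidence interval) can be sketched as follows. This is a minimal example assuming the trials are independent and that a normal approximation is acceptable; the function name and the sample timings are made up for illustration.

```python
import math
import statistics

def mean_with_ci95(samples):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two trials")
    mean = statistics.mean(samples)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    # 1.96 is the normal-approximation critical value; with only a few
    # trials, a (larger) Student-t critical value is more appropriate.
    return mean, 1.96 * sem

# Hypothetical timings (ms) for one query run five times.
query_times_ms = [120.0, 130.0, 125.0, 135.0, 115.0]
mean, half_width = mean_with_ci95(query_times_ms)
print(f"{mean:.1f} ms +/- {half_width:.1f} ms")
```

The returned half-width is what would be drawn as the error bar around each point or bar in the chart.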

The Future: Where Is the Internet and the Web Headed?

Technology trends:

  Larger numbers of compute nodes (clusters, embedded devices, multicore, etc.)

  Bandwidth goes up, latency doesn’t

  Wireless and mobile devices

  Heterogeneous devices on the same network

General goals:

  Provide higher-level programming abstractions, more automatic configuration/inference, especially as complexity goes up

  Scalability, reliability, availability, …

Trend 1: Wireless Sensor Devices

Useful for environmental monitoring

Interesting connection between digital & physical world

Challenges:

  Many, many devices (redundancy)

  Limited power, CPU, bandwidth

  High rate of failure and error

  Very local knowledge – only proximity

http://robotics.eecs.berkeley.edu/~pister/SmartDust/

The Problems of Focus

Hardware: more efficient, more powerful nodes

Robustness: need to combine info from many sensors to account for individual errors

Routing: need to aggregate data in a power-efficient way

Streams: data is an infinitely long sequence – how do we deal with that?

  Summarization data structures (data is roughly according to this distribution)

  Operations over “sliding windows”

Programming: how do we express what we want to do with sensor networks?

  Surprisingly effective: XQuery/SQL-like languages for monitoring data (e.g., TinyDB [Madden+03])
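To make the "sliding windows" idea concrete, here is a minimal sketch of a windowed average over an unbounded stream of readings. It keeps only the last `window_size` values in memory, which is the point of the technique; the function name and the count-based (rather than time-based) window are assumptions for this example.

```python
from collections import deque

def sliding_average(stream, window_size):
    """Yield the average of the most recent `window_size` readings."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window = deque(maxlen=window_size)
    total = 0.0
    for reading in stream:
        if len(window) == window_size:
            total -= window[0]  # oldest value is about to be evicted
        window.append(reading)
        total += reading
        yield total / len(window)

# Works on any iterable, including a generator that never ends.
averages = list(sliding_average([1, 2, 3, 4, 5], window_size=3))
print(averages)
```

Because the function is a generator, it produces one output per input and never needs to see the end of the stream.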

Example: Sensor Net Research at Penn (Ives, Guha, Lee, Loo; Mihaylov, Liu, Jacob)

The Internet has many “streaming” data sources & devices

  Motes, routers, monitoring software on servers, etc.

Can we build apps that let us integrate and monitor relevant data, without worrying about device specifics?

The key idea: use query languages (think XQuery or SQL) as the basic way of requesting sensor data

Extend with ideas from data integration, to support heterogeneous sensors, combining sensor data with databases, etc.

Figure out how to optimize these queries

Supplement the query language with Java, etc.

Why Query Languages?

They make programming data-centric, not device-centric

  Everything abstracted as tables / XML documents

  Request all data values with a particular property

They allow for simple composition (views)

They are amenable to optimization

  Idea: place computation at “the right” nodes in the network

Basic Approach 1/5

Hide physical connectivity and location details from programmer – group data sources into abstract relations

Mic(lat, long,time,sample)

Video(lat, long,time,frame)

Basic Approach 2/5

Mic(lat, long,time,sample)

Video(lat, long,time,frame)

Represent each sensor as the source of a stream of time-varying tuples

[Figure: each Video sensor emits tuples such as (385301,770201,1,frame), (385301,770201,2,frame), … and each Mic sensor emits tuples such as (385300,770200,1,sample), (385300,770200,2,sample), …]

Basic Approach 3/5

“Show me all of the video frames between [38°53.01’,77°02.01’] and [38°53.03’,77°02.01’] with a [image]”

“How many video frames with a [image] are also near a microphone sample with sound?”

… Can also combine with lookups in tables to do data integration

e.g., “Show me video frames with a [image] that fall within the coordinates of the conference room in RoomTable?”

e.g., “Find the ssn of Bob Smith, use this to look up his transponder ID, and show me video near him”

Support queries based on properties of the data, independent of the devices
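A query like the first one above can be sketched as a filter over the abstract Video relation. The encoding of 38°53.01’ as the integer 385301 is inferred from the sample tuples on the earlier slide; the `VideoTuple` type, the `frames_in_region` function, and the `matches` predicate (standing in for whatever image condition the query names) are hypothetical.

```python
from typing import Callable, Iterable, Iterator, NamedTuple, Tuple

class VideoTuple(NamedTuple):
    lat: int    # e.g., 385301 for 38 deg 53.01 min (assumed encoding)
    long: int   # e.g., 770201 for 77 deg 02.01 min (assumed encoding)
    time: int
    frame: str  # stand-in for image data

def frames_in_region(
    stream: Iterable[VideoTuple],
    lat_range: Tuple[int, int],
    long_range: Tuple[int, int],
    matches: Callable[[str], bool],
) -> Iterator[VideoTuple]:
    """Select tuples by position and content, with no reference to devices."""
    lat_lo, lat_hi = lat_range
    long_lo, long_hi = long_range
    for t in stream:
        in_box = lat_lo <= t.lat <= lat_hi and long_lo <= t.long <= long_hi
        if in_box and matches(t.frame):
            yield t

# Hypothetical data: frame contents are just labels here.
video = [
    VideoTuple(385301, 770201, 1, "person"),
    VideoTuple(385302, 770201, 1, "empty"),
    VideoTuple(385301, 770202, 1, "person"),
]
hits = list(frames_in_region(video, (385301, 385303), (770201, 770201),
                             matches=lambda frame: frame == "person"))
print(hits)
```

The caller names only properties of the data (a region and a content predicate); which camera produced each tuple never appears in the query.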

Basic Approach 4/5

Support logical views – “abstract sensors” integrating data from different types of lower-level sensors

[Figure: the Video tuple streams, e.g., (385301,770201,1,frame), (385301,770201,2,frame), and the Mic tuple streams, e.g., (385300,770200,1,sample), (385300,770200,2,sample), …, as inputs to the view]

AVObservations(lat, long,time,frame,sample) :- video(lat,long,time,frame), mic(lat2,long2,time,sample)

where dist(lat,long,lat2,long2) < 5m and sample > ― and frame >

Basic Approach 5/5

[Figure: only the Video and Mic tuples that satisfy the view’s conditions remain, and they are combined into output tuples of the form (lat, long, time, frame, sample)]

Support logical views – “abstract sensors” integrating data from different types of lower-level sensors

AVObservations(lat, long,time,frame,sample) :- video(lat,long,time,frame), mic(lat2,long2,time,sample)

where dist(lat,long,lat2,long2) < 5m and sample > ― and frame >
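The AVObservations view is essentially a join of the two relations on time and proximity, plus a filter on each side. The sketch below evaluates it with a nested loop over finite batches of tuples; a real stream system would bound the join with windows. The `near`, `frame_ok`, and `sample_ok` predicates are hypothetical stand-ins for the 5m distance test and the two thresholds in the rule.

```python
def av_observations(video, mic, near, frame_ok, sample_ok):
    """Join Video and Mic tuples that share a timestamp and are close together.

    video: iterable of (lat, long, time, frame)
    mic:   iterable of (lat2, long2, time, sample)
    """
    mic = list(mic)  # the inner side is scanned once per video tuple
    results = []
    for lat, long, time, frame in video:
        if not frame_ok(frame):
            continue
        for lat2, long2, time2, sample in mic:
            if (time == time2 and sample_ok(sample)
                    and near(lat, long, lat2, long2)):
                results.append((lat, long, time, frame, sample))
    return results

# Hypothetical inputs and predicates, for illustration only.
video = [(385301, 770201, 1, "person"), (385303, 770201, 1, "empty")]
mic = [(385300, 770200, 1, 0.9), (385302, 770200, 1, 0.0),
       (385300, 770200, 2, 0.9)]
observations = av_observations(
    video, mic,
    near=lambda a, b, c, d: abs(a - c) <= 1 and abs(b - d) <= 1,
    frame_ok=lambda frame: frame != "empty",
    sample_ok=lambda sample: sample > 0.5,
)
print(observations)
```

Applications can then query the result as if it came from a single audio-video sensor, which is what makes the view an "abstract sensor".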

Challenges We Are Addressing

Data integration has been based on static data

  Adapt mappings, queries to stream data, including timing, synchronization, link properties, …

Optimization of queries is hard in the simplest case, and here we need to do it in distributed fashion with limited knowledge

  Distribute computation to the network, and to the devices with the “right” position and “right” capabilities

From Small to Big Devices: Cloud Computing

Four years ago, “grid computing” – mostly intended for science – tried to make large supercomputers available to run batch jobs

  “Grid” as in “electric power grid”

  Very difficult problems: allocating jobs to nodes, locating resources, scheduling, etc.

  Many felt this didn’t succeed

Today’s buzzword: “cloud computing”

  Actually captures many different compute models

  Basics: someone else with cluster expertise maintains large numbers of machines; they run your jobs for you

Cloud Computing Capabilities

Google App Engine: hosts Web apps

  Python-based programming environment, get/put storage interface, connections to Google accounts, URL fetching

  Automatic scaling & load balancing

Amazon: hosts a variety of compute, storage jobs

  Simple Storage Service (S3) – get/put access via REST/SOAP

  Elastic Compute Cloud (EC2) – runs virtual machines

  SimpleDB – table-oriented storage / query interface

Can you name a few challenges?

The Semantic Web

Tim Berners-Lee, creator of the web:

  Let’s re-imagine the Web as a means of interlinking meaning rather than just providing hyperlinks

  All information will be annotated with its semantics, and it will be easy to map between different interpretations

  Google, ca. 2010, might actually be able to give you answers instead of web pages

A nice dream – is it realizable?

The Content of the Semantic Web

Resource Description Framework -- RDF

  Triples, describing objects (with IDs), properties, and values (which also may reference other objects)

    (NetworkBook, hasAuthor, Rexford)

    (Rexford, memberOf, Person)

RDF triples describe a graph, not a tree

RDF has (several) XML representations with some built-in concepts like identity
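The two example triples can be held as plain tuples and queried by pattern, which shows why the data forms a graph: the value of one triple (Rexford) is the subject of another. This is only a sketch with ordinary Python sets; the `match` helper is hypothetical and is not the API of any RDF library.

```python
triples = {
    ("NetworkBook", "hasAuthor", "Rexford"),
    ("Rexford", "memberOf", "Person"),
}

def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Follow an edge, then follow another edge from where it lands.
authors = [o for _, _, o in match(triples, s="NetworkBook", p="hasAuthor")]
author_classes = {a: [o for _, _, o in match(triples, s=a, p="memberOf")]
                  for a in authors}
print(author_classes)
```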

RDF Example, Visualized

Semantics through Ontologies

Schemas describe simple relationships between concepts

An ontology is like a very sophisticated class hierarchy over which queries may make inferences:

  Expresses basic concepts and their relationships

  In the Semantic Web, express constraints on concepts in a language called OWL:

    PetOwner(x) <=> Person(x) and CardinalityOf(Pet(x,y)) > 0

    DogOwner(x) <=> PetOwner(x) And Exists Pet(x,y), y isa Dog

    CatOwner(x) <=> PetOwner(x) And Exists Pet(x,y), y isa Cat

Can ask: what classes does a person with a dog and a cat belong to? Is Person(EricMiller) a DogOwner?
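The first question can be answered by applying the three definitions directly to a small set of facts. The sketch below hard-codes those rules in Python rather than using an OWL reasoner, and the individuals ("alice", "bob", "rex", "tom") are invented for illustration.

```python
def classes_of(x, persons, pets, kind):
    """Infer class membership for x from the PetOwner/DogOwner/CatOwner rules.

    persons: set of individuals known to be Person
    pets:    set of (owner, pet) pairs, i.e., the Pet(x, y) relation
    kind:    dict mapping each pet to its class ("Dog", "Cat", ...)
    """
    owned = [pet for owner, pet in pets if owner == x]
    result = set()
    if x in persons:
        result.add("Person")
        if owned:  # CardinalityOf(Pet(x, y)) > 0
            result.add("PetOwner")
            if any(kind.get(pet) == "Dog" for pet in owned):
                result.add("DogOwner")
            if any(kind.get(pet) == "Cat" for pet in owned):
                result.add("CatOwner")
    return result

# Hypothetical facts.
persons = {"alice", "bob"}
pets = {("alice", "rex"), ("alice", "tom")}
kind = {"rex": "Dog", "tom": "Cat"}
print(sorted(classes_of("alice", persons, pets, kind)))
print(sorted(classes_of("bob", persons, pets, kind)))
```

A person with both a dog and a cat lands in all four classes, while a person with no recorded pets is only a Person under these facts.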

Ontologies and the Semantic Web

The goal: start categorizing things into ontologies – map meanings of various entities

  Different ontologies can be defined, with different “namespaces”

  Can build many different topic-specific Semantic

The Semantic Web technologies have been fairly stable for a while…

  So why aren’t we seeing more implementations?

A Possible Pitfall of Today’s SW

Data integration teaches us that there’s a huge problem in mapping between different representations

  In database-land, these are schemas; in the Semantic Web, ontologies

  The Semantic Web doesn’t have good technologies for mapping – simple conversions, e.g., dollars to Euros, aren’t expressible

A middle ground:

  Can we extend ideas from data integration, like mappings between XML schemas, to get most of the Semantic Web’s benefits?

  Something we and others are pursuing – e.g., Hyperion @ Toronto, Orchestra @ Penn, …

Distributed, Web-scale systems are here to stay

They create many issues that are not totally resolved, and for which there is no one answer:

  Heterogeneity

  Timing

  Partitioning and replication

  Consistency and integrity

  Etc.

This course tried to give you a sense of the issues and state-of-the-art – as well as the skills to go out and work in this domain

  I hope the amount of work we all sank into the material (and the homeworks) will pay off for you!

  And stay tuned – there’s lots more to come!
