Date post: 30-Sep-2020
DataCutter Joel Saltz Alan Sussman Tahsin Kurc University of Maryland, College Park and Johns Hopkins Medical Institutions http://www.cs.umd.edu/projects/adr
Page 1

DataCutter

Joel Saltz
Alan Sussman
Tahsin Kurc

University of Maryland, College Park
and
Johns Hopkins Medical Institutions

http://www.cs.umd.edu/projects/adr

Page 2

DataCutter

• A suite of middleware for subsetting and filtering multi-dimensional datasets stored on archival storage systems
• Subsetting through range queries
  • a hyperbox defined in the multi-dimensional space underlying the dataset
  • items whose multi-dimensional coordinates fall into the box are retrieved
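The range-query idea above can be illustrated with a minimal C++ sketch. The names `HyperBox`, `contains`, and `selectInBox` are hypothetical and not part of DataCutter; the sketch scans items linearly and treats the box bounds as inclusive, which is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A hyperbox: per-dimension lower and upper bounds (assumed inclusive).
struct HyperBox {
    std::vector<double> min;
    std::vector<double> max;
};

// True when the point lies inside the box in every dimension.
bool contains(const HyperBox& box, const std::vector<double>& point) {
    assert(box.min.size() == box.max.size());
    assert(point.size() == box.min.size());
    for (std::size_t d = 0; d < point.size(); ++d) {
        if (point[d] < box.min[d] || point[d] > box.max[d]) return false;
    }
    return true;
}

// Subsetting: keep only the items whose coordinates fall into the query box.
std::vector<std::vector<double>> selectInBox(
        const HyperBox& query,
        const std::vector<std::vector<double>>& items) {
    std::vector<std::vector<double>> hits;
    for (const auto& item : items) {
        if (contains(query, item)) hits.push_back(item);
    }
    return hits;
}
```

For example, with a 2-D box from (0, 0) to (10, 5), the point (1, 1) is retrieved and the point (11, 1) is not.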

Page 3

DataCutter

• Restricted processing (filtering/aggregations) through Filters
  • to reduce the amount of data transferred to the client
  • filters can run anywhere, but are intended to run near the storage system (i.e., over a local area network)
  • based on the filter-stream programming model -- to optimize use of limited resources, such as memory and disk space

Page 4

[Figure: DataCutter architecture. Clients issue a range query to DataCutter and receive segment info and segment data. Inside DataCutter are the Client Interface Service, the Indexing Service, the Filtering Service (hosting the filters), and the Data Access Service. Segments, each identified by (File, Offset, Size), reside on the archival storage systems.]

Page 5

DataCutter Architecture

• Client Interface Service
  • Manages client connections and client requests
  • Manages data and information flow between different services
• Indexing Service
  • Two-level hierarchical indexing -- summary and detailed index files
  • Customizable
    • Default R-tree index
    • User can add new indexing methods

Page 6

DataCutter Architecture

• Filtering Service
  • Manages filters (registered in the system)
  • Users can add/run new filters
• Data Access Service
  • Manages storage/retrieval of data from the tertiary storage
  • Low level system dependent I/O operations

Page 7

DataCutter -- Subsetting

• Datasets are partitioned into segments
  • used to index the dataset, unit of retrieval
• Indexing very large datasets
  • Multi-level hierarchical indexing scheme
  • Summary index files -- to index a group of segments or detailed index files
  • Detailed index files -- to index the segments
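The two-level scheme can be sketched as follows. This is an illustrative in-memory model only: the type names are hypothetical, each level is scanned linearly rather than with the default R-tree, and the on-disk layout of the real index files is not described in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Box {
    std::vector<double> min, max;
};

// Two boxes intersect when their extents overlap in every dimension.
bool intersects(const Box& a, const Box& b) {
    assert(a.min.size() == b.min.size());
    for (std::size_t d = 0; d < a.min.size(); ++d) {
        if (a.max[d] < b.min[d] || b.max[d] < a.min[d]) return false;
    }
    return true;
}

// A segment is the unit of retrieval: a byte range within a file.
struct Segment {
    Box mbr;
    std::string file;
    unsigned offset;
    unsigned size;
};

// Detailed index: indexes individual segments.
struct DetailedIndex {
    std::vector<Segment> segments;
};

// Summary index entry: one bounding box covering a whole detailed index.
struct SummaryEntry {
    Box mbr;
    DetailedIndex detail;
};

// Consult the summary level first and open a detailed index only when its
// bounding box intersects the query.
std::vector<Segment> searchIndex(const std::vector<SummaryEntry>& summary,
                                 const Box& query) {
    std::vector<Segment> hits;
    for (const auto& entry : summary) {
        if (!intersects(entry.mbr, query)) continue;  // prune the whole group
        for (const auto& seg : entry.detail.segments) {
            if (intersects(seg.mbr, query)) hits.push_back(seg);
        }
    }
    return hits;
}
```

The point of the summary level is that a query touching a small region never reads the detailed index files of unrelated groups of segments.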

Page 8

DataCutter -- Filters

• Filters
  • Specialized user program to process data (segments) before returning them to the client
• Filter-stream programming model
  • Originally developed for Active Disks environment (Acharya, Uysal, and Saltz)
  • Based on stream abstraction
    • A stream denotes a supply of data
    • Streams deliver data in fixed size buffers
    • Communication of a filter with its environment is restricted to its input and output streams
  • init, process, finalize interface
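A stream that delivers data in fixed size buffers might look like the sketch below. The `Stream` class and its `read` method are hypothetical stand-ins, not the DataCutter stream API; the sketch assumes the final buffer may be shorter than the others.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Buffer = std::vector<unsigned char>;

// Hypothetical stream: hands out the underlying data one fixed size buffer
// at a time, so a consumer never needs more than one buffer in memory.
class Stream {
public:
    Stream(Buffer data, std::size_t bufferSize)
        : data_(std::move(data)), bufferSize_(bufferSize) {
        assert(bufferSize_ > 0);
    }

    // Fills `out` with the next buffer; returns false at end of stream.
    bool read(Buffer& out) {
        if (pos_ >= data_.size()) return false;
        std::size_t n = std::min(bufferSize_, data_.size() - pos_);
        auto first = data_.begin() + static_cast<std::ptrdiff_t>(pos_);
        out.assign(first, first + static_cast<std::ptrdiff_t>(n));
        pos_ += n;
        return true;
    }

private:
    Buffer data_;
    std::size_t bufferSize_;
    std::size_t pos_ = 0;
};
```

Bounding the buffer size is what lets a scheduler reason about a filter's memory needs before it runs.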

Page 9

A Motivating Scenario

Sample Application:
• generate 3D reconstructed view from new set of sensor readings
• compare features with reference db

Grid Configuration:
• remote data server - reference db
• sensor host - large raw readings
• parallel computation farm available
• 3D reconstruction computationally intensive

[Figure: a WAN connecting the Sensor (raw dataset, sensor readings), the Computation Farm, the Client PC, and the Data Server (reference DB, feature list). Each host is marked with a "?".]

Page 10

A Motivating Scenario (2)

Application:
// process relevant raw readings
// generate 3D view
// compute features of 3D view
// find similar features in reference db
// display new view and similar cases

[Figure: the application decomposed into four filters (Extract raw, Extract ref, 3D reconstruction, View result) reading from the Raw Dataset and the Reference DB. Across the WAN, Extract raw is placed at the Sensor (raw dataset, sensor readings), Extract ref at the Data Server (reference DB, feature list), 3D reconstruction at the Computation Farm, and View result at the Client PC.]

Page 11

Filters

• Filters
  • communicate with other filters only using streams
  • cannot change stream endpoints
  • are allowed to pre-disclose dynamic allocation of memory/scratch space in init phase, before processing phase
• Advantages
  • location independence
  • easier scheduling of resources
  • filter stop and restart is defined explicitly in model

Page 12

Placement

• The dynamic assignment of filters to particular hosts for execution is placement (mapping)
• Optimization criteria:
  • Communication
    • leverage filter affinity to dataset
    • minimize communication volume on slower connections
    • co-locate filters with large communication volume
  • Computation
    • expensive computation on faster, less loaded hosts
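The communication criteria can be made concrete with a small cost model. Everything in this sketch is an assumption for illustration: the cost is stream volume divided by link bandwidth, co-located filters communicate for free, dataset affinity is modelled by pinning a filter to a host, and the search is exhaustive. The prototype described in these slides uses manual placement, so this is not the DataCutter algorithm, and the computation criterion is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct StreamEdge {
    std::size_t from, to;  // filter indices
    double volume;         // data volume carried by the stream
};

// Transfer cost of one placement; co-located filters communicate for free.
// bandwidth[a][b] is the assumed link speed between hosts a and b.
double commCost(const std::vector<StreamEdge>& streams,
                const std::vector<std::size_t>& placement,
                const std::vector<std::vector<double>>& bandwidth) {
    double cost = 0.0;
    for (const auto& s : streams) {
        std::size_t a = placement[s.from], b = placement[s.to];
        if (a != b) cost += s.volume / bandwidth[a][b];
    }
    return cost;
}

// Tries every filter-to-host assignment (fine for a handful of filters).
// pinned[i] >= 0 forces filter i onto that host, e.g. next to its dataset.
std::vector<std::size_t> bestPlacement(
        std::size_t numFilters, std::size_t numHosts,
        const std::vector<StreamEdge>& streams,
        const std::vector<int>& pinned,
        const std::vector<std::vector<double>>& bandwidth) {
    assert(numHosts > 0 && pinned.size() == numFilters);
    std::vector<std::size_t> current(numFilters, 0), best;
    double bestCost = std::numeric_limits<double>::infinity();
    while (true) {
        bool allowed = true;
        for (std::size_t f = 0; f < numFilters; ++f) {
            if (pinned[f] >= 0 &&
                current[f] != static_cast<std::size_t>(pinned[f])) {
                allowed = false;
            }
        }
        if (allowed) {
            double c = commCost(streams, current, bandwidth);
            if (c < bestCost) { bestCost = c; best = current; }
        }
        // Advance `current` like an odometer in base numHosts.
        std::size_t i = 0;
        while (i < numFilters && ++current[i] == numHosts) current[i++] = 0;
        if (i == numFilters) break;
    }
    return best;
}
```

With a reader pinned to the storage host, a viewer pinned to the client, and a data-reducing filter in between, the model places the reducing filter next to the reader, so only the small stream crosses the slow link.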

Page 13

Restructuring Process

[Figure: an application and a target configuration pass through three steps: Decompose (yielding some set of filters, f1 to f5), Placement / Schedule, and Execute Application.]

Page 14

Software Infrastructure

• Prototype implementation of filter framework
  • C++ language binding
  • manual placement
  • wide-area execution service
  • one thread for each instantiated filter

Page 15

Filter Framework

class MyFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]) { … }
    int process(stream_t st) { … }
    int finalize(void) { … }
};
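To show how the three callbacks fit together, here is a self-contained sketch. The slides do not show `AS_Filter_Base` or `stream_t`, so both are replaced by minimal stand-ins defined in the block itself; `StreamState`, `ThresholdFilter`, and `runFilter` are hypothetical names, and the convention that a non-zero return signals failure is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Stand-in stream state: queued input buffers plus one output buffer.
// The real stream_t is not shown in the slides; a plain pointer is assumed.
struct StreamState {
    std::vector<std::vector<int>> in;
    std::vector<int> out;
};
using stream_t = StreamState *;

// Stand-in for the framework base class, reduced to the three callbacks.
class AS_Filter_Base {
public:
    virtual ~AS_Filter_Base() = default;
    virtual int init(int argc, char *argv[]) = 0;
    virtual int process(stream_t st) = 0;
    virtual int finalize(void) = 0;
};

// Example filter: forwards only the values at or above a threshold.
class ThresholdFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]) override {
        if (argc < 1) return -1;  // the threshold argument is required
        threshold_ = std::atoi(argv[0]);
        return 0;
    }
    int process(stream_t st) override {
        for (const auto &buf : st->in)
            for (int v : buf)
                if (v >= threshold_) st->out.push_back(v);
        return 0;
    }
    int finalize(void) override { return 0; }

private:
    int threshold_ = 0;
};

// Drives one filter through its three phases, stopping at the first failure.
int runFilter(AS_Filter_Base &f, int argc, char *argv[], stream_t st) {
    assert(st != nullptr);
    if (int rc = f.init(argc, argv)) return rc;
    if (int rc = f.process(st)) return rc;
    return f.finalize();
}
```

Because the filter touches nothing but its argument list and its stream, the same object could be started on any host, which is the location independence the model aims for.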

Page 16

Filter Connectivity / Placement

[filter.A]
outs = stream1 stream3
[filter.B]
ins = stream1
outs = stream2
[filter.C]
ins = stream2 stream3

[Figure: filter A feeds filter B over stream1 and filter C over stream3; filter B feeds filter C over stream2.]

[placement]
A = host1.cs.umd.edu
B = host2.cs.umd.edu
C = host3.cs.umd.edu
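A specification in this style can be read with a few lines of C++. The sketch below assumes one `[section]` or `key = value` entry per line, which is a guess at the file layout since the slide text is flattened; `AppSpec`, `parseSpec`, and `hostOf` are hypothetical names, not part of the DataCutter framework.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct FilterSpec {
    std::vector<std::string> ins, outs;  // stream names
};

struct AppSpec {
    std::map<std::string, FilterSpec> filters;
    std::map<std::string, std::string> placement;  // filter name -> host
};

std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

AppSpec parseSpec(const std::string& text) {
    AppSpec spec;
    std::istringstream lines(text);
    std::string line, section;
    while (std::getline(lines, line)) {
        line = trim(line);
        if (line.empty()) continue;
        if (line.front() == '[' && line.back() == ']') {
            section = line.substr(1, line.size() - 2);
            continue;
        }
        std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;  // ignore malformed lines
        std::string key = trim(line.substr(0, eq));
        std::string value = trim(line.substr(eq + 1));
        if (section == "placement") {
            spec.placement[key] = value;
        } else if (section.rfind("filter.", 0) == 0) {
            FilterSpec& f = spec.filters[section.substr(7)];
            std::istringstream names(value);
            std::string name;
            while (names >> name) {
                if (key == "ins") f.ins.push_back(name);
                else if (key == "outs") f.outs.push_back(name);
            }
        }
    }
    return spec;
}

// Looks up a filter's host; the filter must appear in [placement].
const std::string& hostOf(const AppSpec& spec, const std::string& filter) {
    auto it = spec.placement.find(filter);
    assert(it != spec.placement.end());
    return it->second;
}
```

Keeping connectivity and placement in separate sections means the same filter graph can be moved to other hosts by editing only the `[placement]` entries.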

Page 17

Execution Service

[Figure: the application console (linked with the filter lib) performs three steps: 1. Read the filter/stream specs and placement; 2. Query the Directory Daemon at dir.cs.umd.edu:6000, which keeps a directory of (name, host, port) entries; 3. Exec. On each of host1.cs.umd.edu, host2.cs.umd.edu, and host3.cs.umd.edu, an AppExec Daemon execs the application with its filter lib, running filter A, filter B, and filter C respectively.]

Page 18

Related Work

[Figure: related systems arranged by level (Application Level, Programming Models, Infrastructure Services, Resource Level) and by resource type (Grid available resources, User specified resources, Idle resources). Systems shown: Globus, Legion, Client/Server Sockets, Condor Pool, Java RMI, DCOM, CORBA, NetSolve, Ninf, AppLeS, HPC++, NWS, DataCutter, Harmony, DSM, MPI, RPC, DPSS, SRB.]

Page 19

Integrating DataCutter with the Storage Resource Broker

Page 20

Storage Resource Broker (SRB)

• Middleware between clients and storage resources
• Remote access to storage resources
  • Various types:
    • File systems - UNIX, HPSS, UniTree, DPSS (LBL)
    • DB large objects - Oracle, DB2, Illustra
• Uniform client interface (API)

Page 21

Storage Resource Broker (SRB)

• MCAT - MetaData Catalog
  • Datasets (files) and Collections (directories) - inodes and more
  • Storage resources
  • User information - authentication, access privileges, etc.
• Software package
  • Server, client library, UNIX-like utilities, Java GUI
  • Platforms - Solaris, Sun OS, Digital Unix, SGI Irix, Cray T90

Page 22

SRB/DataCutter - Prototype Implementation

• Support for range queries
  • Creation of indices over data sets (composed set of data files)
  • Subsetting of data sets
    • Search for files or portions of files that intersect a given range query
• Restricted filter operations on portions of files (data segments) before returning them to the client (to perform filtering or aggregation to reduce data volume)

Page 23

SRB/DataCutter System

[Figure: an application (SRB client) uses the SRB I/O and MCAT API, addressing data by File SID, DBLobjID, ObjSID, or range query. The Storage Resource Broker (SRB) contains MCAT (resource, user, and application meta-data) and DataCutter (Indexing Service and Filtering Service with its filters). Beneath the SRB are the distributed storage resources: DB2, Oracle, Illustra, ObjectStore; HPSS, UniTree; UNIX, ftp.]

Page 24

SRB/DataCutter Client Interface

• Creating and Deleting Index

int sfoCreateIndex(srbConn *conn, sfoClass class, int catType,
                   char *inIndexName, char *outIndexName,
                   char *resourceName)

int sfoDeleteIndex(srbConn *conn, sfoClass class, int catType,
                   char *indexName)

Page 25

SRB/DataCutter Client Interface

• Searching Index -- R-tree index

typedef struct {
    int dim;      /* bounding box dimensions */
    double *min;  /* minimum in each dimension */
    double *max;  /* maximum in each dimension */
} sfoMBR;         /* Bounding box structure */

typedef struct {
    sfoMBR segmentMBR;     /* bounding box of the segment */
    char *objID;           /* object in SRB that contains the segment */
    char *collectionName;  /* collection where object is stored */
    unsigned int offset;   /* offset of the segment in the object */
    unsigned int size;     /* size of segment */
} segmentInfo;             /* segment meta-data information */

typedef struct {
    int segmentCount;       /* number of segments returned */
    segmentInfo *segments;  /* segment meta-data information */
    int continueIndex;      /* continuation flag */
} indexSearchResult;        /* search result structure */

Page 26

SRB/DataCutter Client Interface

• Searching Index -- R-tree index

int sfoSearchIndex(srbConn *conn, sfoClass class, char *indexName,
                   void *query, indexSearchResult *myresult,
                   int maxSegCount)

typedef struct {
    int dim;
    double *min, *max;
} rangeQuery;

int sfoGetMoreSearchResult(srbConn *conn, int continueIndex,
                           indexSearchResult *myresult, int maxSegCount)
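The `continueIndex` field and the `maxSegCount` argument suggest that results are fetched in pages. The sketch below shows one way a client might drain all pages. It does not call the SRB library directly: the two calls are passed in as callables so the loop can be exercised without a server, and the structures are trimmed copies holding only the fields the loop reads. Two things are assumptions not stated in the slides: that a negative return value signals an error, and that a negative `continueIndex` means no further results. Ownership of the `segments` array is also unspecified, so the sketch copies entries out and frees nothing.

```cpp
#include <cassert>
#include <vector>

// Trimmed copies of the structures above (only the fields the loop needs).
struct segmentInfo {
    unsigned int offset;
    unsigned int size;
};
struct indexSearchResult {
    int segmentCount;
    segmentInfo *segments;
    int continueIndex;
};

// `first(result, maxSegCount)` is expected to wrap sfoSearchIndex, and
// `more(continueIndex, result, maxSegCount)` to wrap sfoGetMoreSearchResult.
// ASSUMPTIONS: a negative status is an error, and a negative continueIndex
// means the result set is exhausted.
template <class First, class More>
int collectAllSegments(First first, More more, int maxSegCount,
                       std::vector<segmentInfo> &out) {
    assert(maxSegCount > 0);
    indexSearchResult result{0, nullptr, -1};
    int status = first(&result, maxSegCount);
    while (status >= 0) {
        for (int i = 0; i < result.segmentCount; ++i)
            out.push_back(result.segments[i]);
        if (result.continueIndex < 0) return 0;  // no more pages
        status = more(result.continueIndex, &result, maxSegCount);
    }
    return status;  // propagate the error code
}
```

Capping each reply at `maxSegCount` segments keeps the client's buffer bounded even when a range query matches a very large part of the dataset.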

Page 27

Applying Filters

typedef struct {
    segmentInfo segInfo;  /* info on segment data buffer after filter oper. */
    char *segment;        /* segment data buffer after filter is applied */
} segmentData;

typedef struct {
    int segmentDataCount;   /* #segments in segmentData array */
    segmentData *segments;  /* segmentData array */
    int continueIndex;      /* continuation flag */
} filterDataResult;

Page 28

Applying Filters

int sfoApplyFilter(srbConn *conn, sfoClass class, char *hostName,
                   int filterID, char *filterArg,
                   int numOfInputSegments, segmentInfo *inputSegments,
                   filterDataResult *myresult, int maxSegCount)

int sfoGetMoreFilterResult(srbConn *conn, int continueIndex,
                           filterDataResult *myresult, int maxSegCount)

Page 29

Application: Virtual Microscope

• Interactive software emulation of high power light microscope for processing/visualizing image datasets
• 3-D Image Dataset (100MB to 5GB per focal plane)
• Client-server system organization
• Rectangular region queries, multiple data chunk reply
• pipeline style processing

[Figure: filter pipeline with stages read_data, decompress, clip, zoom, view]

Page 30

Virtual Microscope Client

Page 31

VM Application using SRB/DataCutter

[Figure: clients connect over a wide area network to SRB/DataCutter, which runs with its Indexing service on a distributed collection of workstations on a local area network, in front of the distributed storage resources. The read, decompress, clip, zoom, and view filters are shown in two alternative arrangements across the workstations and the clients, with view at the client in both.]

• read: read image chunks
• decompress: convert jpeg image chunks into RGB pixels
• clip: clip image to query boundaries
• zoom: sub-sample to the required magnification
• view: stitch image pieces together and display image
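Two of the pipeline stages are simple enough to sketch directly. The code below illustrates clip and zoom on an already decompressed chunk; it is not the Virtual Microscope implementation. The `Image` type is hypothetical, pixels are single integers rather than RGB triples for brevity, and zoom is assumed to keep every `factor`-th pixel. The read, decompress, and view stages are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A decompressed image chunk: row-major pixels, one value per pixel.
struct Image {
    std::size_t width = 0, height = 0;
    std::vector<int> pixels;
    int at(std::size_t x, std::size_t y) const { return pixels[y * width + x]; }
};

// clip: keep only the pixels inside the query rectangle [x0, x1) x [y0, y1).
Image clip(const Image &in, std::size_t x0, std::size_t y0,
           std::size_t x1, std::size_t y1) {
    assert(x0 <= x1 && x1 <= in.width && y0 <= y1 && y1 <= in.height);
    Image out;
    out.width = x1 - x0;
    out.height = y1 - y0;
    for (std::size_t y = y0; y < y1; ++y)
        for (std::size_t x = x0; x < x1; ++x)
            out.pixels.push_back(in.at(x, y));
    return out;
}

// zoom: sub-sample by keeping every `factor`-th pixel in each direction.
Image zoom(const Image &in, std::size_t factor) {
    assert(factor > 0);
    Image out;
    out.width = (in.width + factor - 1) / factor;
    out.height = (in.height + factor - 1) / factor;
    for (std::size_t y = 0; y < in.height; y += factor)
        for (std::size_t x = 0; x < in.width; x += factor)
            out.pixels.push_back(in.at(x, y));
    return out;
}
```

Both stages shrink the data they pass on, which is why running them close to the storage system reduces what has to cross the wide area network to the client.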

Distributed Storage Resources

