Survey of Emerging IT Trends and Technologies
Chaitan Baru
Monday, 10th Aug
OUTLINE
• Trends in data sharing
– And discovery/search
• Trends in service-oriented architectures
• Trends in computing and data infrastructure
• The road ahead
Geoinformatics Use Cases
• “…a user has access from a terminal to vast stores of data of almost any kind, with the easy ability to visualize, analyze and model those data.”
• “For a given region (i.e. lat/long extent, plus depth), return a 3D structural model with accompanying geophysical parameters and geologic information, at a specified resolution”
Implied IT Requirements
• Search and discovery of resources
• Integration of heterogeneous 3D / 4D Earth Science data
• Integration of data with tools
• Analysis and Visualization
– Ability to feed data to tools, and analyze & visualize model outputs
• (data-centric view…)
Search and Discovery
• Searching “structured data”, i.e. metadata catalogs
[Figure: search over structured metadata catalogs]
Search and Discovery
• Searching “unstructured data”, i.e. the Web
• Structured databases are a major component of the “Deep Web”
[Figure: search over the Web]
Combined Search and Discovery
[Figure: search over both the Web and structured metadata catalogs]
Advanced Search
• Proposed:
– Geoscience Knowledge System, GeoKnowSys
– Built using the Yahoo Build your Own Search Service (BOSS)
• E.g., see wolframalpha.com
Advanced Search: PaleoLit
• Research project at Dept of CS, CMU
– Dr. Judith Gelernter and Prof. Jaime Carbonell
• Use ontologies to match search requests to related publications
• Demo…
Informatics Issues: The Informatics Progression
[Figure: the informatics progression, with stages labeled IT, Cyber Infrastructure, Cyber Informatics, Core Informatics, Science Informatics (aka Xinformatics), and Science / SBAs]
Courtesy: Prof. Peter Fox, RPI, CSIG’08
The Computer Science / Domain Science continuum
• Computer Science Topics: e.g. database systems, semistructured data
• IT Standards: e.g. ODBC, XML
• Geoinformatics Standards: e.g. ontologies, GeoSciML
• Domain Standards: e.g. domain vocabularies, definitions (geologic time, rock description, …)
• Domain Science Topics: e.g. geology
The data interoperability onion
Layers: Social Networks, Semantics, Syntax, Systems
• System Interop
– Approaches: e.g., ODBC, JDBC, Java, Web services, …
– Purview of: Computer Science
• Syntactic
– Approaches: Schema standards
– Purview of: Standards organizations, domain science repositories, data archives
• Semantic
– Approaches: Controlled vocabularies, thesauri, domain ontologies
– Purview of: Domain scientists
• Social Networks
– Approaches: Recommendation systems
– Purview of: Social networking software (CS and domain science, data driven)
Software interoperability onion
Layers: Social Networks, Semantics, Syntax, Systems
• System Interop
– Approaches: e.g., REST, Web services
• Syntactic
– Approaches: e.g., SOAP, WSDL
• Semantic
– Approaches: Controlled vocabularies, thesauri, domain ontologies
– Purview of: Domain scientists
• Social Networks
– Approaches: Recommendation systems
– Purview of: Social networking software
• Service orchestration via workflow systems
Geologic Map Integration
Data Mediation
• Dealing with heterogeneities in (distributed) data sources
– Data may be in different “administrative domains”, so authentication must be managed
– Data schemas may differ among sources
– Terminologies may differ among sources
– Terminologies may differ between the sources and the user
– Software infrastructure (“stack”) may differ
• Solve the problem with “middleware”
– Layers of software between the original application and the end user
• Mediator
– Middleware that bridges across heterogeneities without requiring sources to change
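[Figure: state geologic maps for AZ, NM, CO, UT, NV, ID, MT, and WY, held in different formats and systems: Shapefile (ESRI), PostGIS, Oracle, DB2, SRB, and GML, on Windows, Linux, and iMac]
Heterogeneities
• Operating system
• File storage
• Database schemas
• Data semantics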
A Data Integration Example: Geologic Maps
[Figure: attribute names differ across the state maps, e.g. FORMATION, UNIT_NAME, ROCK_TYPE, ERA, SYSTEM, SERIES, LITH, PERIOD; each state map (AZ, NM, CO, UT, NV, ID, MT, WY) is served through WMS]
Advantages
• Integrated presentation
• Uniform syntactical structure
• Uniform spatial definition
Problems
• Each resource may use a different schema
• Difficult to build a uniform query interface for multiple resources
Adopting WMS/WFS: Can provide Syntactic Integration
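The syntactic integration that WMS offers can be sketched as follows: every compliant server accepts the same GetMap parameters, so one function can build requests for all state maps. This is a minimal sketch, assuming WMS 1.1.1 key-value requests; the endpoint URLs and the layer name `geology` are hypothetical and not taken from the slides.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layers, bbox, width=800, height=600,
                   srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL from the standard request parameters."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoints, one per state survey (assumption for illustration).
servers = {
    "NV": "http://example.org/nv/wms",
    "UT": "http://example.org/ut/wms",
}
bbox = (-120.0, 35.0, -109.0, 42.0)
urls = {state: wms_getmap_url(url, ["geology"], bbox)
        for state, url in servers.items()}
```

The request syntax is uniform, but the attribute schema behind each layer can still differ, which is the problem the next slide addresses.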
GeoSciML: Can Provide Schema Integration
[Figure: each state map (AZ, NM, CO, UT, NV, ID, MT, WY) is served as GeoSciML]
Advantages
• Integrated schema
• Partially integrated semantics
Problem
• Each resource may use a different vocabulary and semantic model
Semantic Mediation with GeoSciML
[Figure: GeoSciML data from NM and CO, using the British Rock Classification and a Multi-hierarchical Rock Classification, are linked to an application ontology through semantic mappings]
Mappings may also be needed between the data and the application ontology
E.g., say, mapping 240 mya to Mesozoic
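A mapping of this kind can be sketched as a lookup from a numeric age to a named era, plus a lookup from period names to eras. This is a minimal sketch: the era boundaries are rounded, approximate values in millions of years ago, and the table and function names are assumptions for illustration, not part of GEON.

```python
# Approximate era boundaries in millions of years ago (Ma); rounded values
# used for illustration only.
ERAS = [
    ("Cenozoic", 0.0, 66.0),
    ("Mesozoic", 66.0, 252.0),
    ("Paleozoic", 252.0, 541.0),
]

# A few period names and the era each belongs to.
PERIOD_TO_ERA = {
    "Cambrian": "Paleozoic",
    "Devonian": "Paleozoic",
    "Permian": "Paleozoic",
    "Triassic": "Mesozoic",
    "Jurassic": "Mesozoic",
}


def era_for_age(age_ma):
    """Return the era containing a numeric age, or None if out of range."""
    for name, youngest, oldest in ERAS:
        if youngest <= age_ma < oldest:
            return name
    return None


def era_for_label(label):
    """Return the era for a label that is either an era or a period name."""
    if any(label == name for name, _, _ in ERAS):
        return label
    return PERIOD_TO_ERA.get(label)


def formations_in_era(formations, era):
    """Select formations whose age label maps to the requested era."""
    return [name for name, label in formations if era_for_label(label) == era]


formations = [("F1", "Devonian"), ("F2", "Jurassic"), ("F3", "Paleozoic")]
paleozoic = formations_in_era(formations, "Paleozoic")
```

A plain string match on `Paleozoic` would return only `F3`; with the mapping, `F1` is returned as well.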
Query Rewriting
Example: A Rock Classification Ontology
[Figure: ontology with branches for Composition, Genesis, Fabric, and Texture]
Query: Concept Expansion
• Concept expansion: what else to look for when the user asks for ‘Mafic’
Query: Concept Generalization
• Generalization: finding data that are ‘like’ X and Y
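Both rewriting steps can be sketched over a small taxonomy. This is a minimal sketch: the child-to-parent table holds only a handful of illustrative rock terms and is not the ontology used in GEON. Expansion collects every descendant of a concept, and generalization finds the most specific concept that covers two terms.

```python
# Child -> parent links for a tiny, illustrative composition taxonomy.
PARENT = {
    "Mafic": "Composition",
    "Felsic": "Composition",
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Granite": "Felsic",
    "Rhyolite": "Felsic",
}


def expand(concept):
    """Return the concept together with all of its descendants."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def ancestors(concept):
    """Return the chain from the concept up to the root, concept first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalize(a, b):
    """Return the most specific concept that covers both a and b."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in seen:
            return candidate
    return None
```

A query for `Mafic` would then be rewritten to search for every term in `expand("Mafic")`, and a request for data like `Basalt` and `Gabbro` would be widened to `generalize("Basalt", "Gabbro")`.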
Ontology-based Geologic Map Integration: Implemented in GEON
[Figure: Nevada map, “Show formations where AGE = ‘Paleozoic’” (without age ontology)]
[Figure: Nevada map, “Show formations where AGE = ‘Paleozoic’” (with age ontology), +/- a few hundred million years]
[Figure labels: domain knowledge, knowledge representation, Geologic Age ontology]
<odal:NamedIndividuals odal:id="RockSample" odal:database="VTDatabase">
  <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
  <odal:Table>Samples</odal:Table>
  <odal:Table>RockTexture</odal:Table>
  <odal:Table>RockGeoChemistry</odal:Table>
  <odal:Table>ModalData</odal:Table>
  <odal:Table>MineralChemistry</odal:Table>
  <odal:Table>Images</odal:Table>
  <odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>
[Figure: a GUI generates the ODAL annotation, which goes to the ODAL processor]
The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry, and Images represent instances of RockSample.
• ODAL: Ontological Database Annotation Language
• Creates a partial model of ontologies from a database
ODAL, SOQL, and Data Integration Carts™
SOQL: Simple Ontology Query Language
Query single or many resources
• via ontologies (i.e., high-level logical views)
• independent of physical representation (i.e., schemas)
[Figure: ontology fragment. RockSample has a location (Location, with float lat and long) and hasSiO2 (ValueWithUnit, with a float value and a string unit)]
SELECT X.location.*;
FROM RockSample X
WHERE X.location.lat > 60
AND X.location.long > 100
AND X.hasSiO2.value < 30
AND X.hasSiO2.unit = ‘weightPercentage’
[Figure: a GUI generates the SOQL query, which goes to the SOQL processor]
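The meaning of the SOQL query above can be sketched in plain Python: the classes mirror the ontology fragment (RockSample, Location, ValueWithUnit), and the filter mirrors the WHERE clause. This is a minimal sketch, not the SOQL processor; the sample values and the unit label are assumptions for illustration.

```python
from dataclasses import dataclass

# Unit label used in the filter; the exact spelling is an assumption.
WEIGHT_PERCENT = "weightPercentage"


@dataclass
class Location:
    lat: float
    long: float


@dataclass
class ValueWithUnit:
    value: float
    unit: str


@dataclass
class RockSample:
    location: Location
    hasSiO2: ValueWithUnit


def select_locations(samples):
    """Return locations of samples that satisfy the query's WHERE clause."""
    return [
        s.location
        for s in samples
        if s.location.lat > 60
        and s.location.long > 100
        and s.hasSiO2.value < 30
        and s.hasSiO2.unit == WEIGHT_PERCENT
    ]


samples = [
    RockSample(Location(65.0, 120.0), ValueWithUnit(25.0, WEIGHT_PERCENT)),
    RockSample(Location(10.0, 120.0), ValueWithUnit(25.0, WEIGHT_PERCENT)),
    RockSample(Location(70.0, 130.0), ValueWithUnit(45.0, WEIGHT_PERCENT)),
]
hits = select_locations(samples)
```

The query is phrased against the ontology classes only, so it does not depend on which tables or columns hold the values in any one database.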
Issues in sharing data: Primary vs secondary (derived)
[Figure: workflow from Collect Data to Process and Visualize to Share Results, with options to share data and to share intermediate results]
Sources of Data
• Distributed data collections
– By individual PIs
– “Informal” sharing, e.g. via social network
– “Formal” sharing, e.g. via submission to community data archives / databases
• Centralized data collections
– E.g. via a large project (standardized protocols)
– By agencies (internal protocols)
• Metadata to the rescue
– Data description standards
– Process description standards (workflows)
• State Surveys and USGS are major sources
Major Interoperability Efforts
• OneGeology.org
– International initiative of geological surveys to make dynamic geological map data available via the web
• US Geoscience Information Network (US GIN)
– Led by Lee Allison, AZGS
Federating Metadata Catalogs
• Local vs Community “View”
– Individual data providers may choose to “export” a community view
• Direct access to the source may still provide “richer” access to data
• Federated Catalogs
– The Geosciences Information Network (GIN) approach
– Adopt standards for catalog content (ISO) and implementation (CSW)
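Because every catalog in the federation implements the same CSW interface, one request builder can address all of them. This is a minimal sketch, assuming CSW 2.0.2 key-value GetRecords requests; the catalog names and endpoint URLs are hypothetical.

```python
from urllib.parse import urlencode


def csw_getrecords_url(base_url, max_records=10):
    """Build a CSW 2.0.2 GetRecords request using key-value parameters."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",
        "elementSetName": "summary",
        "maxRecords": max_records,
    }
    return base_url + "?" + urlencode(params)


def federated_requests(catalogs, max_records=10):
    """Return one GetRecords URL per member catalog."""
    return {name: csw_getrecords_url(url, max_records)
            for name, url in catalogs.items()}


# Hypothetical member catalogs (assumption for illustration).
catalogs = {
    "survey_a": "http://example.org/a/csw",
    "survey_b": "http://example.org/b/csw",
}
requests_by_catalog = federated_requests(catalogs, max_records=5)
```

A federating service would send each request, then combine the returned records into the community view.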
Interoperation between GEON and GEO GRID
• Implement CSW interfaces
– Collaboration with the NSF PRAGMA project (Pacific Rim Applications and Grid Middleware Assembly)
[Figure: GEON and GEO Grid interoperation. Labels: GEO Grid catalog (600 scenes/day, storage), GEON catalog, SRB, ADN, Catalog Service for the Web (CSW) and CSW adapter, WMS servers and WMS URLs, and a CSW composite service exchanging CSW requests and responses with both sides]
Integration & Visualization of 3D/4D data
– Derived 3D volumetric model
– Multiple isosurfaces with different transparencies
– Slices through the volume
– Variable gridding: data typically has lower resolution at greater depths
– 2D surface data: topography (“2.5D”), satellite imagery, street maps, geologic maps, fault lines, and other derived features
– Bore hole or well data and point observations
“For a given region (i.e. lat/long extent, plus depth), return a 3D structural model with accompanying physical parameters of density, seismic velocities, geochemistry, and geologic ages, using a cell size of 10km”
OpenEarth Framework Goals
Geoscience Integration:
• Data types - topography, imagery, bore hole samples, velocity models from seismic tomography, gravity measurements, simulation results…
• Data coordinate spaces and dimensionality - 2D and 3D spatial representations and 4D that covers the range of geologic processes (EQ cycle to deep time).
OpenEarth Framework Goals
Structural Integration:
• Data formats – shapefiles, NetCDF, GeoTIFF, and other formal and de facto standards.
• Data models - 2D and 3D geometry to semantically richer models of features and relationships between those features.
• Data delivery methods & Storage Schemes- local files to database queries, web services (WMS, WFS) and services for new data types (large tomographic volumes, etc.).
OEF Philosophy
• OEF focused on integrating data spanning the geosciences.
• Open software architecture and corresponding software that can properly access, manipulate and visualize the integrated data.
• Open source to provide the necessary flexibility for academic research and to provide a flexible test bed for new data models and visualization ideas.
OEF Architecture
OEF Architecture
Data Integration Services:
– Designed to support rapid visualization of integrated datasets
– Operations to grid data, resample it at multiple resolutions, and subdivide data to better support progressive changes to the display as the user pans and zooms
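Resampling at multiple resolutions can be sketched as a pyramid of grids, each level halving the resolution of the one before, so a viewer can show a coarse level first and refine as the user zooms in. This is a minimal sketch using 2x2 block averaging on a small 2D grid; it is an illustration of the idea and not the OEF implementation.

```python
def downsample(grid):
    """Halve resolution by averaging each 2x2 block (assumes even dimensions)."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def build_pyramid(grid, levels):
    """Return [full resolution, half, quarter, ...] with `levels` coarser grids."""
    pyramid = [grid]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


# A 4x4 grid with values 0..15, reduced to 2x2 and then 1x1.
grid = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pyramid = build_pyramid(grid, 2)
```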
OEF Architecture
Visualization Tools:
– Run on the user's computer and dynamically query spatial and temporal data from the OEF services
– Use 3D graphics hardware for fast display
– Open architecture supports multiple visualization tools authored throughout the community (e.g. GEON IDV)
– New viz capabilities developed as necessary
OEF Visualization
The software services stack
Example: GEON
Pushing down the service interface
[Figure: compute nodes and disk storage]
Software as a Service: At different levels of software
• Software as a Service, SaaS
– E.g., Google Apps, Salesforce.com, SAP, …
• Infrastructure as a Service, IaaS
– E.g., Amazon EC2, …
• Platform as a Service, PaaS
The evolving computational architecture
• Mainframe computers (institutional computing)
• Minicomputers (departmental computing)
• Workstations (laboratory computing)
• Laptops (personal computing)
• …back to the future..??
Cloud Computing: A meeting of trends
• Data volumes
• Price/performance of computing platforms
• Cost of ownership
Cloud Computing Origins
• Cloud computing: Many definitions
– Here’s one: Use of remote data centers to manage scalable, reliable, on-demand access to applications
• Origins
– Goes back to the need by Web search engines to inexpensively process all the pages on the Web
– Done by creating a grid of datacenters and processing data in parallel across them
– Development of a parallel data programming environment by Google: MapReduce
• Data + cloud computing
– What about remote centers for scalable, reliable, on-demand access to data?
Cloud Computing
• A different pricing model
– No upfront cost of acquisition. Rent, don’t buy.
• Can access 1000’s of processors / disks
– Scalability
– “Elastic computing”
• A different model for dealing with system failures
– Retry, loose consistency, …
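The retry model can be sketched as a wrapper that repeats a failing call with growing delays instead of treating the first failure as fatal. This is a minimal sketch: the function name, the attempt count, and the delay schedule are assumptions for illustration, and the flaky operation only simulates a transient failure.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Simulated service that fails twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast to run; a real client would use the default `time.sleep`.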
Cloud computing for data
• Data as a service: what is the abstraction for storage?
– Table, Blob, Queue
– …??
• Describing characteristics of the data
– Metadata about storage to specify policies to be applied
– Security, reliability, performance, etc.
• Scaling to meet application needs
– Large configurations
– Dealing with virtualization
– New failure models
  • Retry, loose consistency
Storage as a Service
• Amazon S3: An example
– Charges for Storage, Data Transfer, and Requests (e.g. PUT, COPY, POST, LIST, GET)
• Issues
– Bandwidth to storage
– Quality of Service
– Storage Elasticity
– Privacy / security
• Standardization efforts
– Storage Networking Industry Association (SNIA) Technical Working Group (TWG) on Cloud Storage has just started
• Important Issues
– Metadata for storage
– Scaling up to large dataset sizes
The two sides of Cloud Computing
• Large distributed infrastructure
– “Everything is in the cloud”
– Interesting as a proposition for the IT operations of an enterprise
– Cloud companies would like to reach deep into enterprise IT
– “Our business is not the entrenched data centers in current large organizations, but the new companies…”
• Large-scale infrastructure in the Datacenter
– Seeding the cloud
– Shared-nothing parallelism
– Data on the cheap…a la Google
The NSF Cluster Exploratory (CluE) Program
• Google-IBM-NSF Cluster
– Well over a thousand processors
  • When fully built out, will comprise approximately 1,600 processors
– Terabytes of memory
– Hundreds of terabytes of storage
• Open source software
– Linux and Apache Hadoop
• IBM Tivoli
– System management, monitoring and dynamic resource provisioning
• A platform for “apples-to-apples” comparisons
– Can reserve time on nodes for exclusive access
Our CluE Project
• Project (PI: Baru; co-PI: Krishnan)
– Performance Evaluation of On-Demand Provisioning Strategies for Data Intensive Applications
• Investigate hybrid software model
– Database system / Hadoop system
– Some parts of the application require features provided by a DBMS
  • Transactional capability, full SQL support
– Other parts of the application can exploit Hadoop model
  • Very large data sets
  • Data parallel processing
  • Loose consistency models
• Price / performance is an issue
– Including energy costs
San Andreas Fault LiDAR Dataset: Data Access Patterns
• B4 Dataset
Experiments
• “On-demand” database vs Hadoop
• SQL vs Hadoop
• Energy consumption as a factor in price/performance
• Platforms to be used
– Google-IBM cluster
– OpenCirrus testbed
– Triton resource
The Road Ahead
• Advanced search engines
– Search structured and unstructured data
– Deal with display of heterogeneous results
– Show provenance of data
• Sophisticated tools for 3D and 4D data integration
– Combination of “server-side” processing and caching and client-side interaction and visualization
• Service-oriented architecture
– Applications and IT infrastructure available as services
– Perhaps some of them in “the Cloud”
Dealing with very large data
• Either the data can be partitioned into segments and processed in parallel
– Shared-nothing parallelism
• Or not
– Shared memory systems
Parallel Processing of Large Data
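[Figure: shared-memory architecture (processors P sharing memory M and disks D) versus shared-nothing architecture (nodes, each with its own processor P, memory M, and disk D, connected by a network); in the shared-nothing case a partitioning strategy distributes the dataset across the nodes]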
Data partitioning strategies
• Round-robin
– Equal distribution across nodes by data volume
• Hash
– All data with the same key value go to the same node
• Range
– All data within a range of values go to the same node
[Figure: a partitioning strategy distributes the dataset across shared-nothing nodes]
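The three partitioning strategies can be sketched as functions that assign records to node-local lists. This is a minimal sketch with in-memory lists standing in for nodes; the function names are assumptions for illustration. CRC32 is used as the hash because it gives the same result on every run.

```python
import bisect
import zlib


def round_robin(records, n):
    """Deal records to n nodes in turn, giving near-equal volumes."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


def hash_partition(records, n, key):
    """Send all records with the same key value to the same node."""
    parts = [[] for _ in range(n)]
    for rec in records:
        node = zlib.crc32(str(key(rec)).encode("utf-8")) % n
        parts[node].append(rec)
    return parts


def range_partition(records, boundaries, key):
    """Send records to nodes by value range.

    `boundaries` is a sorted list of split points; a value equal to a
    split point goes to the higher partition.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        parts[bisect.bisect_right(boundaries, key(rec))].append(rec)
    return parts


records = list(range(10))
by_turn = round_robin(records, 4)
by_range = range_partition(records, [3, 7], key=lambda r: r)
by_hash = hash_partition(["a", "b", "a", "c", "a"], 3, key=lambda r: r)
```

Round-robin balances volume but scatters related records; hash keeps equal keys together, which helps joins and grouping; range keeps neighboring values together, which helps range queries such as a spatial window.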
MapReduce / Hadoop
• Programming environment for very large scale data processing
• Managing task executions and data transfers in a shared-nothing environment
– MapReduce: Infrastructure to support data scatter / gather
– Distributed data repository (“file system”)
  • Google File System (GFS)
  • Hadoop Distributed File System (HDFS)
– Round-robin partitioning of data
• MapReduce
– Google’s proprietary implementation
• Hadoop
– Apache, open source implementation
• Hadoop vs database
MapReduce execution
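The scatter / gather pattern can be sketched in a single process: mappers emit key-value pairs from each input split, the pairs are re-hashed so each key lands on one reducer, and reducers combine the values per key. This is a minimal sketch of the programming model only; it does not use Hadoop, and the function names and sample data are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def map_phase(splits, mapper):
    """Apply the mapper to every record of every input split."""
    for split in splits:
        for record in split:
            yield from mapper(record)


def shuffle(pairs, num_reducers):
    """Re-hash intermediate pairs so each key lands on exactly one reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[index][key].append(value)
    return buckets


def reduce_phase(buckets, reducer):
    """Run the reducer once per key and gather the results."""
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            result[key] = reducer(key, values)
    return result


def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    return sum(values)


# Two input splits of text lines, counted with two reducers.
splits = [["granite basalt", "basalt gabbro"], ["basalt granite"]]
counts = reduce_phase(shuffle(map_phase(splits, word_mapper), 2), count_reducer)
```

In a real deployment each split and each reducer bucket would live on a different node, with the framework handling task scheduling and data movement.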
MapReduce vs Database
• Database
– Partition “base tables” into N partitions
– Intermediate data can be “re-partitioned”
– Intermediate data can be combined
– Well-defined algebra for data manipulation (SQL)
• MapReduce / Hadoop
– Partition input data file into M splits
– Intermediate data are re-hashed
– Intermediate data can be “combined”
– Java programs
• Cost of dynamic vs static partitioning
– Run time costs
– Storage costs
• Optimal partitioning
– Query and workload dependent
– How to measure any deviations from the optimal?
– When to repartition?
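One possible way to quantify deviation from a balanced partitioning is a skew ratio: the size of the largest partition divided by the mean partition size. This is a minimal sketch of that single metric, offered as an assumption rather than the measure used in the project; the threshold of 1.5 is likewise arbitrary.

```python
def skew(partition_sizes):
    """Ratio of the largest partition to the mean; 1.0 means balanced."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean if mean else 1.0


def should_repartition(partition_sizes, threshold=1.5):
    """Suggest repartitioning when the skew ratio exceeds the threshold."""
    return skew(partition_sizes) > threshold
```

With four partitions of sizes 70, 10, 10, and 10, the ratio is 2.8, so the slowest node does almost three times the average work. A fuller treatment would weigh that against the run-time and storage cost of moving the data.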
USGS Role in Geoinformatics
Fundamental: Develop, maintain, make accessible:
• Long-term national and regional geologic, hydrologic, biologic, and geographic databases
• Earth and planetary imagery
• Open-source models of the complex natural systems and human interaction with that system
• Physical collections of earth materials, biologic materials, reference standards, geophysical recordings, paper records
• National geologic, biologic, hydrologic, and geographic monitoring systems
• Standards of practice for the geologic, hydrologic, biologic, and geographic sciences
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.
USGS Role in Geoinformatics
All activities (data creation, modeling, monitoring, collections, standards, etc.) must be done in cooperation and collaboration with the public and with governmental, academic, and private sector partners and stakeholders.
A critical USGS role: facilitate bringing communities together!
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.
Data Collections versus Communities of Practice
Geoinformatics must evolve beyond the accumulation of data, models, and standards to become the framework for a community of practice in the natural sciences.
Etienne Wenger and Jean Lave coined the term and developed the learning theory of communities of practice: that we learn not only as individuals but as communities. By engaging in communities of practice we increase our capacity and innovation as well as leverage our support for areas of interest.
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.
Creativity, Learning, and Innovation
A community of practice is not merely a community with a common interest, but practitioners who share experiences and learn from each other. They develop a shared repertoire of resources: experiences, stories, tools, vocabularies, ways of addressing recurring problems. This takes time and sustained interaction. Standards of practice and reference materials will grow out of this. But the critical benefits include: creating and sustaining knowledge, leveraging of resources, and rapid learning and innovation.
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.
1000’s of National and Regional Databases
• The National Map: topographic, elevation, orthoimagery, transportation, hydrography, etc.
• Geospatial One Stop portal
• MRDATA: Mineral Resources and Related Data
• The National Geologic Map Database: standardized community collection of geologic mapping
• National Water Information System: NWISWeb
• National Geochemical Survey Database (PLUTO, NURE)
• National Geophysical Database (aeromag, gravity, aerorad)
• Earthquake Catalogs
• North American Breeding Bird Survey
• National Vegetation/speciation maps
• National Oil and Gas Assessment
• National Coal Quality Inventory
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.