Data Science class presentation: POLARIS
Polaris
Area: Data visualization & interface design
A System for Query, Analysis and Visualization of Multi-dimensional Relational Databases
By Chris Stolte and Pat Hanrahan, Stanford University

Presenter: Ganesh Viswanathan
CIS6930: Data Science: Large-scale Advanced Data Analysis
University of Florida
September 15, 2011

Motivation
Large multi-dimensional databases have become very common:
  corporate data warehouses: Amazon, Walmart
  scientific projects: Human Genome Project, Sloan Digital Sky Survey
A major challenge for these huge databases is to extract meaning from the data they contain, such as:
  to discover structure,
  to find patterns, and
  to derive causal relationships.
Need tools for exploration and analysis of these databases
The exploratory analysis process is one of hypothesis, experiment, and discovery. The path of exploration is unpredictable, and analysts need to be able to rapidly change both what data they are viewing and how they are viewing it.

Existing Tools: Charts
typically provide a gallery of charts
hard to iteratively explore
simple charts can display few dimensions
Existing tools: Pivot Tables
common interface to data warehouses
simple interface based on drag-and-drop
generate text tables from databases:
-- The most popular interface to multi-dimensional databases.
-- Allow the data cube to be rotated so that different dimensions of the dataset may be encoded as rows or columns of the table.
-- The remaining dimensions are aggregated and displayed as numbers in the cells of the table.
-- Cross-tabulations and summaries are then added to the resulting table of numbers.
-- Finally, graphs may be generated from the resulting tables.

Pivot Tables
Multi-dimensional databases are often treated as n-dimensional data cubes.
Pivot Tables allow rotation of multi-dimensional datasets: different dimensions can be assigned to the rows and columns of the table, and the remaining dimensions are aggregated within the table.
[Figure: data cube with dimensions TIME, PRODUCT (Toothpaste, Juice, Cola, Milk, Cream, Soap) and REGION]

Pivot Table Example: Baseball data
By dragging and dropping the dimensions to and from the left-hand column, top row, upper-left corner, and central data area (where the remaining dimensions are aggregated), one can change the Pivot Table view. Any of these views can be subsequently graphed.
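
The rotation described above can be sketched in a few lines of Python. This is only an illustration: the records, the field names (team, year, position, hits) and the pivot helper are all invented here, and the aggregation is fixed to a sum.

```python
from collections import defaultdict

# Hypothetical baseball-style records; every value is invented for illustration.
RECORDS = [
    {"team": "A", "year": 2009, "position": "P", "hits": 10},
    {"team": "A", "year": 2010, "position": "C", "hits": 25},
    {"team": "B", "year": 2009, "position": "P", "hits": 7},
    {"team": "B", "year": 2009, "position": "C", "hits": 30},
    {"team": "B", "year": 2010, "position": "P", "hits": 12},
]


def pivot(records, row_field, col_field, measure):
    """Sum `measure` for each (row, column) pair; all other dimensions are aggregated away."""
    cells = defaultdict(int)
    for rec in records:
        cells[(rec[row_field], rec[col_field])] += rec[measure]
    return dict(cells)


# Teams as rows, years as columns; position is aggregated.
by_team_year = pivot(RECORDS, "team", "year", "hits")

# "Rotate" the cube: positions as rows, teams as columns; year is aggregated.
by_position_team = pivot(RECORDS, "position", "team", "hits")

print(by_team_year)
print(by_position_team)
```

Changing the two field arguments plays the role of dragging a dimension to a different part of the Pivot Table.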

Relational databases
Each row in a table = basic entity (tuple)
Each column represents a field
Fields can be ordinal or quantitative
Relational Data Schema
Structural description of data sets
Primitives: attributes, tuples and relations

Motivation
Relational data schema enables flexible database design
No corresponding flexible ways to construct effective UI and visualization:
  unique data schema -> unique visualization/coordination
  database keeps changing
  different views for same data

Mismatch in design capabilities
                 Relational Databases    Traditional Visualization
Design Goal      Data design             Visualization design
Design Method    Data schema             Program code
Designer         Data owner              Programmer only
Design Change    Rapid, dynamic          Slow, static
Adaptability     Flexible                Brittle

Requirements on UI for Analysis and Exploration
Data dense displays: display both many tuples & many dimensions
Multiple display types: different displays suited to different tasks
Exploratory interfaces: rapidly change data transformations and views
Data dense: Analysts need to be able to create visualizations that will simultaneously display many dimensions of large subsets of the data.
Multiple display types: Analysis consists of many different tasks, such as discovering correlations between variables, finding patterns in the data, locating outliers and uncovering structure. An analysis tool must be able to generate displays suited to each of these tasks.
Exploratory interface: The analysis process is often an unpredictable exploration of the data. Analysts must be able to rapidly change what data they are viewing and how they are viewing that data.

Polaris
Polaris is an interface for the exploration of multi-dimensional databases that extends the pivot table interface to directly generate a rich, expressive set of graphical displays.

Polaris Design Goals
Generate rich table-based graphical displays rather than tables of text
Single conceptual model for both graphs and tables
Preserve ability to rapidly construct displays
Interactive analysis and exploration versus static visualization
Simple, consistent interface
Ease analysis and exploration:
  Want to extract meaning from data
  Process of hypothesis, experiment, and discovery
  Path of exploration is unpredictable
Excel Pivot Tables provide a simple interface for building text-based tables
Graphs require multiple steps: different interfaces and conceptual models
Want to unify tables, graphs, and database queries in one interface

Features of Polaris
Builds tables using an algebraic formalism involving the fields of the database.
Each table consists of layers and panes, and each pane may be a different graphic.
An interface for constructing visual specifications of table-based graphical displays.
The state of the interface can be interpreted as a visual specification of the analysis task and automatically compiled into data and graphical transformations.

Features of Polaris
The visual specifications can be rapidly and incrementally developed, giving users visual feedback as they construct complex queries and visualizations.
Ability to generate a precise set of relational queries from the visual specifications.
Users can incrementally construct complex queries, receiving visual feedback as they assemble and alter the specifications.

Features of Polaris
Addresses these demands by providing an interface for rapidly and incrementally generating table-based displays.
A table consists of a number of rows, columns, and layers.
Each table axis may contain multiple nested dimensions.
Each table entry, or pane, contains a set of records that are visually encoded as a set of marks to create a graphic.

Visualizing Multidimensional Data
Several characteristics of tables make them particularly effective for displaying multi-dimensional data:
Multivariate - multiple dimensions of the data can be explicitly encoded in the structure of the table, enabling the display of high-dimensional data.
Comparative - tables generate small multiple displays of information, which are easily compared, exposing patterns and trends across dimensions of the data.
Familiar - Statisticians are accustomed to using tabular displays of graphs, such as scatterplot matrices and Trellis displays, for analysis. Pivot Tables are a common interface to large data warehouses.
multivariate: multiple dimensions can be encoded in the structure of the table
comparative: tables generate small-multiple displays of information
familiar: users are accustomed to tabular displays

Polaris Display: UI

Design Decision: Use a Formalism
Why a formalism?
unification: unify tables and graphs
expressiveness: build visualizations designers did not think of
interface simplicity: clearly defined semantics and operations
code simplicity: composable language versus monolithic objects
declarative: can state what, not how - allows for optimization, etc.

Polaris Formalism
Interface interpreted as visual specification in formal language that defines:
  table configuration
  type of graphic in each pane
  encoding of data as visual properties of marks
Specification compiled into data & graphical transformations to generate display

Example specification
[Figure: example specification with its table configuration part marked]

Formalism Example: Specifying Table Configurations
Interface: define table configuration by dropping fields on shelves
Formalism: shelf content interpreted as expressions in table algebra
Can express extremely wide range of table configurations

Formalism Example: Specifying Table Configurations
Operands are the database fields
  each operand interpreted as a set {}
  quantitative and ordinal fields interpreted differently
Three main operators: concatenation (+), cross (X), nest (/)
Additionally: dot (.) operator

Table Algebra: Operands
Ordinal fields - interpret domain as a set that partitions table into rows and columns:
  QUARTER = {Quarter1, Quarter2, Quarter3, Quarter4}
  [Figure: table with columns Quarter 1 to Quarter 4 holding 31,400; 37,120; 35,600; 30,900]
Quantitative fields - treat domain as single-element set and encode spatially as axes:
  PROFIT = {P[0 - 65,000]}
  [Figure: axis labeled "Profit (in thousands)", ticks 10 to 60]

Table Algebra: Concatenation (+) Operator
Ordered union of set interpretations:
  QUARTER + PRODUCT_TYPE = {QTR1, QTR2, QTR3, QTR4} + {Coffee, Tea} = {QTR1, QTR2, QTR3, QTR4, Coffee, Tea}
  PROFIT + SALES = {P[0-65,000], S[0-125,000]}
  [Figure: table with columns Quarter 1 to Quarter 4, Coffee, Tea; axes "Profit (in thousands)" and "Sales (in thousands)"]

Table Algebra: Cross (X) Operator
Cross-product of set interpretations:
  QUARTER X PRODUCT_TYPE = {(Qtr1, Coffee), (Qtr1, Tea), (Qtr2, Coffee), (Qtr2, Tea), (Qtr3, Coffee), (Qtr3, Tea), (Qtr4, Coffee), (Qtr4, Tea)}
  PRODUCT_TYPE X PROFIT = {(Coffee, P[0-65,000]), (Tea, P[0-65,000])}
  [Figure: Coffee and Tea columns nested under each of Quarter 1 to Quarter 4; one "Profit (in thousands)" axis each for Coffee and Tea]

Table Algebra: Nest (/) Operator
QUARTER X MONTH
  would create twelve entries for each quarter, including meaningless ones such as (Qtr1, December)
QUARTER / MONTH
  would only create three entries per quarter
  based on tuples in database, not semantics
  can be expensive to compute
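
The three table-algebra operators can be mimicked with ordinary Python lists. This is a minimal sketch rather than the Polaris implementation: the function names, the string labels used for the quantitative axes, and the twelve sample records are assumptions made for illustration.

```python
# Ordinal fields: ordered sets of domain values.
QUARTER = ["Qtr1", "Qtr2", "Qtr3", "Qtr4"]
PRODUCT_TYPE = ["Coffee", "Tea"]
MONTH = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Quantitative fields: single-element sets, each standing for one axis.
PROFIT = ["P[0-65,000]"]
SALES = ["S[0-125,000]"]

# Hypothetical database tuples: each month occurs only with its own quarter.
RECORDS = [{"quarter": QUARTER[i // 3], "month": m} for i, m in enumerate(MONTH)]


def concat(a, b):
    """Concatenation (+): ordered union of the two set interpretations."""
    return list(a) + list(b)


def cross(a, b):
    """Cross (X): every pairing of an element of a with an element of b."""
    return [(x, y) for x in a for y in b]


def nest(a, b, records, a_field, b_field):
    """Nest (/): like cross, but keep only pairs that occur together in the data.

    This needs a scan over the records, which is why nest can be expensive.
    """
    seen = {(r[a_field], r[b_field]) for r in records}
    return [pair for pair in cross(a, b) if pair in seen]


print(concat(QUARTER, PRODUCT_TYPE))
print(cross(PRODUCT_TYPE, PROFIT))
print(len(cross(QUARTER, MONTH)))
print(len(nest(QUARTER, MONTH, RECORDS, "quarter", "month")))
```

With this sample data, the cross of QUARTER and MONTH contains pairs such as (Qtr1, Dec) that never occur in the records, while the nest keeps only the three months that actually appear with each quarter.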

Polaris Display: UI

Polaris Display
Drag and drop fields from database schema onto shelves
May combine multiple data sources, each data source mapping to a separate layer
Multiple fields may be dragged onto each shelf
Data may be grouped or sorted, and aggregations may be computed

Polaris Display
Selecting a single mark in a graphic displays the values for the mark
Can lasso a set of marks to brush records
Marks in the graphics use retinal properties

Retinal Properties
Ordinal/nominal mapping vs. quantitative mapping
Properties: shape, size, orientation, and color.
When encoding a quantitative variable, should only vary one aspect at a time
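
The difference between an ordinal and a quantitative mapping can be illustrated with a small sketch. The shape palette, the size range, the helper names and the sample records below are assumptions chosen for the example; they are not taken from Polaris.

```python
# Hypothetical palette of distinguishable shapes for ordinal/nominal values.
SHAPES = ["circle", "square", "triangle", "cross"]


def ordinal_to_shape(domain):
    """Ordinal/nominal mapping: each distinct value gets its own shape."""
    if len(domain) > len(SHAPES):
        raise ValueError("not enough distinct shapes for this domain")
    return dict(zip(domain, SHAPES))


def quantitative_to_size(value, lo, hi, min_size=2.0, max_size=20.0):
    """Quantitative mapping: vary one aspect only (size), linearly in the value."""
    if hi == lo:
        return min_size
    t = (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp values outside [lo, hi]
    return min_size + t * (max_size - min_size)


# Invented records: product type is ordinal, profit is quantitative.
RECORDS = [
    {"product_type": "Coffee", "profit": 32500},
    {"product_type": "Tea", "profit": 65000},
]

shape_of = ordinal_to_shape(["Coffee", "Tea"])
marks = [
    {"shape": shape_of[r["product_type"]],
     "size": quantitative_to_size(r["profit"], 0, 65000)}
    for r in RECORDS
]
print(marks)
```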

Visual Specification
Is the configuration of the fields of the tables on shelves
User does this by dragging and dropping fields onto shelves
Controls:
  Mapping of data sources to layers
  Number of rows, columns, and layers, and their relative order
  Selection of tuples from the database
  Grouping of data within a pane
  Type of graphic displayed in each pane
  Mapping of data fields to retinal properties

Graphics
Ordinal-Ordinal: e.g. the table
  the axis variables are typically independent of each other
Ordinal-Quantitative: e.g. bar chart
  the quantitative variable is often dependent on the ordinal variable
Quantitative-Quantitative: e.g. maps
  view distribution of data as a function of one or both variables; discover causal relationships

Types of Graphics (Ordinal-Ordinal)
Axis variables are independent of each other
R represents the fields encoded in the retinal properties of the marks
Following slide shows sales and margin as a function of product type, month and state for items sold by a coffee chain

Ordinal-Quantitative Graphics
Bar charts, dot plots, Gantt charts
Quantitative variable is dependent on ordinal variable
Following slide shows a case where a matrix of bar charts is used to study several functions of the independent variables product and month

Quantitative-Quantitative Graphics
Discover causal relationships between the two quantitative variables.
Following slide shows how flight scheduling varies with the region of the country in which the flight originated
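
The three families of graphics are determined by the types of the two axis fields. A minimal sketch of that classification follows; the function name and the string labels are illustrative assumptions.

```python
def graphic_family(x_type, y_type):
    """Classify a pane by the types of the fields on its two axes."""
    kinds = {x_type, y_type}
    if not kinds <= {"ordinal", "quantitative"}:
        raise ValueError("field types must be 'ordinal' or 'quantitative'")
    if kinds == {"ordinal"}:
        return "ordinal-ordinal"            # e.g. the table
    if kinds == {"quantitative"}:
        return "quantitative-quantitative"  # e.g. maps
    return "ordinal-quantitative"           # e.g. bar charts, dot plots, Gantt charts


print(graphic_family("ordinal", "quantitative"))
```

The order of the two arguments does not matter, so a bar chart is classified the same way whichever axis carries the ordinal field.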

Visual mappings
Encoding different fields of the data to retinal properties
Shape, Size, Orientation, Color
Used in the ordinal-to-ordinal example
Display Types
Gantt charts of events for a parallel graphics application on a 32-processor SGI machine.
Flights between major airports in the USA
Source code colored by cache misses for a parallel graphics application.
Major wars and the births of well known scientists as a timeline.

Querying
Three steps:
  Select the records
  Partition the records into panes
  Transform the records within the panes
To create database queries, it is necessary to generate an SQL query per table pane (i.e. must iterate over entire table, executing SQL for each pane).

Transformations and Data Flow

Generating Database Queries
1. Selecting the Records

Generating Database Queries
2. Partitioning the records into panes
Putting retrieved records in their corresponding pane

Generating Database Queries
3. Transforming Records within the Panes
If there is aggregation, it is done here
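
The three steps can be sketched with Python's built-in sqlite3 module, issuing one SQL query per pane as described above. The table name, columns, sample rows and the pane_query helper are all invented for this illustration; the queries Polaris actually generates may look different.

```python
import sqlite3

# In-memory database with a hypothetical "sales" relation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (quarter TEXT, product_type TEXT, state TEXT, profit REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Qtr1", "Coffee", "CA", 100.0),
        ("Qtr1", "Coffee", "NY", 50.0),
        ("Qtr1", "Tea", "CA", 30.0),
        ("Qtr2", "Coffee", "CA", 80.0),
        ("Qtr2", "Tea", "NY", 20.0),
        ("Qtr2", "Tea", "FL", 999.0),  # dropped by the selection step
    ],
)


def pane_query(quarter, product_type):
    """Run the query for one pane of a QUARTER x PRODUCT_TYPE table."""
    sql = (
        "SELECT state, SUM(profit) FROM sales "
        "WHERE state IN ('CA', 'NY') "           # 1. select the records
        "AND quarter = ? AND product_type = ? "  # 2. partition into this pane
        "GROUP BY state ORDER BY state"          # 3. transform (aggregate) within the pane
    )
    return conn.execute(sql, (quarter, product_type)).fetchall()


# Iterate over the whole table, executing one query per pane.
panes = {
    (q, p): pane_query(q, p)
    for q in ("Qtr1", "Qtr2")
    for p in ("Coffee", "Tea")
}
for key, rows in panes.items():
    print(key, rows)
```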

Example application
Cut expenses for a national coffee store
Create table of scatterplots showing relationship between marketing costs and profit (Figure 6a)
Notice trend: certain products have high marketing costs with little or no profit

Polaris
Demo: Tableau Software

Related Work
Single relation visualization
  APT
  Sage/SageBrush
  DEVise
Multiple relation visualization
  Visage
  DataSplash/Tioga-2
  Rivet/Polaris
  Sieve

Related Work
Formalisms for Graphics
  Wilkinson's Grammar of Graphics
  Bertin's Semiology of Graphics
  Mackinlay's APT
Visual Queries
  Trellis display, DEVise, Visage
Table-based Visualizations
  Table lens, Spreadsheet for Visualization

Interesting, upcoming projects:
IBM Many Eyes: Site that lets users upload data and then produce graphic representations for others to view and comment upon, for free!
Processing: Open source programming language and environment for people who want to create images, animations, and interactions.
Prefuse: Interactive information visualization toolkit
Commercial visualization software: Tableau, Qlikview, Tibco Spotfire, Microsoft BI platform (PowerPivot, Excel 2010, SQL Server with VertiPaq, SSAS, SSRS and SSIS)
To "democratize" visualization, and experiment with new collaborative techniques, we built Many Eyes, a web site where people may upload their own data, create interactive visualizations, and carry on conversations. The goal is to foster a social style of data analysis in which visualizations serve not only as a discovery tool for individuals but also as a means to spur discussion and collaboration.
Processing: from MIT Media Lab: Processing is an open source programming language and integrated development environment (IDE) built for the electronic arts and visual design communities with the purpose of teaching the basics of computer programming in a visual context, and to serve as the foundation for electronic sketchbooks.
One of the stated aims of Processing is to act as a tool to get non-programmers started with programming, through the instant gratification of visual feedback. The language builds on the Java programming language, but uses a simplified syntax and graphics programming model.

Wilkinson's Grammar of Graphics
Describes formalism for statistical graphics
Different choices in the design of formalism:
  non-relational data model
  different operators in table algebra
Further experience necessary to fairly evaluate differences between our formalisms

Conclusions
Novel interface for rapidly constructing table-based graphical displays from multi-dimensional relational databases
A formalism for specifying complex graphics and tables
Interpretation of visual specifications as relational (SQL) queries and drawing operations.
Interactive exploration of large multi-dimensional databases
Expressive set of graphical displays
Uses tables to organize multiple graphs on a display

Discussion
Allows overlap between the relations that are divided into each pane of the Polaris display, unlike the basic Pivot Table model.
Allows more versatile computation of aggregates (e.g., medians and averages, in addition to sums).
Intuitive drag-and-drop interface, like that seen in Pivot Tables

Remarks
Merits:
  A cohesive architecture for coordinating visualization components
  Flexible and easy user interface, no programming needed
  Support for visual query
  Good integration between query and visualization schema
Shortcomings:
  Not an extensible architecture for the data analysis system
  Limited support for coordinated data navigation (pan, zoom)
  Lack of support for hierarchical data (fix: dot operator)

Possible Improvements
Generate database tables from a selected set of marks. Use selected mark in one display as the data input to another.
Integrate a table lens, instead of having to click a mark to view its details.
Exploring interaction techniques for navigating hierarchical structures of multi-dim databases.
Provide an adapter to link external data sources without explicitly storing data in the analysis system.

Polaris: Extended Formalism
Additional formalism defined in papers*:
  specification of different graph types
  encoding of data as retinal properties of marks in graphs
  data transformations
  translation of visual specification into SQL queries
* Relevant papers:
Chris Stolte, Diane Tang and Pat Hanrahan. Query, Analysis, and Visualization of Hierarchically Structured Data using Polaris. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 2002.
Chris Stolte, Diane Tang and Pat Hanrahan. Polaris: A System for Query, Analysis and Visualization of Multi-dimensional Relational Databases (extended paper). IEEE Transactions on Visualization and Computer Graphics, Vol. 8, No. 1, January 2002.

Formalism: Extensions
Can mix graph types in single visualization:

Dot (.) operator: Hierarchies
Many data warehouses have hierarchical dimensions:
  Time: Year, Month, Day
  Location: Country, State, Region
Dot (.) works like Nest (/) except it exploits the defined hierarchies
  based on semantics, not tuples in database

Demo
Questions?
http://graphics.stanford.edu/projects/polaris/