
CS5604: Information and Storage Retrieval Fall 2017 FE (Front End Team)

Ultimate Question: How can we best build a state-of-the-art information retrieval and analysis system?

Keywords: GETAR, Spatial Search, Information Retrieval, Geospatial Data, Visualization, GeoBlacklight, D3

Team FE (Front End):
Haitao Wang
Yali Bian
Shuo Niu
Jieun Chon

CS 5604
Fall 2017

Virginia Tech
Blacksburg, Virginia 24061

Supervisor: Professor Edward A. Fox

Dec. 21, 2017

Contents

List of Figures
List of Tables
Abstract

1 Introduction
1.1 GETAR
1.2 Course Description and Purpose
1.3 Team FE (Front End)
1.3.1 Team Goals and Responsibilities
1.3.2 Team Members
1.3.3 Roles of each member

2 Background and Literature Review
2.1 Faceted Search and Information Retrieval
2.1.1 Faceted Search
2.1.2 Information Retrieval
2.2 Blacklight and GeoBlacklight
2.2.1 Blacklight
2.2.2 GeoBlacklight
2.3 Information Visualization
2.3.1 Big Data Visualization
2.3.2 Visualization of Social Network Data and Geo-spatial Data
2.4 Visualization Techniques and Tools
2.4.1 Visualization Techniques
2.4.2 Visualization Tools

3 Project Requirements and Work Plan
3.1 Project time-line and Schedule
3.1.1 GeoBlacklight Development Time-line
3.1.2 Visualization Implementation Time-line
3.2 Project Delivery
3.2.1 Previous and Current Works

4 GETAR Portal Design, Implementation, and Customization
4.1 Target User
4.1.1 User requirements
4.1.2 User Requirement Identification
4.2 Front End Interface Components and Features
4.2.1 Text/Data Search Component and Proposed Features
4.2.2 Map View Component and Proposed Features
4.2.3 Graphic Visualization Components and Proposed Features
4.2.4 User management
4.3 Case Scenarios and Workflow
4.4 Front End Wireframe
4.5 HBase Schema and Solr Schema for GeoBlacklight
4.5.1 Naming Scheme for GeoBlacklight Solr schema
4.5.2 GeoBlacklight Schema
4.5.3 Map GeoBlacklight Schema with HBase Schema
4.5.4 Sample data and initial Front End prototype demo
4.6 Set Up Server and Environment
4.6.1 VM and Ubuntu: Create a Development Virtual Machine
4.6.2 Java: Install Java Open JDK on Ubuntu 16.04
4.6.3 Git: Installation and Configuration
4.6.4 Ruby Installation
4.6.5 Rails Installation
4.6.6 Set up Relational Database (MySQL)
4.7 Create Application
4.7.1 First Rails Application
4.7.2 Install GeoBlacklight
4.7.3 Install default GeoBlacklight
4.8 Index Solr
4.9 Connection with Another Solr
4.10 Final Version of GETAR Portal
4.10.1 Development History
4.10.2 Functionalities of GETAR Portal
4.10.3 Future Work

5 Visualization Design: Implementation, and Evaluation
5.1 Data Abstraction
5.2 Task Abstraction
5.3 Design and Prototype
5.3.1 Dynamic Queries: Searching and Filtering
5.3.2 Visualization with Multi-views
5.4 Implementation
5.4.1 System Overview
5.4.2 Query interface
5.4.3 Tweet Card
5.4.4 Visualization with Multi-views
5.4.5 Solr API
5.4.6 PHP Router

6 Acknowledgement

A Installation Links
A.1 Install UbuntuOS
A.2 Install Ubuntu 16.04 LTS on VirtualBox/Windows 10
A.3 Set Up for Ruby On Rails on Ubuntu 16.04
A.4 Vim Installation
A.5 Vim Commands
A.6 Vim for Editing Files in Linux
A.7 Download Solr sample data

B Error Solutions
B.1 SQL Error Solutions
B.1.1 Access denied for user ‘root’@‘localhost’
B.2 GeoBlacklight Error Solutions
B.2.1 core already exists

C Other Resources
C.1 GeoBlacklight Schema GitHub

Bibliography

List of Figures

2.1 A general model of information retrieval systems [11]
2.2 Blacklight Features
2.3 GeoBlacklight
2.4 Front End wireframe for Text Search and Map View
2.5 Visualizations
2.6 Proposed Front End components
2.7 Visualization Prototypes that could be used on this project [29]

3.1 Project schedule and workflow

4.1 Proposed Front End components
4.2 Use Cases Scenarios and Workflow of the Application
4.3 GeoBlacklight modular architecture and inner relations to HBase table and Solr index
4.4 Data Structure of Sample JSON Used for Testing
4.5 Data Structure of Sample JSON Used for Testing
4.6 Prototype demo: spatial search with map bounding box
4.7 Prototype demo: map clustering visualization
4.8 Prototype demo: heat map visualization
4.9 Ubuntu ISO download
4.10 Open Terminal
4.11 Update Ubuntu Server
4.12 Java Installation and Version Check
4.13 Git Installation and Version Check
4.14 Git Installation and Version Check
4.15 Git Configuration Verification on the GitHub Account
4.16 Git Configuration Verification on the Ubuntu
4.17 Update Ubuntu and Install SW
4.18 Install Ruby using rbenv
4.19 Ruby Version Check and Install Bundler
4.20 Download NodeJS
4.21 Install NodeJS and Version Check
4.22 Rails 5.1.4 Version Installation
4.23 Create New Rails Application
4.24 Launch Server
4.25 First Rails Application
4.26 Access Gemfile
4.27 Edit Gemfile
4.28 Bundle Installation
4.29 GeoBlacklight Launching
4.30 http://localhost:3000
4.31 Relaunch GeoBlacklight after adding new Schema
4.32 GeoBlacklight with new Schema
4.33 GeoBlacklight with Faceted Searching
4.34 SSH to Solr index server
4.35 Run Rail Server for Launching GeoBlacklight
4.36 Run Rail Server for Launching GeoBlacklight
4.37 Run Rail Server for Launching GeoBlacklight
4.38 Update Solr server connection in blacklight.yml
4.39 Launch Rail on the mule.dlib.vt.edu Server
4.40 First Version of the GeoBlacklight
4.41 Second Version of the GETAR Portal with GeoBlacklight
4.42 Second Version GETAR Portal’s Searching Result
4.43 Final Version of the GETAR Portal
4.44 Final Version GETAR Portal’s History Page
4.45 Final Version GETAR Portal’s Searching Page
4.46 Final Version GETAR Portal’s Facet Searching Sidebar
4.47 Final Version GETAR Portal’s User Page

5.1 Data Abstraction (figure from [13])
5.2 Task-Abstraction [12]
5.3 Prototype of Visualization Design
5.4 Overview of TweetBank
5.5 System implementation overview
5.6 Query Interface
5.7 Tweet Card
5.8 Social Network Graph
5.9 Two charts for tweets by user categories
5.10 Stream graph with time series
5.11 Tweets Tag Cloud
5.12 Geo-location of Tweets: Countries and States
5.13 Geo-location of Tweets: State

List of Tables

4.1 REQUIRED metadata for GeoBlacklight
4.2 REQUIRED metadata for GeoBlacklight
4.3 Proposed additional metadata for Front End features
4.4 Data mapping table among HBase, Solr, and GeoBlacklight (partial)

Abstract

Social media and Web data are becoming important sources of information for researchers to monitor and study global events.

GETAR [25], led by Dr. Edward Fox, is a project aiming to collect, organize, browse, visualize, study, analyze, summarize, and explore content and sources related to biodiversity, climate change, crises, disasters, elections, energy policy, environmental policy/planning, geospatial information, green engineering, human rights, inequality, migrations, nuclear power, population growth, resiliency, shootings, sustainability, violence, etc.

This report introduces the work of the Front End (FE) team in analyzing users’ requirements and building user interfaces for people to explore tweet/webpage data. The work of the FE team relies heavily on the results from other teams. Our duties include presenting the collected tweets/webpages, visualizing the clusters and topics, showing the indexed and clustered search results, and, last but not least, allowing users to perform customized queries and exploration. Therefore, the team needs to consider how other teams collect and manage the data, as well as how people utilize the information to gain insights from the data repository.

Throughout Fall 2017, our team aimed to bridge the data archive and users’ needs, focusing on providing various user interfaces for tweet/webpage exploration and analysis. Overall, two main user interfaces were designed and implemented during the semester: (1) a visualization-based analytical tool for people to create categories by searching and interacting with filtering tools, which are presented in visualizations such as a bar chart, tag cloud, and node-link graph; and (2) a geo-based interface for location-based information, implemented with GeoBlacklight, enabling users to view tweets/webpages on maps.

This report documents the background, plans, schedule, design, implementation, software installation, and other related useful information. We used Solr and a triple-store to provide data, and the "getar-cs5604f17-final_shard1_replica1" collection was used in the final testing and delivery. An overview of the team's work and the detailed design and implementation are both provided. We highlight the visualization-based interface and the location-based interface, as they provide visual tools for people to better understand the data collected by all the teams. We seek to provide information on how we extract users’ requirements, how user needs are reflected in light of the related literature, and how that leads to the design of the visualization and geo-interface. An installation manual is also detailed, seeking to help other software engineers who will keep working on GETAR to reuse our work.

Chapter 1

Introduction

The Front End group is responsible for presenting the tweet and webpage content through a user interface with intuitive interactions, building on existing tools such as Blacklight and GeoBlacklight, on top of the GETAR (Global Event and Trend Archive Research) project [25].

1.1 GETAR

The GETAR project [25] will devise interactive, integrated, digital library/archive systems coupled with linked and expert-curated webpage/tweet collections, covering key parts of the 1997-2020 timeframe, supporting research on urgent global challenge events and initiatives.

It will allow diverse stakeholder communities to interactively collect, organize, browse, visualize, study, analyze, summarize, and explore content and sources related to biodiversity, climate change, crises, disasters, elections, energy policy, environmental policy/planning, geospatial information, green engineering, human rights, inequality, migrations, nuclear power, population growth, resiliency, shootings, sustainability, violence, etc.

GETAR [25] will leverage VT research on digital libraries, natural language processing, HCI, information retrieval, machine learning, discovery analytics, and Web archiving.

1.2 Course Description and Purpose

As a course project, the main work for students is learning how to solve real-world data problems and create projects that should be incorporated into the GETAR [25] project as a production-quality service, open for use by a broad range of stakeholders.

1.3 Team FE (Front End)

1.3.1 Team Goals and Responsibilities

Our FE (Front End) team, part of the class of students, will work with the other teams (Classification, Collection Management Tweets, Collection Management Webpages, Clustering and Topic Analysis, and SOLR) to provide more powerful functions for the GETAR [25] project.

Our FE team uses their results and communicates with users. Our main job will be presenting other groups’ output (mainly the SOLR team’s search results) to users effectively and efficiently through web-based interfaces and visualizations driven by users’ interests and interactions.

1.3.2 Team Members

There are 4 members from course CS5604 in our FE team:

• Haitao is a project associate at CGit (Center for Geospatial Information Technology).

His specialties include web mapping, geo-visualization, land use, land cover change, and vegetation dynamics.

He is proficient in both geospatial analysis/programming and environmental applications.

• Yali Bian is a Ph.D. student of Computer Science under the direction of Dr. Chris North, whose research topic is visual analytics: combining visualization with data mining/machine learning techniques to help people make sense of large collections of datasets.

• Shuo Niu is a Ph.D. student in the Department of Computer Science, supervised by Dr. Scott McCrickard.

His research is focused on human-computer interaction in general, with special interest in collaborative visualization systems on multi-user multi-touch displays.

• Jieun Chon is a Master’s student, working as a Graduate Research Assistant at Virginia Tech under the direction of Professor Kafura and Bart’s Computational Thinking class and Professor Shaffer’s OpenDSA research.

She is working as a Full Stack Developer, typically Front End. She also worked on the server side, to launch the application on the Digital Library Research Laboratory Hadoop cluster [28].

1.3.3 Roles of each member

• Haitao focuses on the spatial search module and map-based visualization.

He designs the initial wireframe for the Front End of the map view, and develops a map visualization prototype with test data as a proof of concept.

He is also collaborating with Jieun Chon on implementing a GeoBlacklight instance and preparing the schema.

• Yali Bian is working on the design and development of the tweet data visualization part.

He designed and implemented the framework of the multiview visualization and interactions, and was responsible for the development of the tweet card view, tag cloud, geo-map, and social network graphs, as well as the brushing and linking interaction between the card view, tag cloud, and geo-maps.

• Shuo Niu prepares test data and works with Yali on interaction design.

In the early stage, Shuo imports the test data into Solr and builds a Blacklight server to run the test data.

Shuo also extracts user requirements and talks with data users to understand how the tweets will be analyzed.

Based on that knowledge, he works with Yali on the design of the tweet visualization.

• Jieun Chon is working on the system side, including installing the needed packages and launching GeoBlacklight successfully.

Jieun is in charge of the GeoBlacklight connection with Solr. She also created manuals by devising a step-by-step command line guide with screenshots.

Throughout the semester, she created the test data for Solr and tested it with GeoBlacklight.

She is also preparing the reports using LaTeX (Overleaf).

Chapter 2

Background and Literature Review

2.1 Faceted Search and Information Retrieval

2.1.1 Faceted Search

The job of the FE team is to provide an appropriate information interface for users to search, explore, and understand the archived data.

The most basic feature is a search interface for retrieving items from the tweet/webpage repository.

Search boxes may include other features that simplify searching, such as auto-complete, search suggestions, and a spelling checker.

Search boxes are often also accompanied by drop-down menus or other input controls to allow the user to restrict the search or choose what type of content to search for.

In some cases, while users type a search string, the results for that string are also shown in the content area.

However, if a page shows results this way, loading can be slow, and sometimes it may cause the browser to freeze or crash.

For more complicated search purposes, faceted search, also called faceted navigation or faceted browsing [17][18], is the most common technique [19].

It is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters.

Figure 2.1: A general model of information retrieval systems [11]

A faceted classification system classifies each information element along multiple explicit dimensions, called facets, enabling the classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order.
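As a rough illustration of classifying items along several facets and filtering by them, the following is a minimal Python sketch. The records and the facet names (`topic`, `year`, `source`) are hypothetical and are not taken from the GETAR schema.

```python
from collections import Counter

# Hypothetical records; the facet names ("topic", "year", "source") are illustrative only.
RECORDS = [
    {"id": 1, "topic": "hurricane", "year": 2017, "source": "tweet"},
    {"id": 2, "topic": "hurricane", "year": 2016, "source": "webpage"},
    {"id": 3, "topic": "election", "year": 2016, "source": "tweet"},
    {"id": 4, "topic": "shooting", "year": 2017, "source": "tweet"},
]


def apply_facets(records, selected):
    """Keep the records that match every selected facet value."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selected.items())]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(r[facet] for r in records)


tweets_2017 = apply_facets(RECORDS, {"source": "tweet", "year": 2017})
print([r["id"] for r in tweets_2017])      # ids of records matching both filters
print(facet_counts(tweets_2017, "topic"))  # remaining choices for a further filter
```

Because each filter is independent, the same records can be reached through any order of facet selections, which is the property that distinguishes faceted access from a single fixed taxonomy.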

2.1.2 Information Retrieval

Information retrieval (IR) aims at finding relevant documents that satisfy an information need from within large computer-based collections.

It also supports the browsing or filtering of document collections or further processing of a set of retrieved documents.

Usually a “document” is a piece of text [10], while modern document collections often contain emails, newsgroups, and webpages [8].

An information retrieval system could be considered as an automatic system that reads a question from a user and searches an internally stored collection of documents to retrieve those documents most relevant to the question [9]. Such systems have been developed pragmatically for catalogs, commercial online services, bibliographic utilities, or optical disk filed products over the past several decades. A general model of an information retrieval system is presented in Figure 2.1.
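The question-in, ranked-documents-out loop described above can be sketched in a few lines of Python. This is only a toy, assuming a hypothetical three-document collection and a deliberately simple score (the number of distinct query terms a document contains); production engines such as Solr use far richer indexing and ranking.

```python
from collections import defaultdict

# Hypothetical mini-collection standing in for the stored documents.
DOCS = {
    "d1": "hurricane relief efforts in texas",
    "d2": "election results and voter turnout",
    "d3": "texas election officials respond to hurricane damage",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Sort by score (descending), then by id so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


INDEX = build_index(DOCS)
print(search(INDEX, "election hurricane damage"))
```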

On the web, there are a variety of online information retrieval systems for commercial, educational, or non-profit purposes, such as search engines (e.g., Google, Baidu), digital libraries (e.g., the Networked Digital Library of Theses and Dissertations), and so on. One of the most significant online systems is the Internet Archive (http://www.archive.org/). Since 1996 it has archived more than 279 billion webpages and 11 million books and texts, and it provides free and public access to its documentary records, serving millions of people each day.


Figure 2.2: Blacklight Features

2.2 Blacklight and GeoBlacklight

2.2.1 Blacklight

Blacklight is an open source web-based application developed in Ruby on Rails that provides a basic discovery interface for searching an Apache Solr index. It provides a search box, facet constraints, stable document URLs, etc., all of which are customizable via Rails (templating) mechanisms.

As Figure 2.2 shows, Blacklight accommodates heterogeneous data, allowing different information displays for different types of objects. Blacklight offers these features and more:

• Stable URLs for search and record pages allow users to bookmark, share, and save search queries for later access.

• Supports OpenSearch, a collection of simple formats for the sharing of search results

• Search queries can be targeted at specific sets of fields.

• Faceted searching

• Results sorting

Figure 2.3: GeoBlacklight

Providing both basic and complicated search functions requires a solid connection between the Front End faceted search UIs and the underlying information retrieval engine. Since our project is based on the Solr system, we will use Blacklight as the basic application framework to glue them together.
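To make the link between a faceted UI and Solr more concrete, here is a hedged Python sketch of how a keyword query plus facet selections might be translated into a Solr `/select` request URL. The parameters `q`, `fq`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters; the host and the field names (`text`, `topic_s`, `year_i`) are assumptions made for illustration, and Blacklight itself builds such requests in Ruby rather than Python.

```python
from urllib.parse import urlencode

# Assumptions: the host is a placeholder, and the field names
# ("text", "topic_s", "year_i") are hypothetical.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "getar-cs5604f17-final_shard1_replica1"  # collection named in the abstract


def build_select_url(text, filters, facet_fields, rows=10):
    """Build a Solr /select URL for a keyword query with facet constraints."""
    params = [("q", f"text:({text})"), ("rows", str(rows)), ("wt", "json")]
    # Each selected facet value becomes its own filter query (fq).
    params += [("fq", f'{field}:"{value}"') for field, value in filters.items()]
    # Ask Solr to return value counts for the facets shown in the sidebar.
    params.append(("facet", "true"))
    params += [("facet.field", field) for field in facet_fields]
    return f"{SOLR_BASE}/{COLLECTION}/select?{urlencode(params)}"


url = build_select_url("hurricane", {"topic_s": "disaster"}, ["topic_s", "year_i"])
print(url)
```

Keeping facet selections in `fq` rather than in `q` leaves the relevance ranking of the keyword query unaffected by the filters.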

2.2.2 GeoBlacklight

GeoBlacklight is a platform for geospatial (GIS) data. Figure 2.3 shows a demo version of GeoBlacklight, which is an open collaborative project with many features:

• Facet by place, subject

• Text and spatial search with ranking

• Facet by institution, year, publisher, data type, access, format

• Results list view with icons, snippets, and map view of bounding boxes

• Spatial search on map in result list

• Built on top of Blacklight platform

– Search history

– Sort by relevance, year, title

– Customizable skin and facets

Figure 2.4: Front End wireframe for Text Search and Map View

For more geo-visualization-specific applications, which provide and visualize Geographic Information Systems (GIS) data (structured data in specific file formats such as Shapefiles and rasters), we will use GeoBlacklight. Built on Blacklight, it aims to provide a simple, effective open-source application for the discovery of geospatial data.
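Spatial search over bounding boxes, as listed among the GeoBlacklight features above, boils down to a rectangle-overlap test. The Python sketch below shows that test under simplifying assumptions: the extents are made-up values, boxes that cross the antimeridian are not handled, and the `ENVELOPE(minX, maxX, maxY, minY)` string is shown only as the WKT-style form that Solr spatial fields accept, not as a description of how the GETAR index was configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """A bounding box in decimal degrees (west, south, east, north)."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        """True when the two boxes overlap (antimeridian crossing is not handled)."""
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)

    def to_envelope(self) -> str:
        """Format as ENVELOPE(minX, maxX, maxY, minY)."""
        return f"ENVELOPE({self.west}, {self.east}, {self.north}, {self.south})"


# Hypothetical layer extents and a hypothetical map viewport.
virginia = BBox(-83.7, 36.5, -75.2, 39.5)
texas = BBox(-106.6, 25.8, -93.5, 36.5)
viewport = BBox(-81.0, 36.0, -79.0, 38.0)

print(viewport.intersects(virginia))
print(viewport.intersects(texas))
print(viewport.to_envelope())
```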

2.3 Information Visualization

2.3.1 Big Data Visualization

For other, more complicated and larger tweet/webpage datasets, where users do not know in advance which subset is most interesting, the most effective and efficient approach is to provide overview information about the dataset visually. Information visualization is the study of interactive visual representations of abstract data to reinforce human cognition.

For information visualization applications about tweets or texts, most visual interface designs combine tag clouds and stream graphs on top of text mining methods such as topic models or TF-IDF based techniques. Also, to give users the ability to explore and narrow down interesting subsets of big data, interactions should be integrated into visualization applications, such as dynamic query, direct manipulation, details on demand, filtering, linking and brushing, magic lens, navigation, zoom, and pan. For our team, all those visual components and interaction techniques combined could be used to provide users effective exploration of big collections of tweets/webpages.
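As one example of how a text mining score can drive a tag cloud, the sketch below computes TF-IDF weights for a few hypothetical tweets and scales them linearly to font sizes. The tweets, the whitespace tokenizer, and the pixel range are assumptions for illustration; the plain `tf * log(N / df)` weighting is one common variant among several.

```python
import math
from collections import Counter

# Hypothetical tweets; a real pipeline would also remove stop words and punctuation.
TWEETS = [
    "flood warning issued for houston",
    "houston flood rescue underway",
    "election day turnout is high",
]


def tfidf_weights(docs):
    """Return one {term: tf-idf} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return weights


def tag_sizes(weights, min_px=12, max_px=48):
    """Scale tf-idf weights linearly to font sizes for a tag cloud."""
    top = max(weights.values())
    if top == 0:
        return {t: min_px for t in weights}
    return {t: round(min_px + (max_px - min_px) * w / top)
            for t, w in weights.items()}


first = tfidf_weights(TWEETS)[0]
sizes = tag_sizes(first)
print(sizes)
```

Terms that appear in several tweets (here "flood" and "houston") receive smaller sizes than terms unique to one tweet, which is the intended effect of the inverse document frequency factor.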

2.3.2 Visualization of Social Network Data and Geo-spatial Data

Twitter and other social media data such as blogs have already become one of the major information sources for people to understand global events and public opinion. People post and re-post, comment, like, hash tag, and @-mention others on Twitter, which makes Twitter data valuable for probing not only people’s thoughts but also their communication about topics and events.

Utilizing and analyzing Twitter data needs to be supported by visual tools, as the complex topic patterns and communication structures are not directly understandable to data users, especially when comparing different topics or understanding connections.

Twitter Visualization Studies

Twitter visualization has been extensively studied in recent years. Some works focus on identifying and visualizing keywords or tags among one or a collection of Twitter topics. For example, tag-boost is a method to identify keywords from Twitter posts to generate word cloud visualizations [21]. Some other studies investigated the social network and interactions between people on Twitter [3]. For example, Aragon et al. studied the social communications on Twitter during the Spanish national election, and the retweets between different political groups were visualized in a node-and-link visualization [1].

Recent advances in interactive visualization bring in more visual designs that depict Twitter topics with various interactive graphical metaphors. Bhulai et al. investigated a method to assess the trend of topics and visualize the ascending topics with a tree-map visualization [2]. Guille et al. proposed a mention-anomaly-based approach to detect burst topics on Twitter, which are visualized in a stack chart [4]. Another important type of Twitter visualization is the geographical view of the data.

TwitterReporter collects Twitter information about late breaking news and visualizes news feeds on an interactive map [6]. TweetXplorer associates multiple visualization techniques and visualizes a number of tweets on a map. WallTweet is a prototype implemented on a large high-resolution display which shows topics and key hash tags on a large map [7]. In another similar work, ScatterBlogs, messages on a global crisis are visualized on a large map view to enhance users’ awareness of global events [5].

2.4 Visualization Techniques and Tools

Both academic research and industrial design and development of information visualizations shouldbe based on the following two parts: (1) update to data visualization techniques including avisual interface and interactions, and (2) visualization tools providing visual prototypes and basicinteractions.

2.4.1 Visualization Techniques

A visual interface is the core design part of a visualization application: the main design for a graphical representation of data or concepts. There are some basic techniques our FE team should learn to make sure the visualization design addresses the current purpose: to construct a visualization application out of a set of design choices and let the users get what they want, effectively and efficiently. Here are the rules of thumb that we could use in our design:

Eyes Beat Memory

“Using our eyes to switch between different views that are visible simultaneously has much lower cognitive load than consulting our memory to compare a current view with what was seen before.” [13] There are multiple features of our tweet and webpage contents: author, date, events, topics, keywords, geo-location, and so on. We plan to use several views on the same screen, to show users the same data instance from different perspectives.

Overview First, Zoom and Filter, Details on Demand

Ben Shneiderman’s influential mantra of Overview First, Zoom and Filter, Details on Demand is a heavily cited design guideline that emphasizes the interplay between the need for overview and the need to see details, as well as the role of data reduction in general and navigation in particular in supporting both. [14]

For our tweets and webpage data, there are a huge number of elements. It is impossible, and also unnecessary, to show all the elements on the screen at the same time. Ben Shneiderman’s mantra [14] is our best rule to follow: at first, we only show parts of the tweet or webpage contents, based on default parameters. Then, based on users’ interactions or input strings, we present more related tweet or webpage elements and hide unrelated elements shown on the screen.

2.4.2 Visualization Tools

Developing visualization applications bottom-up is inefficient. Many applications are built on graphics libraries, and there are great visualization tools that can help developers, such as D3.js, Processing, Tableau, and R.

Since we are the FE team, and the requirement is that we develop our Front End project on top of Blacklight, a web-based open source Solr user interface discovery platform, the most powerful and best choice is D3.js, a JavaScript library for manipulating documents based on data. D3 [29] helps you bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation (Figure 2.5).

D3.js can easily be combined with other JavaScript libraries, such as jQuery. Using D3.js is as simple as using any other JavaScript library: link the library with the src attribute of a script tag in the HTML page (Figure 2.6).

Figure 2.5: Visualizations

Figure 2.6: Proposed Front End components

Figure 2.7: Visualization Prototypes that could be used on this project [29]

(a) Stream Graph (b) GeoMap

(c) Word Cloud (d) Force Directed Graph

Chapter 3

Project Requirements and Work Plan

3.1 Project time-line and Schedule

After the FE team was organized, we created a time-line showing each week, and also listed all our class dates. Our team has only four members, so we are trying to plan ahead and use our time very efficiently.

3.1.1 GeoBlacklight Development Time-line

See Figure 3.1 for the work flow of the team. The items below are the weekly tasks we have done throughout the semester:

• Week 1: Beginning of the class, met the class members.

• Week 2: Understood the project, applied for FE team and got accepted. Read and researched about GETAR project. Explored demos of both Blacklight and GeoBlacklight application on features, architecture, structure of search results, and so on. Prepared documentation about example search results JSON file in shared work folder. Proposed teamwork software suites, created Slack channel, and integrated Google Doc and Google Calendar with Slack (Haitao).

• Week 3: Started installing and launching GeoBlacklight (Jieun and Haitao 50% of the work, each). Made decisions on application modules and features. Drafted outline for report 1 by Haitao.

Figure 3.1: Project schedule and workflow

• Week 4: Published GeoBlacklight Manual. Report writing on Google Docs (Haitao, Jieun, and Yali 33% of the work each). Formatted Google Document to LaTeX by Jieun Chon.

• Week 5: Worked on GeoBlacklight. Investigated errors and solved problems. Updated detailed manual on report 2 (Haitao, Jieun 50% each).

• Week 6: Successfully deployed GeoBlacklight on a local virtual machine (Jieun and Haitao 50% work each). Drafted outline for report 2 by Jieun.

• Week 7: Created ShareLatex, Google document to list TODOs and Google Slides for the report/presentation 2 (Jieun). Started on report 2 writing (25% work everyone). Developed map based visualization prototypes with test JSON data (Haitao). Created data set and launched it with Solr (Jieun and Haitao 50% work each).

• Week 8: Had a meeting with Prof. Fox to discuss report 2, had good comments to fix from paper 1 (Jieun). Prepared application Front End initial wireframe design (Haitao). Proposed additional metadata schema for GeoBlacklight. Drafted data mapping table among HBase, Solr, and GeoBlacklight’s Front End UI (Haitao and Jieun).

• Week 9: Talked with SOLR team leader, Jeff, to make sure they can make Solr include data that fits the GeoBlacklight Schema. Started working on GeoBlacklight customization too. Prepared improved testing dataset (Haitao and Jieun).

• Week 10: Created Google document, Overleaf to list TODOs for the report/presentation 3 (Jieun). Edited paper 2 to continue to paper 3. Worked on Solr/GeoBlacklight Config and GeoBlacklight customization. Started working on paper 3 (due Nov. 9). Tested customized GeoBlacklight with new sample data.

• Week 11: Paper 3 completed by Thursday midnight. Successfully connected GeoBlacklight with cs5604fe_solr team’s Solr. Discussed with the SOLR team to get correct attribute names. Added data-flow and explanation about it into paper 3.

• Week 12: Worked on GeoBlacklight customization and made a manual for it. Implemented designed map view and visualization.

• Week 13: Thanksgiving Break. Figured out the connection with SOLR (Haitao 70% work and Abhinav helped during the break). Launched on the mule.dlib.vt.edu server (Jieun Chon with TA’s help).

• Week 14: Tested with SOLR team’s data. Worked on the customization (Haitao). Worked on the final paper and presentation (Jieun and Haitao).

• Week 15: Prepared the final presentation and paper submission.

• Week 16: End of the semester

3.1.2 Visualization Implementation Time-line

For the visualization part of the FE team, Shuo and Yali work together and have equal contributions. The work benefited from advice and comments from our group members Haitao and Jieun. Shuo is mainly responsible for loading and querying needed data from the SOLR team, as well as implementing the state map, time-line, and tweets by user categories views. Yali is mainly responsible for the design of the visualization framework, tweet list, world map, social network graph, and word cloud, as well as loading social network data from the RDF server.

• Start: talking with other group members, reading related material about GETAR project. (Done)

• Requirements Analysis: talking with professors in GETAR Project (Shuo Niu). Finding related works on tweet visualization and related techniques. (Yali Bian) (Done)

• Literature Reviews: reading related academic research works. (Shuo Niu, Yali Bian) (Done)

• Related Techniques: summarizing visualization techniques that might be suited for our project: D3js, React.js, and other libs. (Yali Bian) (Done)

• Familiar with Data: to make full use of other teams’ outcome, communicating with other groups. (Shuo Niu) (Done)

• Visualization Design and Prototype: designing the prototype frame of visualization system, and multi-views needed to represent the underlying dataset. (Yali Bian) (Done)

• Report 1-3: writing Chapter 5, and part of Chapter 2 reports on the visualization part and related works on visualization, preparing demos for presentations. (Shuo Niu, Yali Bian) (Done)

• Connecting with Solr: designing and implementing the query bar, configuring and connecting the Solr server based on the SOLR team’s schema, cleaning returned data from the Solr server. (Shuo Niu) (Done)

• Implementation and Development: implementing visualization views based on existing dataset we can get from other teams: tweet list, word cloud (Yali Bian), tweets by user categories, and state map. (Shuo Niu) (Done). The social network graph and world map are still under development since some data features needed are not available right now. (Yali Bian, Shuo Niu) (Done)

• Evaluation: Performing user study and case study to make sure our visualization project is effective and efficient, improve the system based on feedback from other teams and instructors. (Done)

• Future Work: performing some pilot study before the final evaluation, and collecting more usage feedback from users/researchers.

3.2 Project Delivery

3.2.1 Previous and Current Works

Interim report 1

We have 3 interim reports and one final report at the end of the semester. In the first report, we focused on understanding the project, figuring out the most important work for our team, learning the development software tools and platforms, installing GeoBlacklight, and trying to connect with the Solr system.

Reviews from the Classmates

We received a good number of reviews from the class members. Most of the comments concerned two points: the manual was too detailed, and there was no abstract section. We agree that we missed writing an abstract. However, one of our goals is to provide a good manual for the students or researchers who will use GeoBlacklight later. Many students in this class are from non-Computer-Science majors. Also, we cannot assume that all the students and researchers know how to use Git, how to install Java, or how to complete other installation tasks. For this reason, we would like to emphasize that one of the most important purposes of this paper is to give good direction to the future researchers or students who will be working on the Front End part.

Interim Report 2

As we indicated in Interim Report 1, we were expecting to have a successful connection with the Solr system, make GeoBlacklight work, and complete data set testing work by October 17, which is the due date for report 2. Fortunately, we successfully connected with the Solr system and described the details in Interim Report 2. We updated the installation details for GeoBlacklight and Solr. We added SQL error solutions, a short explanation of how the "gemfile" works and how the gem file leads to installing GeoBlacklight, and the solution for fixing the "core already exists" error. Also, we changed the LaTeX format from book format to PDF format and added an abstract to the beginning. We divided up chapter 1 with more details for each section. Finally, we wrote a detailed description of the Front End wireframe and an explanation of the HBase and Solr schema for GeoBlacklight.

Interim Report 3

We corrected errors and reordered content from Interim Report 2 and continued adding our progress. After the submission of Interim Report 2, the GeoBlacklight team spent a good amount of time with the SOLR team discussing the schema. Both teams now have a very clear understanding of each other’s work and plans. However, we are still figuring out how to connect with another Solr system. We ran test queries against their Solr to check that their schema has all the required fields, and let the SOLR team know if anything was missed. Right now, the FE team and SOLR team have all the required data and installed tools to be connected. We have successfully launched GeoBlacklight with the Fall 2017 SOLR team’s Solr. However, some of the attributes do not satisfy GeoBlacklight’s requirements, so we explain what is needed to fix the schema in Section 4.5.1.

For the visualization part, we added Section 5.4 Implementation: the system overview, query interface, tweet card, and multiple views, such as the social network graph, tweets by user categories (bar chart and pie chart), time-line, tag cloud, and geo-map. The progress after Interim Report 2 is the implementation of the visualization. After talking with classmates and professors, we improved the UI of the tweet card and tag cloud, and added more views: geo-map, time-line, and social networks.

Final Report and Presentation

For week 12, we worked on the customization and testing with our own dataset. Also, we copied the application to the mule.dlib.vt.edu server, so anyone can access GeoBlacklight online. We worked with the SOLR team to make the data ready before the Thanksgiving break. We launched GeoBlacklight with their Solr and tested it with the customized GeoBlacklight.

By the first week of December, we launched GeoBlacklight with a working search function. It supports functional searching and map/graphic visualization components. We planned to describe the progress of implementation along a time-line, and we also documented details for future Front End teams, researchers, or users during the next semester or later.

For the visualization sub-team, with the update of the GETAR (Global Event and Trend Archive Research) project, more datasets would come into HBase, with more new data types that need to be visualized and presented to users. Existing visualizations we designed might not handle news datasets, even though we designed our visualization based on the united HBase schema. Since the visualization frameworks combined with interactions have already been designed and implemented, future work will be finished more quickly, since sub-views for new data types could be easily implemented and merged into the visualization frameworks.

Chapter 4

GETAR Portal Design, Implementation, and Customization

4.1 Target User

4.1.1 User requirements

GETAR [25] collects events connected with crises, tragedies, and disasters – both natural and man-made – and also events of community or governmental interest, including elections. According to the GETAR FAQ, there are three types of people who will view and use our software.

• People who would explore archive events they find of interest.

• People who would like to use the GETAR archive for research, development, or education.

• People who would like to perform preliminary research, identifying initial research ideas from global event data.

In particular, we plan to use Blacklight and GeoBlacklight and will work on the visualization related to geospatial data (GIS). Our target researchers and users will be the people who are interested in and looking for event mapping, relation-visualization, visualization by time of the event, etc.

4.1.2 User Requirement Identification

To meet our team’s goal, we need to fulfill the users’ needs. Below is the list of user requirements that our team is aiming for:

1. tweet searching.

2. webpage searching.

3. searching by facets such as date, author, location, follower, etc.

4. filter and sorting by various facets.

5. detailed instructions for users or help page.

6. map components and graphic visualization.
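Requirements 3 and 4 (searching, filtering, and sorting by facets) can be illustrated with a small in-memory sketch. This is only an illustration under stated assumptions: in the actual application, Solr and Blacklight perform faceting; the sample records and the helper names `facet_filter` and `sort_by` are hypothetical, and the field names are borrowed from Table 4.3.

```python
# Hypothetical sample records; field names follow Table 4.3, values are made up.
DOCS = [
    {"dc_title_s": "Eclipse photos", "dc_type_s": "tweet",
     "dc_creator_sm": ["alice"], "dct_publish_dt": "2017-08-21"},
    {"dc_title_s": "Eclipse path map", "dc_type_s": "webpage",
     "dc_creator_sm": ["bob"], "dct_publish_dt": "2017-08-01"},
    {"dc_title_s": "Hurricane update", "dc_type_s": "tweet",
     "dc_creator_sm": ["alice"], "dct_publish_dt": "2017-09-10"},
]


def facet_filter(docs, facets):
    """Keep documents whose fields match every requested facet value."""
    def matches(doc, field, wanted):
        value = doc.get(field)
        # A multi-valued field (list) matches if any of its values is wanted.
        if isinstance(value, list):
            return wanted in value
        return value == wanted

    return [d for d in docs if all(matches(d, f, w) for f, w in facets.items())]


def sort_by(docs, field, descending=False):
    """Sort documents by one field; documents missing the field go last."""
    present = [d for d in docs if field in d]
    missing = [d for d in docs if field not in d]
    return sorted(present, key=lambda d: d[field], reverse=descending) + missing


if __name__ == "__main__":
    tweets_by_alice = facet_filter(DOCS, {"dc_type_s": "tweet", "dc_creator_sm": "alice"})
    for doc in sort_by(tweets_by_alice, "dct_publish_dt", descending=True):
        print(doc["dct_publish_dt"], doc["dc_title_s"])
```

Combining several facets narrows the result set with a logical AND, which is how faceted refinement is usually presented to users.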

4.2 Front End Interface Components and Features

The Front End application is proposed to provide a usable interface through four integrated components: text search, map view, graphic visualization, and user management, as shown below in Figure 4.1.

The proposed features for each component are as follows:

1. Text/Data Search Interface Module

2. Map View Interface Module

3. Graphic Visualization

4. User Management

4.2.1 Text/Data Search Component and Proposed Features

Below is the list of text and data searching components that we would like to implement:

1. Text search webpages and/or tweets

• by keywords

• by facets with metadata (time, date, location, author, followers, etc.)

• by facets with categories (event, other groupings, etc.)

Figure 4.1: Proposed Front End components

2. Sort and filter result

3. View full text or excerpt

4.2.2 Map View Component and Proposed Features

Below is the list of map view components that we would like to implement:

1. Spatial search webpages and/or tweets

• by specified location

• by current map extent

• by defined area of interest

2. Spatial filter of results

• by specified location

• by defined area of interest

3. Geo-visualization of results with location and other metadata

4.2.3 Graphic Visualization Components and Proposed Features

Below is the list of graphic visualization components that we would like to implement:

1. Visualization of search results or a subset of data from user preferences (with time, date, network, and other metadata, etc.)

• visualize events as time-line, spatiotemporal graphs

• visualize topics with streamgraph, tagcloud

• visualize network as force-directed graph, bubble

2. Support of user exploration

4.2.4 User management

Below is the list of user management components that we would like to implement:

1. Store user information (user ID, email)

2. Store user search behavior (search history)

3. Add or remove user account

4.3 Case Scenarios and Workflow

As can be seen in Figure 4.2, a user starts an information retrieval process either using text search or spatial search. These support keyword (text) search, faceted constraints, and map view (spatial extent) constraints. The results will be provided as different views, including a data view (full text webpages or tweets), metadata view, map view, and visualization view. The user will also be provided appropriate visualizations for metadata, networking schematics, or certain topics pertaining to the results (Figure 4.2).

4.4 Front End Wireframe

We developed an initial Front End wireframe mainly for the map view module. The Front End, which applies a left-right layout, provides a user interface for the proposed features of text search, spatial search, and map visualization (Figure 4.2).

Figure 4.2: Use Cases Scenarios and Workflow of the Application

– Search Bar: Provides combined functionality for both text search and geocoding. A user is able to discover information by keywords, or quickly navigate to a certain location on the map by a place name.

– Map: Provides location based information visualization, geographic context for displayed information, and a portal for spatial search based on view extent.

– Result summary: Provides brief summary and statistics about current search results, such as total number of items for both webpages and tweets, total number of tweets with geo-locations, and so on.

– Time slider: Provides animation of information change across time.

– Facets: Provides filter (topic, location, date, and so on) for search results.

4.5 HBase Schema and Solr Schema for GeoBlacklight

format: namespace_term_suffix
Example: dc_title_s

4.5.1 Naming Scheme for GeoBlacklight Solr schema

Predefined namespace prefixes

• "dc": "http://purl.org/dc/elements/1.1/",

• "dct": "http://purl.org/dc/terms/",

Table 4.1: REQUIRED metadata for GeoBlacklight

suffix   type      indexed   multiValued
_b       boolean   TRUE
_d       double    TRUE
_dt      date      TRUE
_f       float     TRUE
_i       int       TRUE
_l       long      TRUE
_s       string    TRUE
_ss      string    FALSE
_si      string    TRUE
_sim     string    TRUE      TRUE
_sm      string    TRUE      TRUE
_url     string    FALSE
_blob    binary    FALSE

• "georss": "http://georss.org#",

• "layer": "http://geoblacklight.org/schema/1.0#"

Predefined field type suffixes

Core terms

There are some core terms like those below:

identifier, slug, title, provenance, rights, description, creator, isPartOf, temporal, format, issued, language, source, spatial, publisher, subject, type, geom_type

We explain the details in Table 4.3.
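The naming scheme can be made concrete with a small parser that splits a field name into namespace, term, and suffix. This is a minimal sketch, assuming the prefixes and suffixes listed in this section are the complete set; the function name `parse_field` is hypothetical and not part of GeoBlacklight.

```python
# Namespace prefixes and field type suffixes as listed in this section.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "georss": "http://georss.org#",
    "layer": "http://geoblacklight.org/schema/1.0#",
}
SUFFIXES = {"b", "d", "dt", "f", "i", "l", "s", "ss", "si", "sim", "sm", "url", "blob"}


def parse_field(name):
    """Split 'namespace_term_suffix' into (namespace, term, suffix).

    Returns None when the name does not follow the scheme, e.g. 'solr_geom'.
    """
    parts = name.split("_")
    if len(parts) < 3:
        return None
    namespace, suffix = parts[0], parts[-1]
    if namespace not in NAMESPACES or suffix not in SUFFIXES:
        return None
    # The term itself may contain underscores, e.g. 'geom_type'.
    term = "_".join(parts[1:-1])
    return namespace, term, suffix


if __name__ == "__main__":
    for field in ("dc_title_s", "dct_spatial_sm", "layer_geom_type_s", "solr_geom"):
        print(field, "->", parse_field(field))
```

A check like this could be used to flag columns in a proposed schema that do not follow the convention before they are indexed.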

4.5.2 GeoBlacklight Schema

The GeoBlacklight Schema is a metadata schema for location based information discovery, and focuses solely on discovery use cases (Figure 4.3). Text search, faceted search and refinement, and spatial search are among the primary features that the schema enables.

The REQUIRED metadata for GeoBlacklight includes the columns shown in Table 4.2 and Appendix C.1.

We also proposed additional metadata (Table 4.3) that will be used for faceted search and other Front End features. The guidelines for developing new metadata are: 1) use a format similar to those native to the GeoBlacklight schema, or 2) follow another metadata standard like ISO 19139 or FGDC to derive metadata for the GeoBlacklight schema.

Table 4.2: REQUIRED metadata for GeoBlacklight

Column                  Description
dc_identifier_s         Identifier
layer_slug_s            Slug
dc_title_s              Title
solr_geom               Bounding Box
dc_rights_s             Rights
geoblacklight_version   Schema Version
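Before indexing, a record can be checked against the required columns of Table 4.2. The following is a sketch only: the helper `missing_required` and the sample record values (including the `ENVELOPE(...)` string for `solr_geom`) are illustrative assumptions, not data from the project.

```python
# Required GeoBlacklight columns, copied from Table 4.2.
REQUIRED_FIELDS = (
    "dc_identifier_s",
    "layer_slug_s",
    "dc_title_s",
    "solr_geom",
    "dc_rights_s",
    "geoblacklight_version",
)


def missing_required(record):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [])]


if __name__ == "__main__":
    # Hypothetical record used only to exercise the check.
    record = {
        "dc_identifier_s": "tweet-0001",
        "layer_slug_s": "tweet-0001",
        "dc_title_s": "Solar eclipse tweet",
        "solr_geom": "ENVELOPE(-118.2, -117.4, 36.5, 34.0)",
        "dc_rights_s": "Public",
        "geoblacklight_version": "1.0",
    }
    print(missing_required(record))                  # complete record
    print(missing_required({"dc_title_s": "x"}))     # most fields missing
```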

Table 4.3: Proposed additional metadata for Front End features

Metadata           Column             Description                                          Example
Author             dc_creator_sm      Document author                                      Edward Fox
Date               dct_publish_dt     Year of publication of document                      2017
Institution        dc_provenance_s    Institution of data collection holder                Virginia Tech
Spatial coverage   dct_spatial_sm     Interpreted location as place name                   Blacksburg, Virginia
Spatial bounds     dct_bounds_s       Spatial bounding box for S,W,N,E (if not a point)    (34.0, -118.2, 36.5, -117.4)
URL                dc_url_s           URL of webpage / tweet                               http://xxxx.yy.zz
Content            dc_description_s   Full text of tweet / webpage                         Full text
Topics             dc_topics_s        Related event or topic                               midflight experience
Collections        dc_isPartOf_sm     Related collection                                   Solar eclipse
Geom-type          dc_geotype_s       Point or polygon                                     Point
Doc-type           dc_type_s          Tweet or webpage                                     webpage
Hashtag            dct_hashtag_s      Hashtags of tweets                                   #hurricane

Figure 4.3: GeoBlacklight modular architecture and inner relations to HBase table and Solr index

4.5.3 Map GeoBlacklight Schema with HBase Schema

The HBase team proposed an HBase data schema table for the structure and description of both tweet and webpage collections. The column families include tweet, clean-tweet, webpage, doc-type, tweet-topic, and tweet-cluster. To transform the HBase columns (to the Solr index and then) to GeoBlacklight UI elements, as well as to provide direction for other teams’ efforts on data collection and development of the entire pipeline, we designed a data mapping table for this transformation (see Table 4.4).

Roles of metadata in GeoBlacklight

The aforementioned metadata will be used to support text search and sorting, scoring formulas, spell checking, or suggestions; see Figure 4.4.
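The transformation described by Table 4.4 can be sketched as a lookup from HBase (column family, column name) pairs to Solr field names. This is an illustration under assumptions: the `family:column` key format for an HBase row and the function name `to_solr_doc` are hypothetical, and only a few rows of the table are included.

```python
# Partial mapping copied from Table 4.4: (HBase family, column) -> Solr field.
HBASE_TO_SOLR = {
    ("metadata", "doc-type"): "dc_type_s",
    ("metadata", "collection-id"): "dc_topics_s",
    ("tweet", "screen-name"): "dc_creator_sm",
    ("clean-tweet", "long-url"): "dc_url_s",
    ("clean-tweet", "sner-locations"): "dct_spatialname_sm",
    ("webpage", "title"): "dc_title_s",
    ("clean-webpage", "clean-text"): "dc_content_s",
}
MULTI_VALUED_SUFFIXES = {"sm", "sim"}


def to_solr_doc(hbase_row):
    """Convert {'family:column': value} into a Solr-style document.

    Columns without a mapping are skipped. Fields with a multi-valued
    suffix are always emitted as lists.
    """
    doc = {}
    for key, value in hbase_row.items():
        family, _, column = key.partition(":")
        field = HBASE_TO_SOLR.get((family, column))
        if field is None:
            continue
        if field.rsplit("_", 1)[-1] in MULTI_VALUED_SUFFIXES:
            values = value if isinstance(value, list) else [value]
            doc.setdefault(field, []).extend(values)
        else:
            doc[field] = value
    return doc


if __name__ == "__main__":
    row = {
        "metadata:doc-type": "tweet",
        "tweet:screen-name": "alice",
        "clean-tweet:created-month": "2017-08",   # no Solr mapping in Table 4.4
        "clean-tweet:sner-locations": ["Blacksburg", "Virginia"],
    }
    print(to_solr_doc(row))
```

Keeping the mapping in one table-like structure makes it easy to review against the schema agreed with the HBase and SOLR teams.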

4.5.4 Sample data and initial Front End prototype demo

Before the SOLR team provides the final data for GeoBlacklight to consume, we take a simple JSON dataset with the required schema for our initial application testing and Front End prototype development. This dataset was originally developed by the 2016 FE team. Figure 4.5 is an example of the JSON data structure.

To provide the proposed features in the map module, we developed a web mapping interaction Front End prototype to consume the sample JSON data. The two purposes of this demo are: to test spatial search with a map bounding box, and to attempt a map based visualization for a large number of data points (heat map and marker clustering).

Table 4.4: Data mapping table among HBase, Solr, and GeoBlacklight (partial)

HBase Column Family   HBase Column Name   Solr Index           GeoBlacklight UI Element
metadata              doc-type            dc_type_s            Document Type
metadata              collection-id       dc_topics_s          Topic
tweet                 screen-name         dc_creator_sm        Author
clean-tweet           clean-text-solr
clean-tweet           clean-text-cla
clean-tweet           clean-text-cta
clean-tweet           created-month
clean-tweet           spatial-coord       dc_coords_s          Coordinates
clean-tweet           spatial-bounding    dct_bounds_s         Bounds
clean-tweet           solr_gemo           solr_geom
clean-tweet           hashtags            dct_hashtag_s        Hashtag
clean-tweet           mentions            dct_mention_s        Mention
clean-tweet           long-url            dc_url_s             URL
clean-tweet           sner-locations      dct_spatialname_sm   Location
webpage               title               dc_title_s           Title
webpage               author/publisher    dc_creator_sm        Author
webpage               sub-urls            dc_url_s             URL
clean-webpage         clean-text          dc_content_s         Content
clean-webpage         spatial-name        dct_spatialname_sm   Location

Spatial search with map bounding box

A bounding box is an area defined by two longitudes and two latitudes. In the World Geodetic System WGS84 standard commonly used for web mapping, latitude is a decimal number between -90.0 and 90.0. Longitude is a decimal number between -180.0 and 180.0. In the demo, each time a user zooms the map, the current view extent will be calculated. This extent is used as a filter to refine the tweets with geolocations. The refined results (counts, location) will be displayed back on the map in the “Map view” section, while the “Table view” section will also be updated to display only the filtered items at the same time (see Figure 4.6).
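The bounding box filter can be sketched as follows. This is a minimal illustration, not the prototype's code: the `BBox` type, the `lat`/`lon` keys on each tweet, and the handling of views that cross the 180 degree meridian are assumptions made for the example. The box order (south, west, north, east) follows the S,W,N,E convention of Table 4.3.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Map extent in WGS84 degrees, ordered south, west, north, east."""
    south: float
    west: float
    north: float
    east: float


def in_bbox(lat, lon, box):
    """True if the point lies inside the box, edges included."""
    if not (box.south <= lat <= box.north):
        return False
    if box.west <= box.east:
        return box.west <= lon <= box.east
    # west > east means the view crosses the 180 degree meridian.
    return lon >= box.west or lon <= box.east


def filter_tweets(tweets, box):
    """Keep only the geolocated tweets that fall inside the current extent."""
    return [t for t in tweets if in_bbox(t["lat"], t["lon"], box)]


if __name__ == "__main__":
    extent = BBox(34.0, -118.2, 36.5, -117.4)
    tweets = [
        {"id": 1, "lat": 35.0, "lon": -118.0},
        {"id": 2, "lat": 37.2, "lon": -80.4},
    ]
    print([t["id"] for t in filter_tweets(tweets, extent)])
```

In the demo this filter would be re-run on every zoom or pan event, and both the map view and the table view would be refreshed from the same filtered list.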

Map clustering visualization

Figure 4.4: Data Structure of Sample JSON Used for Testing

Figure 4.5: Data Structure of Sample JSON Used for Testing

The map marker clustering visualization uses a grid-based clustering technique. It divides the map into squares of a certain size according to the current zoom level, and groups the markers into each square grid. It creates a cluster (marker group) at a particular marker, and adds markers that are in its bounds to the cluster (see Figure 4.7).
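The idea of grid-based clustering can be sketched in a few lines. Note that this is a simplified variant, not the prototype's implementation: it bins points into fixed cells measured in degrees (a web map library would typically work in screen pixels), and the `base_cell_deg` parameter and the function name `grid_clusters` are hypothetical.

```python
import math
from collections import defaultdict


def grid_clusters(points, zoom, base_cell_deg=45.0):
    """Group (lat, lon) points into square cells whose size halves per zoom level.

    Returns one (center_lat, center_lon, count) tuple per non-empty cell.
    base_cell_deg is an assumed cell size at zoom 0, chosen for illustration.
    """
    cell = base_cell_deg / (2 ** zoom)
    cells = defaultdict(list)
    for lat, lon in points:
        # Points that share the same integer cell index belong to one cluster.
        key = (math.floor(lat / cell), math.floor(lon / cell))
        cells[key].append((lat, lon))

    clusters = []
    for members in cells.values():
        n = len(members)
        center_lat = sum(p[0] for p in members) / n
        center_lon = sum(p[1] for p in members) / n
        clusters.append((center_lat, center_lon, n))
    return clusters


if __name__ == "__main__":
    points = [(37.23, -80.41), (37.27, -80.42), (34.05, -118.24)]
    for zoom in (0, 10):
        print(zoom, sorted(c[2] for c in grid_clusters(points, zoom)))
```

Zooming in shrinks the cells, so nearby markers that were merged at a low zoom level separate into individual markers at a high zoom level.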

Heat map visualization

We also tested a heat map visualization to better illustrate the spatial distribution pattern of tweets. The heat map represents the intensity of tweets related to a place. In this color gradient overlay, areas of higher intensity are colored red, and areas of lower intensity appear blue (see Figure 4.8).
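The blue-to-red gradient can be illustrated with a simple linear colour mapping. This is a sketch under assumptions: real heat map layers usually apply a smoothing kernel and a multi-stop gradient, whereas this example only interpolates between two colours per grid cell; the function names are hypothetical.

```python
def intensity_to_rgb(value, max_value):
    """Map an intensity to an (r, g, b) colour: low is blue, high is red."""
    if max_value <= 0:
        return (0, 0, 255)
    # Normalise to the range [0, 1] and clamp out-of-range values.
    t = min(max(value / max_value, 0.0), 1.0)
    return (round(255 * t), 0, round(255 * (1 - t)))


def heat_colors(cell_counts):
    """Assign a colour to each grid cell based on its tweet count."""
    peak = max(cell_counts.values(), default=0)
    return {cell: intensity_to_rgb(n, peak) for cell, n in cell_counts.items()}


if __name__ == "__main__":
    # Hypothetical tweet counts per grid cell.
    counts = {(0, 0): 10, (0, 1): 0, (1, 1): 4}
    for cell, colour in heat_colors(counts).items():
        print(cell, colour)
```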

Figure 4.6: Prototype demo: spatial search with map bounding box

4.6 Set Up Server and Environment

GeoBlacklight has several development requirements to launch. The list of the software you should have installed on your development computer or server is:

• Java (1.8 or later)

• Git

• Ruby (2.2.7 or later)

• Rails (5 or later)
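The minimum versions listed above can be compared against the output of each tool's version command with a small helper. This is a sketch, assuming the version string contains a dotted number such as "2.4.2"; the names `MINIMUM`, `parse_version`, and `meets_minimum` are hypothetical, and Git is omitted because no minimum version is stated for it.

```python
import re

# Minimum versions taken from the requirement list above.
MINIMUM = {"java": (1, 8), "ruby": (2, 2, 7), "rails": (5,)}


def parse_version(text):
    """Extract the first dotted version number from a string as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(tool, version_text):
    """Compare a reported version string against the minimum for that tool."""
    found = parse_version(version_text)
    required = MINIMUM[tool]
    # Pad the shorter tuple with zeros so both have the same length.
    width = max(len(found), len(required))
    found += (0,) * (width - len(found))
    required += (0,) * (width - len(required))
    return found >= required


if __name__ == "__main__":
    print(meets_minimum("java", 'openjdk version "1.8.0_131"'))
    print(meets_minimum("ruby", "ruby 2.4.2"))
    print(meets_minimum("rails", "Rails 5.1.4"))
```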

Figure 4.7: Prototype demo: map clustering visualization

4.6.1 VM and Ubuntu: Create a Development Virtual Machine

Most of the tutorials and resources use Linux commands, so we used Linux, specifically Ubuntu 16.04.3 LTS. We started from scratch, using a new Ubuntu installation in a virtual machine. We used Oracle VirtualBox and Ubuntu 16.04.3 LTS, both of which are free to download online. You can download them from the links below; detailed installation instructions can be found in Appendices A.1 and A.2:

• Virtual Box: https://www.virtualbox.org/wiki/Downloads

• Ubuntu: https://www.ubuntu.com/download/desktop

For the Ubuntu download link, if you scroll down, you will see the link “Not now, take me to the download”, which will start downloading the ISO image, as shown in Figure 4.9.

Figure 4.8: Prototype demo: heat map visualization

Once you complete installing Ubuntu, you can start installing development software by using the terminal. To open the terminal, click the top icon on the task bar, then type terminal. You will see the black terminal box, as shown in Figure 4.10.

4.6.2 Java: Install Java Open JDK on Ubuntu 16.04

Java JDK (Java Development Kit) is a fully loaded development kit that has everything that JRE (Java Run-time Environment) has, plus more features to create software or applications. Before we start installing JDK, we need to make sure that our Ubuntu server is up-to-date by using the following command. Figure 4.11 shows the result:

$ sudo apt-get update

After the server has been updated, we will install JDK. If your server already has Java, you can check your current version of Java with the following command:

$ java -version

Figure 4.9: Ubuntu ISO download

However, our team’s Ubuntu is new, so we expect to not yet have Java. We will install the default version of JDK at this time. You can install and verify that Java is successfully installed with the following commands:

$ sudo apt-get install default-jdk
$ java -version

Figure 4.12 shows that the default-jdk package automatically installs JDK 1.8.0_131, which meets the requirement. Now, we have successfully installed Java 1.8.

4.6.3 Git: Installation and Configuration

Git installation

The simplest way to install Git is using Ubuntu’s default repositories. You can install the latest version of Git by installing from source, but we do not need to have the most recent version, so we will use the easier way.

Figure 4.10: Open Terminal

Figure 4.11: Update Ubuntu Server

Figure 4.12: Java Installation and Version Check

Figure 4.13: Git Installation and Version Check

If you have not updated the server yet, you will need to use the apt package management tools to update your Ubuntu. Afterwards, you can download and install Git with the second command below:

$ sudo apt-get update
$ sudo apt-get install git

Figure 4.13 shows the results after the commands.

Figure 4.14: Git Installation and Version Check

Configuring Git

We use Git as our version control system later on, so we set it up to match our GitHub account. If you or your teammates do not have a GitHub account, make sure to register here: https://GitHub.com/

You can configure Git with the following commands. Replace the name and email address with your GitHub account name and email.

$ git config --global color.ui true
$ git config --global user.name "YOUR NAME"
$ git config --global user.email "YOUR_EMAIL@example.com"
$ ssh-keygen -t rsa -b 4096 -C "YOUR_EMAIL@example.com"

After that, you need to copy the new SSH key and add it to your GitHub account. Copy the output from the following command:

$ cat ~/.ssh/id_rsa.pub

When you run this command, you will get output as shown in Figure 4.14. Now, you need to paste the output into your GitHub account’s SSH key list. Here is the link; it may ask you to log in first: https://GitHub.com/settings/keys

Figure 4.15: Git Configuration Verification on the GitHub Account

Figure 4.16: Git Configuration Verification on the Ubuntu

Once you add your new SSH key, you will see the new key on the SSH key list of your GitHub account, as in Figure 4.15, and you can also check with the following command:

$ ssh -T git@github.com

As you see in Figure 4.16, you should get a message like this:

Hi [NAME]! You've successfully authenticated, but GitHub does not provide shell access.

Now, you have successfully installed Git and matched it with your GitHub account. For more detailed installation/configuration of Git, please go to Appendix A.3.

4.6.4 Ruby Installation

Update Ubuntu and Install Dependent Software

First, we need to install some packages that are required for the Ruby installation. Again, if Ubuntu is not yet updated, we also need to update it before starting the software installation.

$ sudo apt-get update
$ sudo apt-get install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev
$ sudo apt-get install libyaml-dev libsqlite3-dev sqlite3 libxml2-dev libxslt1-dev
$ sudo apt-get install libcurl4-openssl-dev python-software-properties libffi-dev nodejs

Figure 4.17 shows the result after updating the system and installing the software.

Figure 4.17: Update Ubuntu and Install SW

Installation

There are many ways to install Ruby. We will install Ruby version 2.4.2, using rbenv. To do so, we need to follow two steps:

• install rbenv

• install ruby-build

Run the following commands to download and install Ruby, using rbenv:

$ git clone https://github.com/rbenv/rbenv.git ~/.rbenv
$ echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
$ echo 'eval "$(rbenv init -)"' >> ~/.bashrc
$ exec $SHELL

$ git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
$ echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
$ exec $SHELL

$ rbenv install 2.4.2
$ rbenv global 2.4.2

Figure 4.18 shows the result.

Install Bundler

After the installation is done, you can check your Ruby version with the first command below. The last step is to install Bundler with the second command. The result is shown in Figure 4.19:



Figure 4.18: Install Ruby using rbenv

Figure 4.19: Ruby Version Check and Install Bundler

Figure 4.20: Download NodeJS

$ ruby -v
$ gem install bundler

Because we are using rbenv, we need to run the following command after installing Bundler:

$ rbenv rehash

4.6.5 Rails Installation

NodeJS Installation

First, we need to install NodeJS, using the official repository by:

$ curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -

Figure 4.20 shows the result of this command. After that, you can install nodejs and verify the installation by running it with -v:

$ sudo apt-get install -y nodejs
$ nodejs -v

Figure 4.21: Install NodeJS and Version Check

Figure 4.21 shows the results after the commands for installing NodeJS and Version Check.

Rails Installation

Now, install Rails version 5.1.4, which is the version recommended by the resource discussed in Appendix A.3. Again, you can make sure that Rails is successfully installed by checking the version:

$ gem install rails -v 5.1.4
$ rails -v

Figure 4.22 shows the result after installing Rails and confirms that the version is 5.1.4. Because we are using rbenv, we need to run the following command to make the Rails executable available:

$ rbenv rehash

4.6.6 Set up Relational Database (MySQL)

Rails can use PostgreSQL, SQLite, or MySQL. At this point, the FE team uses MySQL, because the tutorial from Appendix A.3 does not recommend SQLite. The MySQL server can be installed on Ubuntu as shown below. You will need to set a password for the root user; it will later be stored in the Rails app's database.yml file:

$ sudo apt-get install mysql-server mysql-client libmysqlclient-dev

Also, please see Appendix B.1 for the SQL installation Error Solutions.



Figure 4.22: Rails 5.1.4 Version Installation

Figure 4.23: Create New Rails Application

4.7 Create Application

4.7.1 First Rails Application

Now, we will create our first Rails application. The following command creates a new Rails app named "cs5604fe" that uses MySQL (Figure 4.23):

$ rails new cs5604fe -d mysql

After that, move into the project folder, create the database, and launch the server (Figure 4.24):

$ cd cs5604fe
$ rake db:create
$ rails s -b 0.0.0.0

You can now visit http://localhost:3000 to view your new website (Figure 4.25)!



Figure 4.24: Launch Server

Figure 4.25: First Rails Application

4.7.2 Install GeoBlacklight

We have successfully installed all the development software and set up the server. Now, we will install GeoBlacklight. We cannot see the GeoBlacklight homepage at this time, because we have not yet set up and connected to the Solr system [26], but we install GeoBlacklight now in order to launch the homepage next week with the Solr connection.

First, you need to add GeoBlacklight to the Gemfile in your new Rails application folder [22]. A Gemfile is a file that describes the gem dependencies of a Ruby program. A gem is a packaged collection of Ruby code that we can load and call later. For our team, the folder name is cs5604fe. You will see the Gemfile if you use the ls command (Figure 4.26). We used vim to edit the Gemfile (see Appendices A.4 to A.6). You can download and install vim with the following command:

$ sudo apt-get install vim



Figure 4.26: Access Gemfile

Now, after you find the Gemfile in the directory, open it in an editor and add GeoBlacklight to it.

$ ls -l
$ sudo vim Gemfile

As Figure 4.27 shows, go to the end of the Gemfile and press i to insert the following lines. These are Gemfile entries, not shell commands:

gem 'blacklight'
gem 'geoblacklight'

After you add the lines to the Gemfile, press Esc, then save the file and quit vim with the following command:

:wq

Now, install the required gems, including GeoBlacklight and its dependencies, with the following command:

$ bundle install

Figure 4.28 shows the result of the bundle installation. After installing all the required gems and their dependencies, you need to run the Blacklight generator with Devise authentication. Then run the GeoBlacklight generator, the default Solr config, and the database migrations with the following commands:

$ rails g blacklight:install --devise
$ rails g geoblacklight:install -f
$ rake db:migrate



Figure 4.27: Edit Gemfile

Figure 4.28: Bundle Installation

Figure 4.29: GeoBlacklight Launching

Now, you can start the Solr and Rails servers with the following command. If you do not have Solr, this command will download and start a new Solr server:

$ rake geoblacklight:server

Finally, you will see that GeoBlacklight is working! Navigate to http://localhost:3000 or http://127.0.0.1:3000, and you will see the GeoBlacklight homepage. Figures 4.29 and 4.30 show the result.



Figure 4.30: http://localhost:3000

4.7.3 Install default GeoBlacklight

Another, simpler way to install GeoBlacklight is to install its default version. In that case you can skip Sections 4.7.1 and 4.7.2 and do everything with the following command:

$ rails new geoblacklight -m https://raw.githubusercontent.com/geoblacklight/geoblacklight/master/template.rb

After that, move into the newly created geoblacklight folder and run GeoBlacklight to set up the program. It is okay if the program does not run at this point because of the configuration file issue. If you install this default GeoBlacklight, you will use sqlite3 for the database. Try the following commands to launch:

$ cd geoblacklight
$ rake geoblacklight:server



Figure 4.31: Relaunch GeoBlacklight after adding new Schema

4.8 Index Solr

GeoBlacklight uses GeoBlacklight-Schema as a template for metadata documents indexed by Solr. Since we do not have our own dataset yet, we will use GeoBlacklight's rake command to index a small set of documents. The example is from the GeoBlacklight tutorial [24].

Now, you can use the next commands to download metadata documents that fit the GeoBlacklight-Schema. First, create a directory for the documents, move into it with the cd command, and download the given data into it with curl:

$ mkdir -p spec/fixtures/solr_documents
$ cd spec/fixtures/solr_documents
$ curl -O https://gist.Gith.....

The given metadata document is in JSON format. The actual address to download the file can be found in Appendix A.7.

Now, move back to the application root directory, and start the Solr server and Rails application as below. Figure 4.31 shows the result after you run the command:

$ rake geoblacklight:server

As you can see in Figures 4.32 and 4.33, we can now use GeoBlacklight for general searching and faceted searching. In the future, we will customize our GeoBlacklight with new features and a customized test dataset. However, as Figures 4.32 and 4.33 also show, it does not yet work as we wanted. The reason is that GeoBlacklight requires specific data types and field names, which we need to obtain from the SOLR team. This issue is explained in Section 4.5.2.



Figure 4.32: GeoBlacklight with new Schema

Figure 4.33: GeoBlacklight with Faceted Searching

Figure 4.34: SSH to Solr index server

4.9 Connection with Another Solr

We set up SSH tunneling to the Solr index server to forward its remote port 8983 to our local port 9983 (Figure 4.34).

Then, we launched the Rails server to launch GeoBlacklight. We only need to launch the GeoBlacklight application, not the default Solr that is installed in the folder (see Figure 4.35).

As Figures 4.36 and 4.37 show, the new data from the SOLR team's Solr core is displayed, but searching does not work and not all of the information is shown. In the next section, we will show the next version of our GeoBlacklight with a Solr core that has all the required fields.

Figure 4.35: Run Rails Server for Launching GeoBlacklight

Figure 4.38: Update Solr server connection in blacklight.yml

Finally, you can copy the deployed GeoBlacklight to the server where you would like to host the service. Our team used a server of Virginia Tech's Digital Library Research Laboratory¹. We copied our program into the cs5604f17_fe cluster. Below is the command that we used to log in to the server. The password was given by Prof. Fox², who taught CS 5604 in Fall 2017. When the cursor is blinking after the prompt "password: ", enter the password that you were given for the server/cluster.

$ ssh [email protected]
[email protected]'s password:

There are two ways to copy the project to the server: 1) copy the project from the local machine to the server using scp commands³, or 2) clone⁴ the project from GitHub.

After copying the project, we had to change the Solr connection setting in the configuration file.

After copying the final version of the software, GeoBlacklight can be run with the same command that we used before. Previously, we set up the Solr tunnel and then launched GeoBlacklight. However, once we change the Solr configuration in the blacklight.yml file under the config folder, we can run GeoBlacklight directly with the Solr that we want.

First, go to the folder of the program. For example, if your GeoBlacklight application is named gbl, run the following commands to go to the directory that contains blacklight.yml and open the file. You can use vi/vim to edit the file. For more vi commands, please see the footnote⁵.

$ cd gbl/config
$ vi blacklight.yml

As you can see in Figure 4.38, the development section of the blacklight.yml file needs to be updated with the correct Solr address and the correct folder.

¹ http://www.dlib.vt.edu/
² http://fox.cs.vt.edu
³ https://www.garron.me/en/articles/scp.html
⁴ https://help.github.com/articles/cloning-a-repository/
⁵ https://www.cs.colostate.edu/helpdocs/vi.html

Figure 4.39: Launch Rails on the mule.dlib.vt.edu Server

After that, we can launch GeoBlacklight on the server. Port 3000 was already used by another program on the server, so we had to ask the GTA of this class to open another port for our program, which was 3033. The server URL was mule.dlib.vt.edu. Now, you can launch the program with the following command:

$ rails server -b 0.0.0.0 -p 3033

After that, you will be able to see that your program is running on the server. Our program's URL is mule.dlib.vt.edu:3033. Figure 4.39 shows the program running under Linux, and the browser shows that GeoBlacklight is running successfully on the server.

Finally, you can use the screen command to keep the program running on the server even after you close your terminal. In another terminal window, log in to the server using ssh as described earlier in Section 4.9, then go into the application folder using the cd command⁶. After that, type screen. Below are the commands that you should use in the new terminal window. We assume that gbl is the application name.

$ ssh [email protected]
[email protected]'s password:
$ cd gbl
$ screen

Later, you can list the running screen sessions with screen -ls, reattach to one with screen -r [number]⁷, and stop it from there.

⁶ https://en.wikipedia.org/wiki/Cd_(command)

4.10 Final Version of GETAR Portal

4.10.1 Development History

After we successfully launched GeoBlacklight with a connection to the CS5604f17_solr team's Solr, we started work on GeoBlacklight customization and configuration. Figure 4.40 shows the first version of our GeoBlacklight implementation. The searching function worked, as Figure 4.42 shows, except for spatial searching. Geospatial search did not work because we did not yet have geospatial data from Solr in the format that GeoBlacklight requires; our team collaborated with the SOLR team to solve this issue. It was not easy to produce data in that format, and it took some time to generate it and insert it into the Solr core. We also removed unnecessary quick links on the front page, added a newly created GETAR [25] logo at the top, and resized the map on the main page (see Figure 4.41).

4.10.2 Functionalities of GETAR Portal

Figure 4.43 shows the final version of the GeoBlacklight application for GETAR. We changed the background theme to black and added Subject, Collection, and hash tag features. The final version of GeoBlacklight has four main features: facet searching, geospatial searching, history records, and user info. Figure 4.45 shows that our final version of GeoBlacklight displays facets on the left. As you can see in Figure 4.46, there are 8 facets that the user can use to find data with GeoBlacklight: Collection, Subject, Source Type, Year, Author (the Twitter account name), Place, Data Type, and hash tags. Additionally, users can create, modify, and cancel their accounts, as Figure 4.47 shows, and keep track of their search history, as you can see in Figure 4.44. The user can save the history, remove items from the history list, and reset the history.

4.10.3 Future Work

In the future, we hope that next year's Front End team will use our research results to deploy GeoBlacklight efficiently. Moreover, we think it would be great if they could deploy more customizations that users would like to use. For example, the future FE team could develop a way to show actual Twitter pictures or a cleaner view of the search results. We also would like to encourage work on better integration of visualization and data search, spatial and temporal representation and visualization of data, and document recommendation based on user search history and/or user profile.

⁷ https://www.tecmint.com/screen-command-examples-to-manage-linux-terminals/

Figure 4.40: First Version of the GeoBlacklight

Figure 4.41: Second Version of the GETAR Portal with GeoBlacklight

Figure 4.42: Second Version GETAR Portal’s Searching Result

Figure 4.43: Final Version of the GETAR Portal

Figure 4.44: Final Version GETAR Portal’s History Page

Figure 4.45: Final Version GETAR Portal’s Searching Page

Figure 4.46: Final Version GETAR Portal’s Facet Searching Sidebar

Figure 4.47: Final Version GETAR Portal’s User Page

Also, we leave the evaluation of the current GeoBlacklight as future work. Unfortunately, we did not have enough time to complete the whole research process, but we strongly believe that completing the implementation of GeoBlacklight on the server, and connecting it with Solr, was a meaningful success. We hope that this project can enable GeoBlacklight usage, support further advances in the Information Retrieval area, and help future FE teams.


Chapter 5

Visualization Design: Implementation, and Evaluation

Visualization is an important component of the FE team project: it maps users' preferences through visual channels and interactions, and shows the most relevant data elements to explore. To provide an effective and efficient visualization tool for GETAR users, there are four fundamental phases to carry out: data abstraction, to identify the abstract data types to be visualized; task abstraction, to understand the manner in which people use the visualization tool; design, to map the user tasks to the actual visual components; and implementation, to create and develop the visualization system.

This chapter introduces the data abstraction, task abstraction, design, and implementation of a tweet visualization system, TweetBank. TweetBank is a facet-based visual analytics system for exploring large tweet datasets, implemented with D3, the Solr REST API, the Jena data API, and PHP. TweetBank comprises a query interface and multiple interactive visualizations. The query interface obtains users' search criteria and scopes the range of tweet data. The visualizations present the faceted data for social connections, user demographics, time stream, word cloud, and geo-locations. With the query interface and interactive visualizations, users can iteratively explore different aspects of the tweet data and learn from them by reading the different visualizations.

5.1 Data Abstraction

As discussed in the related work, the data collected and processed by all teams has two main parts: the webpage documents and the tweet data. Our original plan was to make an integrated interface for both tweets and webpages. After talking to experts and professionals, we decided to focus our design and implementation on the tweet repository. We think the two types of data should not be presented in one interface because tweets and webpages share few common attributes and are used to analyze different information. For example, tweets contain short text, but carry much information on social activities such as mentioning, retweeting, replying, and hash-tagging. The tweeters also vary in their background: some, like public figures, have many followers/friends, while others, like common Twitter users, have relatively few [16, 4]. With such information, users can learn about the social aspects of an event. In contrast, webpages usually are written by professional editors who aim to give a comprehensive description of the events, including background, descriptions, statistics, public opinions, etc. [10]. Therefore this source of information is more suitable for learning about documented facts. Considering that tweet data is multi-dimensional and that its visualization with Solr queries is under-explored, we decided to investigate how to use Solr queries and D3 to visualize different aspects of tweet data. Tweet data is well suited to D3 visualization tools such as tables and networks (Figure 5.1).

Figure 5.1: Data Abstraction (figure from [13])

The tweet data (a JSON file provided by the GTA) from the Twitter REST API contains many fields, which can be categorized into two types: fields describing tweet information, and fields describing user information. Tweet metadata includes fields such as full text, retweet/mention information, geo-location, comment count, retweet count, favorite count, and created time. The user object nested in the parent object contains user information such as ID, follower number, friend number, total tweet number, and language preferences. These fields are processed by the CMT and SOLR teams, and indexed in Apache Solr [26] and Apache Jena [27]. Solr indexes the metadata of tweets, and Jena indexes triples of tweet communication. A full JSON object containing the processed tweet and user information is shown below. With this JSON data, we can visualize temporal information, social communication information, geospatial information, and textual information.

{

"user_statuses_count":865,

"retweeted":"true",

"user_screen_name":"ecesorize",

"user_location":"San Francisco, CA",

"place_country_code":"",

"in_reply_to_user_id_str":"null",

"user_friends_count":"6",

"user_deleted":"false",

"dct_isPartOf_sm":[

"DiamondRing"

],

"cluster":"DiamondRing",

"user_name":"ecesorize.com",

"category":[

"",

"NOT2017EclipseSolar2017"

],

"user_mentions_id_str":[

"248683068"

],

"contributor_enabled":"false",

"retweet_count":4,

"archive_source":"twitter-search",

"tweet_deleted":"false",

"dc_source_sm":[

"<a href=\"https://about.twitter.com/products/tweetdeck\"

rel=\"nofollow\">TweetDeck</a>"

],

"full_text_ner":"RT @CarlSian : Prepare for #Totality huge

#Astronomy event https://t.co/crWFlIN7Tp total



#SolarEclipse this fun t-shirt #Astrology #Illinoi ... ",

"dc_creator_sm":[

"ecesorize"

],

"user_lang":"en",

"lang":"en",

"id":"889151367612432385",

"layer_id_s":"889151367612432385",

"user_profile_image_url":

"http://pbs.twimg.com/profile_images/2254999521

/ecesorize-logo1_normal.PNG",

"dct_publish_dt":"2017-07-23T15:53:00Z",

"created_at":"2017-07-23T15:53:00Z",

"favorite_count":0,

"dct_hash tag_sm":[

"#Totality",

"#Astronomy",

"#SolarEclipse",

"#Illinoi"

],

"full_text":"prepare totality huge astronomy event total

solareclipse fun tshirtastrology illinoi .",

"coordinates":"",

"dct_mention_sm":[

"Carl Sian"

],

"user_mentions_name":[

"Carl Sian"

],

"user_id_str":"591962303",

"urls":[

"http://amzn.to/2tBmrTY"

],

"dc_title_s":"ecesorize",

"user_followers_count":14,

"topics":[



"midflight experience",

"forecast",

"safety"

],

"dc_subject_sm":[

"midflight experience",

"forecast",

"safety"

],

"source":"<a href=\"https://about.twitter.com/products/tweetdeck\"

rel=\"nofollow\">TweetDeck</a>",

"dc_description_s":"prepare totality huge astronomy

event total solareclipse fun tshirtastrology illinoi .",

"user_mentions_screen_name":[

"@CarlSian"

],

"place_full_name":"",

"user_favourites_count":121,

"hash tags":[

"#Totality",

"#Astronomy",

"#SolarEclipse",

"#Illinoi"

],

"_version_":1586189658955448300,

"geoblacklight_version":"1.0",

"layer_geom_type_s":"Point",

"dct_provenance_s":"Virginia Tech",

"dc_format_s":"tweet",

"layer_slug_s":"",

"dc_rights_s":"Public",

"solr_year_i":2017

}

After the SOLR team has indexed these fields, the FE team can generate queries and use the Solr REST API to get the tweet JSON objects. A Solr query, which is a typical HTTP request, can be divided into two parts: the JSON URL and the parameters. The JSON URL is the API URL provided by the SOLR team. Please note that this address can be visited only after the SSH tunnel has been set up. The tunnel command used this semester is "ssh -L 9983:solr2.dlrl:8983 cs5604f17_fe@hadoop.dlib.vt.edu -N". For our visualization implementation, we use the following URL to query Solr:

http://localhost:9983/solr/getar-cs5604f17-solr-collection_1_shard1_replica1/select

"9983" is the tunnel port. "getar-cs5604f17-solr-collection_1_shard1_replica1" is the name of the core. "select" is the query type.

The second part of the query consists of the parameters. They are appended to the URL after "?", and the first one is the main query, "q=". For example, to show the default results, the query will be:

http://localhost:9983/solr/getar-cs5604f17-solr-collection_1_shard1_replica1/select?q=*:*

The Solr REST API provides multiple functions to customize the returned JSON. The query functions provided by Solr can be roughly categorized into four main types: matching, faceting, grouping, and filtering. Matching limits the tweet documents that will be returned. Matching constraints have the format "<field_name>:<value>". For example, if the user searches the keyword "eclipse", the corresponding constraint will be "full_text:eclipse".

http://localhost:9983/.../select?q=full_text:eclipse

Multiple constraints can be combined with the AND, OR, and NOT operators. For example, if the user wants to search tweet text for the word "eclipse", but exclude posts by users whose language is English, the parameters will be "full_text:eclipse AND NOT user_lang:en". The NOT can be replaced by "-". Parentheses can be used to nest constraints. For example, "(full_text:eclipse OR full_text:totality) AND -user_lang:en" calls for tweets containing "eclipse" or "totality", but not posted by tweeters whose language is English. For approximate matching, a star ("*") can be added as a wildcard that matches any sequence of characters. For example, if the user wants to search for user locations that contain VA or Virginia, the query will be:

http://localhost:9983/.../select?q=user_location:*VA* OR user_location:*Virginia*

This query will match user locations such as "Blacksburg,VA" or "Virginia,USA". For numeric and date values, a range constraint can be expressed as "[<start value> TO <end value>]". For example, the tweets created between 08/21/2017 00:00:00 and 08/21/2017 11:59:59 can be queried by "created_at:[2017-08-21T00:00:00Z TO 2017-08-21T11:59:59Z]". This time stamp is in the UTC time zone.
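The matching constraints above contain spaces, brackets, and colons, so they should be percent-encoded before being sent over HTTP. The following is a minimal Python sketch of that step, not the team's D3/PHP implementation. The helper names build_select_url and range_clause are hypothetical, and the base URL assumes the SSH tunnel and core name quoted earlier.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the tunnelled Solr endpoint quoted in the text above.
SOLR_SELECT = ("http://localhost:9983/solr/"
               "getar-cs5604f17-solr-collection_1_shard1_replica1/select")


def range_clause(field, start, end):
    """Return a Solr range constraint such as created_at:[a TO b]."""
    return f"{field}:[{start} TO {end}]"


def build_select_url(q, **params):
    """Percent-encode the q expression and any extra parameters."""
    query = {"q": q, "wt": "json"}
    query.update(params)
    return SOLR_SELECT + "?" + urlencode(query)


q = "(full_text:eclipse OR full_text:totality) AND -user_lang:en"
url = build_select_url(q, rows=5)

# Decoding the query string gives back the original expression.
decoded_q = parse_qs(urlsplit(url).query)["q"][0]
```

Encoding the expression in one place keeps the readable Solr syntax in the program and leaves the escaping to the standard library.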

The second type of parameter is faceted search. It is a function provided by Apache Solr to aggregate the results and obtain statistics about a category of data. To perform faceted search, the facet parameter must be set to true with "facet=true". Then, for each facet, add a "facet.query=<field_name>:<value or range>". For example, to obtain the numbers of tweets in three follower number ranges, the following query can be used:

http://localhost:9983/.../select?q=*:*&rows=1&wt=json&indent=true&facet=true&facet.query=user_followers_count:[0 TO 99]&facet.query=user_followers_count:[100 TO 999]&facet.query=user_followers_count:[1000 TO 9999]

The result will be:

"facet_counts": {

"facet_queries": {

"user_followers_count:[0 TO 99]": 1528445,

"user_followers_count:[100 TO 999]": 3174622,

"user_followers_count:[1000 TO 9999]": 1277697

},

"facet_fields": {},

"facet_dates": {},

"facet_ranges": {},

"facet_intervals": {}

}

The effect of the query is to facet the field user_followers_count and aggregate the results by the three value ranges. The facet_counts object shows the number of tweets in each range, which can be used to generate a visualization.
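To illustrate how the facet counts could feed a chart, here is a small Python sketch that turns the facet_queries object into sorted bins. The function facet_bins and the record layout are hypothetical choices for this example; the sample values are copied from the response shown above.

```python
import re

# Sample copied from the facet response shown above.
facet_queries = {
    "user_followers_count:[0 TO 99]": 1528445,
    "user_followers_count:[100 TO 999]": 3174622,
    "user_followers_count:[1000 TO 9999]": 1277697,
}

# Matches keys of the form <field>:[<low> TO <high>] with integer bounds.
RANGE_KEY = re.compile(r"^(\w+):\[(\d+) TO (\d+)\]$")


def facet_bins(queries):
    """Turn facet.query keys into records sorted by their lower bound."""
    bins = []
    for key, count in queries.items():
        match = RANGE_KEY.match(key)
        if match is None:
            continue  # skip facet queries that are not integer ranges
        field, low, high = match.groups()
        bins.append({"field": field, "low": int(low),
                     "high": int(high), "count": count})
    return sorted(bins, key=lambda b: b["low"])


bins = facet_bins(facet_queries)
total = sum(b["count"] for b in bins)
```

Each record carries the range bounds and the count, which is the shape a bar chart of follower ranges would need.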

Another query type we used in our project is grouping. This query is used to find and count unique users among the search results. After setting the matching parameters, a query like "group=true&group.field=user_id_str" will aggregate the documents into groups, with one group containing all documents posted by the same user. This is a computationally intensive operation that slows down the processing of the query, so we limit the row number to ensure a reasonable waiting time. For the Solr core provided by the SOLR team, we use a row count lower than 1000 in our implementation.

http://localhost:9983/.../select?q=full_text:eclipse&rows=100&wt=json&indent=true&group=true&group.field=user_id_str
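A grouped response can be reduced to per-user tweet counts with a few lines of Python. This is a sketch under assumptions: it relies on Solr's default grouped result layout (a "grouped" object keyed by the group field, holding "groups" with "groupValue" and "doclist"), the function tweets_per_user is hypothetical, and the sample response is made up for illustration rather than taken from the project's core.

```python
def tweets_per_user(response, field="user_id_str"):
    """Map each group value (a user ID) to its number of matching tweets."""
    groups = response["grouped"][field]["groups"]
    return {g["groupValue"]: g["doclist"]["numFound"] for g in groups}


# Hypothetical, abbreviated response: the IDs and counts are illustrative.
sample = {
    "grouped": {
        "user_id_str": {
            "matches": 5,
            "groups": [
                {"groupValue": "591962303",
                 "doclist": {"numFound": 3, "start": 0, "docs": []}},
                {"groupValue": "248683068",
                 "doclist": {"numFound": 2, "start": 0, "docs": []}},
            ],
        }
    }
}

counts = tweets_per_user(sample)
unique_users = len(counts)  # number of groups returned, capped by rows
```

Because the rows parameter caps the number of groups returned, unique_users counts only the users in the returned page, not necessarily all users matching the query.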

The last type of parameters includes filters for JSON objects and fields. "fq" limits the global domain of the search. "rows=5" limits the returned objects in the doc list to 5. "fl=full_text created_at" makes each object contain only the fields "full_text" and "created_at". "sort=<field> ASC/DESC" orders the returned JSON objects. For the information need "show the full text and follower # of 5 tweets containing 'eclipse', where the results are sorted by follower #", the syntax will be:

http://localhost:9983/.../select?q=full_text:eclipse&rows=5&fl=full_text user_followers_count&sort=user_followers_count DESC&wt=json&indent=true

Then the returned JSON will be:



{

"responseHeader": {

"status": 0,

"QTime": 62,

"params": {

"sort": "user_followers_count DESC",

"fl": "full_text user_followers_count",

"indent": "true",

"q": "full_text:VA",

"_": "1510157205031",

"wt": "json",

"rows": "5"

}

},

"response": {

"numFound": 5016,

"start": 0,

"docs": [

{"full_text": "RT @Rebmangas: El #Eclipse de este 21 de agosto

inicia en el Océano Pacífico y termina en el Océano Atlántico,

en la franja el eclipse va a...",

"user_followers_count": 2235887},

{"full_text": "Si no me dejaron ciego los putazos que me ponía mi

mamá por no obedecer, ya parece que me va a dejar ciego un

#Eclipse",


{"full_text": "Así va el #eclipse2017 desde #SouthCarolina donde

@alvarovr está presente. Vayan a https://t.co/8IBSVUjkDh para

verlo En Vivo! https://t.co/4yrZMWV2op",


{"full_text": "RT @tcd004: The view from Arlington,

VA #Eclipse2017 #EclipsePBS https://t.co/09f0OjILYX",


{"full_text": "How to watch today’s solar #eclipse in Washington,

Va., and Md.: https://t.co/gmHn8FESgr",



"user_followers_count": 960961}

]

}

}
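The filter parameters and the returned doc list can be handled together, as in the Python sketch below. The helpers filter_params and ranked_rows are hypothetical names, and the sample response is an abbreviated stand-in: the follower counts come from the listing above, but the tweet texts are placeholders.

```python
from urllib.parse import urlencode


def filter_params(q, fields, sort_field, rows=5):
    """Assemble the rows, fl, and sort parameters described above."""
    return {
        "q": q,
        "rows": rows,
        "fl": " ".join(fields),         # space-separated field list
        "sort": f"{sort_field} DESC",   # largest values first
        "wt": "json",
    }


def ranked_rows(response, sort_field):
    """Return (value, full_text) pairs, skipping docs without the field."""
    rows = []
    for doc in response["response"]["docs"]:
        if sort_field in doc:
            rows.append((doc[sort_field], doc.get("full_text", "")))
    return rows


params = filter_params("full_text:eclipse",
                       ["full_text", "user_followers_count"],
                       "user_followers_count")
query_string = urlencode(params)

# Abbreviated stand-in for the response above; the texts are placeholders.
sample = {"response": {"numFound": 5016, "start": 0, "docs": [
    {"full_text": "first tweet", "user_followers_count": 2235887},
    {"full_text": "tweet without a follower count"},
    {"full_text": "last tweet", "user_followers_count": 960961},
]}}

rows = ranked_rows(sample, "user_followers_count")
```

Skipping documents that lack the sort field is a defensive choice for this sketch; whether such documents occur depends on how the core was indexed.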

The social network graph is created with both the Solr API and the Jena API. Apache Jena is an open source Semantic Web framework [27]. It provides an API to extract data from RDF graphs [34]. RDF stores data in "triples" of subject, predicate, and object. In this semester's project, the triples describe the social communications extracted from the JSON data provided by the GTA. The Jena API can return JSON as results. Subjects and objects are the user IDs. Predicates, or the relationships between subjects and objects, are the at-mention information. The CMT team defines the syntax of the API query. Subjects and predicates are in the form of URLs, and the objects are plain ID strings. The query to get the first 1000 triples is:

http://mule.dlib.vt.edu:3030/getar/query?query=

prefix sub: <http://example.org/>

prefix pred: <http://xmlns.com/SNR/0.1/>

SELECT ?s ?p ?o WHERE { ?s ?p ?o} LIMIT 1000

The first line is the URL of the Jena API provided by the CMT team. The second and third lines are the prefixes of the subject and predicate. The last line is the RDF (SPARQL) query. Entities starting with question marks are variables that appear in the returned JSON results. To query all the users who mention the user with ID 771064506, the query is:

http://mule.dlib.vt.edu:3030/getar/query?query=

prefix sub: <http://example.org/>

prefix pred:<http://xmlns.com/SNR/0.1/>

SELECT ?771064506 WHERE {?771064506 pred:mentions "771064506"}

In this query, we name the returned variable '?771064506'. We do this so that the user ID becomes the key in the returned JSON, which helps in processing the data. "pred:mentions" means we look for the relationship called "mentions", and "771064506" is the object value, the user ID. The returned JSON is:

{

"head":{

"vars":[

"771064506"

]



},

"results":{

"bindings":[

{

"771064506":{

"type":"uri",

"value":"http://example.org/434682005"

}

},

{

"771064506":{

"type":"uri",


}

},

{

"771064506":{

"type":"uri",


}

},

...

]

}

}

In the returned JSON, "bindings" is a JSON array that contains all the connections. Each "value" is an RDF URL, and the number at its end is the ID of a user who mentions "771064506". We can process this JSON to create a social network graph.
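As a rough illustration of this processing step, the Python sketch below builds the mentions query URL and converts the bindings into directed edges for a network graph. The helper names mentions_query_url and mention_edges are hypothetical, the endpoint and prefixes are the ones quoted above, and the sample contains only the one binding whose value is visible in the listing.

```python
from urllib.parse import urlencode

# Endpoint and prefixes as quoted in the text above.
JENA_ENDPOINT = "http://mule.dlib.vt.edu:3030/getar/query"
PREFIXES = ("prefix sub: <http://example.org/>\n"
            "prefix pred: <http://xmlns.com/SNR/0.1/>\n")


def mentions_query_url(user_id):
    """Build the URL-encoded query for all users who mention user_id."""
    sparql = PREFIXES + (
        f'SELECT ?{user_id} WHERE {{ ?{user_id} pred:mentions "{user_id}" }}'
    )
    return JENA_ENDPOINT + "?" + urlencode({"query": sparql})


def mention_edges(result, user_id):
    """Return (source, target) pairs meaning 'source mentions target'."""
    edges = []
    for binding in result["results"]["bindings"]:
        uri = binding[user_id]["value"]
        source = uri.rstrip("/").rsplit("/", 1)[-1]  # trailing ID of the URL
        edges.append((source, user_id))
    return edges


# Abbreviated sample: only the binding whose value is shown above.
sample = {
    "head": {"vars": ["771064506"]},
    "results": {"bindings": [
        {"771064506": {"type": "uri",
                       "value": "http://example.org/434682005"}},
    ]},
}

url = mentions_query_url("771064506")
edges = mention_edges(sample, "771064506")
```

A list of (source, target) pairs is a common input shape for node-link layouts, so the same edges could be serialized to JSON for a D3 force-directed graph.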

5.2 Task Abstraction

Twitter is a massive social networking site tuned towards fast communication. Twitter has played a prominent role in socio-political events: elections, disasters, and so on. After talking with several experts and professors about their requirements, we found that the first and most important task is letting users efficiently and effectively find the most relevant subset of tweets. Through visualization techniques such as zooming and panning, and overview and detail, we can provide interactions that let users express their preferences efficiently and obtain the most relevant subset they need.

Figure 5.2: Task-Abstraction [12]

Prior studies suggest that categorizing social media data and studying emerging topic categories is an important pathway to derive knowledge from social media repositories such as tweets. The categories can be identified by machine-learning techniques, by user interactions, or by a combination of the two. By studying the characteristics of an individual category or comparing categories, analysts interpret features of tweet groups, learn trends of topics, and identify similarities and differences between the different elements. Goulding summarized grounded methods in studying social media data [12]. Identifying research problems and encoding tweets into concepts are central to the building of theories (Figure 5.2). From the literature, we think TweetBank needs to support the identification and refinement of the tweet categories, and the comparison of different categories. We choose to allow users to define the categories by searching keywords and hash tags to do the initial filtering. As the search result may contain hundreds (or even thousands) of tweets, users need methods to further filter the results and understand them through visualization. Widgets and tools that allow the user to filter the results need to be considered in the interface design. During the filtering, a visualization of basic metrics such as retweet number will help the user assess the subset of tweets, and a visualization of key information such as top hash tags or keywords can keep the user aware of that information. After filtering all categories, the user needs to compare the metrics among all categories.

The conceptualization of categories in tweet analytics can be achieved through multiple channels. Users can search for a term or a phrase to obtain the set of tweets that contain the keyword(s). Other properties of tweet data can also be used to capture semantics. For example, if an analyst wants to study the number of tweets posted before and after the 2017 solar eclipse, the analyst might form two time ranges. Analysts who care about user categories might define a type of tweeter by follower or friend counts. Our system should support creating concepts in different ways, such as keyword searching, time range picking, and category selection.

After defining the concepts, a visualization application with multiple views can show details of each concept. For example, a geo-map could show the tweet locations; a tag cloud could present the currently most important tweet content through keywords; a time-line could show the trends of the searched keywords; and so on.

5.3 Design and Prototype

Based on the abstract data block described in Section 5.1 and the abstract tasks in Section 5.2, we create and manipulate visual representations tailored to our visualization application.

Based on the idioms described in the related work, there are two major concerns in our visual interface design. One set of design choices covers how to create a single image of the data. For this, we divide our visualization system into two parts: the query part and the visualization part.

The query part of our visualization limits the global scope of the data. In our design, the system provides three types of constraints: keyword, field range, and time range. The keyword constraint asks the user to type the words to search for; only the tweets containing the keywords are added for analysis. The field range constraint allows the user to select one attribute and pick a range of its values; it filters the tweets by metadata such as number of likes, number of retweets, or number of comments. The time range constraint allows the user to select a specific time period to investigate; only tweets posted between the start and end time are included in the visualization.
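The three constraint types can be combined into a single Solr query expression. The following is a minimal sketch rather than the project's actual code: the function name buildQuery and the field names text, created_at, and retweet_count are assumptions chosen for illustration. Only the Solr range syntax field:[a TO b] and the AND operator are standard.

```javascript
// Hypothetical helper: combine keyword, field-range, and time-range
// constraints into one Solr "q" expression. All field names are assumptions.
function buildQuery(constraints) {
  var clauses = [];
  if (constraints.keyword) {
    // Quote the keyword so a multi-word phrase is matched as a phrase.
    clauses.push('text:"' + constraints.keyword.replace(/"/g, '\\"') + '"');
  }
  if (constraints.fieldRange) {
    var f = constraints.fieldRange;
    clauses.push(f.field + ":[" + f.min + " TO " + f.max + "]");
  }
  if (constraints.timeRange) {
    var t = constraints.timeRange;
    clauses.push("created_at:[" + t.start + " TO " + t.end + "]");
  }
  // With no constraints, fall back to Solr's match-all query.
  return clauses.length > 0 ? clauses.join(" AND ") : "*:*";
}

var q = buildQuery({
  keyword: "solar eclipse",
  fieldRange: { field: "retweet_count", min: 10, max: 100 },
  timeRange: { start: "2017-08-20T00:00:00Z", end: "2017-08-22T00:00:00Z" }
});
console.log(q);
```

Keeping each constraint as its own clause makes it straightforward to add or drop a constraint when the user changes one widget, without rebuilding the others.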

Based on the query result, the visualization should present all the important results at once through multiple views. To let users make sense of all the visualized data as a whole, brushing and linking are necessary.

Figure 5.3: Prototype of Visualization Design

5.3.1 Dynamic Queries: Searching and Filtering

Dynamic queries continuously update the data that is filtered from the database and visualized. The dynamic searching and filtering interaction is mainly based on keyword selection and feature filtering through slider bars. By searching keywords and narrowing down numerical features, users can reach a small, relevant subset of elements that they are able to explore and make sense of.

5.3.2 Visualization with Multi-views

The visual encoding idiom controls exactly what users see, since there are many attributes to display at the same time. These include the tweet text content, the overall distribution of tweet likes, the tweet distribution, geo-location, and the social network data between tweeters and tweets. We therefore use multiple views [20] to show and update all the features at the same time, based on user interactions. Please refer to Section 5.4 for more details about the design and implementation.

5.4 Implementation

5.4.1 System Overview

Based on our design and prototype, we have implemented a basic version of a visualization system for large collections of tweets. The visualization (see Figure 5.4) has three main parts: (a) the query interface, (b) the tweet card list, and (c) the visualization views. The query interface provides search and filter bars that let users search keywords with filters. The tweet card list shows users the search results and provides further interactions to filter the tweet list by clicking keywords. The visualization views include several sub-views (social network graph, histogram, time-line, tag cloud, and geo-maps) to provide users with all the important information at once.

Figure 5.4: Overview of TweetBank

Our system is implemented in two parts: the front part and the server part. The front part uses JavaScript, jQuery [31], and D3 [29]. We also use Bootstrap [30] to implement common web widgets such as the time picker and drop-down list. The server part is implemented in PHP. As the Solr API is not open to the public and only allows access from localhost, we use PHP (PHPRouter.php) to execute the query on the server and send the returned JSON back to the front part. For the social network, as we need parallel execution of Jena queries, we use another PHP file (PHPSocialNetwork_Jena.php) to perform the parallel queries. The flow from user input to visualized results is depicted in Figure 5.5.


Figure 5.5: System implementation overview

5.4.2 Query interface

The visualization offers two ways to filter the search results: limiting the range and selecting word enclosure. Our system design is illustrated in Figure 5.3. On the left is the searching and filtering area. The user can create categories with tabs. The category tabs share the same interface, but the user can use them to scope different sets of tweets. The initial tweet list is generated by searching keywords or hash tags. The user types in keywords, and the search key is sent to Solr to retrieve JSON objects. The SOLR team is responsible for returning to the Front End a JSON object containing the matched tweets, with attributes including time-stamp, number of likes, number of retweets, number of comments, and so on. For webpages, other metrics will be discussed with the responsible team. The user can then filter the results by selecting an attribute range. For example, if the user only wants to look at tweets/webpages between 2007 and 2017, they pick the attribute [time], and the attribute selection shows “from [ ] to [ ]” boxes to allow the user to select the time range. Widgets for different types of attributes will be assessed during the implementation.

The second filtering method is word enclosure. As the tweets are tagged with part-of-speech and named-entity recognition labels, keywords are highlighted in the tweet box (Figure 5.7). Clicking on a keyword allows the user to filter the tweets. After the user clicks on one word, only the tweets containing that word are shown in the visualization; tweets without the word are grayed out in the search result. The user can click on multiple words to include more tweets in the visualization.
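The word-enclosure behavior can be sketched as a small filtering function. This is an illustrative sketch only: the function name applyWordEnclosure and the keywords array on each tweet are assumptions, not names taken from the project. A tweet stays active if it contains at least one of the clicked words, which matches the description that clicking more words includes more tweets.

```javascript
// Hypothetical sketch of word-enclosure filtering.
// Each tweet is assumed to carry a `keywords` array of its highlighted words.
function applyWordEnclosure(tweets, selectedWords) {
  return tweets.map(function (tweet) {
    // With no word selected, every tweet stays active.
    var active = selectedWords.length === 0 ||
      selectedWords.some(function (word) {
        return tweet.keywords.indexOf(word) !== -1;
      });
    // Inactive tweets would be grayed out in the list and left out of the views.
    return { id: tweet.id, active: active };
  });
}

var tweets = [
  { id: "t1", keywords: ["eclipse", "NASA"] },
  { id: "t2", keywords: ["football"] },
  { id: "t3", keywords: ["Blacksburg", "football"] }
];
var result = applyWordEnclosure(tweets, ["eclipse", "Blacksburg"]);
console.log(result);
```

Returning a flag per tweet, rather than a shortened list, lets the tweet card list keep the grayed-out tweets visible while the visualization views read only the active ones.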

Figure 5.6: Query Interface

5.4.3 Tweet Card

Tweets are the most important query results. To show the most relevant tweets and let users efficiently gain insights into the current query results, we designed and implemented the tweet card (Figure 5.7), based on the tweet list used by Twitter.com and the Trello Card used by Trello.com. All the important metadata of a tweet is shown: date, author name, author ID, tweet content, hash tags, and the author avatar. In addition, to let users filter query results quickly, we highlight the important keywords (entities) in the tweet content, as identified by NER (Named Entity Recognition) or POS (Part-Of-Speech) tagging. These highlighted entities can be clicked and used to highlight all other tweets in the list.

5.4.4 Visualization with Multi-views

After searching and filtering, the user can see the visualization on the right part of the screen. This part provides a local overview of the search results, showing basic metrics of the matched tweets. The visualization is updated along with the user interaction. This part contains charts about the tweets, including histograms of the number of likes, number of comments, and number of retweets.


Figure 5.7: Tweet Card

A tag cloud indicating the importance of the hash tags is shown under the basic metrics. At the bottom is a graph showing user-tweet relationships. Depending on how other teams are able to obtain the connection information, we can add a node-and-link visualization showing who posted which tweets.

Another set of questions involves how to manipulate that representation dynamically; the interaction idiom controls how users change what they see. The main interactions used in our system are dynamic searching and filtering, brushing, and linking.

To make the tweet card flexible enough to accommodate more components, we wrote a React-like function that generates different tweet cards based on different inputs. The demo code for the React-like tweet card generation function is as follows.

    // Build the HTML string for one tweet card from a tweet object.
    function tweetCard(tweet) {
      var cardName = '<div class="card-name">' + tweet.id + '</div>';

      var cardContainer = '<div class="card-container">' +
        '<div class="card-avatar-wrap">' +
        '<img src="./resources/img/twitter-512.png" alt="">' +
        '</div>' +
        '<div class="card-content">' +
        '<div class="card-author-name">' + tweet.user_name + '</div>' +
        '<div class="card-author-screen-name">@' +
        tweet.user_screen_name + '</div>' +
        '<p>' + tweet.full_text_ner + '</p>' +
        '</div>' +
        '</div>';

      var cardFooter = '<div class="card-name">' + tweet.hash_tags + '</div>';

      // The posting time is optional; add it only when the tweet has one.
      var cardTimer = '';
      if (tweet.hasOwnProperty('post_time')) {
        cardTimer = '<div class="card-name">' + tweet.post_time + '</div>';
      }

      return "<div class='tweet-card' id=" + tweet.id + ">" +
        cardName + cardContainer + cardFooter + cardTimer + "</div>";
    }

Social Network Graph

Another important dataset that can be visualized for users is the social network between tweeters and tweets. There are several relationships between them that we could visualize if available: follower, retweet, etc. Right now, we only use the basic at-mention relationship between tweeters: there is a link between two tweeters if one mentions the other in a tweet. The social network graph is developed using the basic force-directed graph of D3.js (see Figure 5.8). In the graph, nodes represent real entities such as tweeters or tweets. Other attributes of entities can be represented through the size, color, or shape of nodes. For example, a blue circle represents a tweeter and a red rectangle represents a tweet. Links represent relationships between two entities. Other attributes of relationships, such as the number of comments between two tweeters, can be shown through the width of the link.

Figure 5.8: Social Network Graph

For the social graph, we aim to visualize the communication between people who talk about a topic. The selection of the topic is based on the keyword search and the range picking tools. Visualizing communication on a topic lets the user explore how people talk about the searched concept, because the user can infer the key players in a conversation graph. In our implementation, the following method is used to generate a social network:

1. For a search key, get the top 500 unique users who mentioned the key in their tweets (for hash tag searching and user name searching, search in the corresponding fields). The unique IDs are obtained through "group=true&group.field=user_id_str".

2. Then iterate through all 500 unique IDs. For each ID, get all the user IDs it at-mentions. Iterate over these user IDs and keep only the ones that are also in the 500-ID list.

3. For the IDs that have at least one connection, add them as nodes in the social graph.

4. Iterate through all connections and add unique connections as links in the graph.
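The four steps above can be sketched as one function that turns a list of user IDs and their mentions into nodes and links. This is a hedged illustration, not the code in SocialNetworkGraph.js: the name buildMentionGraph and the shape of mentionsByUser (a map from a user ID to the IDs that user mentioned) are assumptions, and the sketch treats a->b and b->a as two distinct directed connections.

```javascript
// Hypothetical sketch of steps 1-4. In the real system the user list comes
// from Solr and the mentions come from Jena; here both are plain inputs.
function buildMentionGraph(topUserIds, mentionsByUser) {
  var topSet = Object.create(null);
  topUserIds.forEach(function (id) { topSet[id] = true; });

  var linkSeen = Object.create(null);
  var nodeSeen = Object.create(null);
  var links = [];

  topUserIds.forEach(function (source) {
    (mentionsByUser[source] || []).forEach(function (target) {
      // Step 2: keep only mentions of users that are also in the top list.
      if (!topSet[target]) return;
      // Step 4: add each connection only once.
      var key = source + "->" + target;
      if (linkSeen[key]) return;
      linkSeen[key] = true;
      links.push({ source: source, target: target });
      // Step 3: only users with at least one connection become nodes.
      nodeSeen[source] = true;
      nodeSeen[target] = true;
    });
  });

  var nodes = topUserIds
    .filter(function (id) { return nodeSeen[id]; })
    .map(function (id) { return { id: id }; });
  return { nodes: nodes, links: links };
}

var graph = buildMentionGraph(
  ["a", "b", "c", "d"],
  { a: ["b", "x", "b"], b: ["a"], c: ["z"] }
);
console.log(graph);
```

The resulting object uses the nodes/links shape that D3 force layouts commonly consume; with string IDs in the links, a link force would need an id accessor such as d3.forceLink().id(function (d) { return d.id; }).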

Tweets by user categories through bar chart and pie chart

A histogram is a graphical representation of the distribution of tweet data. For each individual attribute, such as number of followers, we can count the number of tweets posted by tweeters within different ranges of follower numbers. The ranges can be predefined based on user need. When we perform the query, these ranges are formatted into different facets in the Solr query. Originally we thought that showing the number of tweeters would be a better choice. However, that requires Solr 6.0 or later to calculate unique values, while the current Solr is 4.10, which makes it impossible to implement a user count. The current Solr can only count the number of tweets within each field range. We use these numbers to present tweet distributions with a bar chart or pie chart (Figure 5.9).

Figure 5.9: Two charts for tweets by user categories

The user categories use two techniques provided by the Solr API. The facet method places the tweet documents into bins, while the group method aggregates the data by unique values. Consider the query below:

    var query;
    query = global_constrain() + " AND " +
      global_query(key) +
      "&group=true&group.field=" +
      field_user_id_str + "&group.facet=true" +
      "&facet=true&" + facetQuery +
      "&fl=" + attr +
      "&rows=1&wt=json&indent=true" +
      "&json.wrf=solr_get_user_info_callback";

"group=true" groups the returned results. "group.field=user_id_str" groups the results by unique user IDs. "facet=true" turns on faceting for the returned results. "facet.field" indicates which field to facet on. "facet.query" indicates the range of each bin. For example, the bin of "the documents with follower count between 10 and 100" is "facet.query=user_follower_count:[10 TO 100]". Each query can contain multiple facet queries at the same time. We can use "facet_queries" in the returned JSON object to retrieve the values:

facet_queries: {









}

These values can be added to the stacked bar chart and pie chart.
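As an illustration of how the bins travel to Solr and back, the sketch below builds one facet.query parameter per follower range and converts a facet_queries object into rows for a bar or pie chart. The function names, the bin boundaries, and the counts in the sample response are all made up for illustration; only the facet.query parameter and the facet_queries key come from the text above. URL-encoding the expression is an assumption of this sketch, since the original query string is passed through the PHP router.

```javascript
// Build one "facet.query" parameter per bin. `ranges` is a list of
// [min, max] pairs; the boundaries used below are hypothetical.
function buildFacetQuery(field, ranges) {
  return ranges.map(function (r) {
    var expr = field + ":[" + r[0] + " TO " + r[1] + "]";
    return "facet.query=" + encodeURIComponent(expr);
  }).join("&");
}

// Turn the "facet_queries" object of a response into chart rows.
function facetCountsToChartData(facetQueries) {
  return Object.keys(facetQueries).map(function (label) {
    return { label: label, count: facetQueries[label] };
  });
}

var facetQuery = buildFacetQuery("user_follower_count", [[0, 9], [10, 100]]);

// Made-up response fragment: each facet query string maps to a tweet count.
var sampleFacetQueries = {
  "user_follower_count:[0 TO 9]": 12,
  "user_follower_count:[10 TO 100]": 30
};
var chartData = facetCountsToChartData(sampleFacetQueries);
console.log(facetQuery);
console.log(chartData);
```

Each row in chartData carries the bin label and its tweet count, which is the minimum a stacked bar chart or pie chart needs.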

Time-line

The time-line is a visualization of how the number of tweets varies over time; it should show the number of tweets posted at different times. An example time-line visualization (Figure 5.10), implemented with D3, is the StreamGraph. The x axis (horizontal axis) is time, evenly dividing the overall time range selected by the time picker. The y axis (vertical axis) is the number of tweets. The color encoding differentiates the terms in the search key. The width of each strip indicates the number of tweets posted at different times.
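The even division of the selected time range can be sketched as a simple binning step that produces the per-interval counts a stream graph would draw. The function name binTweetsByTime and the use of plain numeric timestamps are assumptions for illustration; the project may compute these counts differently, for example on the Solr side.

```javascript
// Hypothetical sketch: count tweets in `binCount` evenly spaced time bins
// between `start` and `end` (numeric timestamps, e.g. milliseconds).
function binTweetsByTime(timestamps, start, end, binCount) {
  var counts = [];
  for (var i = 0; i < binCount; i++) {
    counts.push(0);
  }
  var width = (end - start) / binCount;
  timestamps.forEach(function (t) {
    if (t < start || t > end) return; // outside the selected time range
    // A timestamp equal to `end` is placed in the last bin.
    var index = Math.min(Math.floor((t - start) / width), binCount - 1);
    counts[index] += 1;
  });
  return counts;
}

var counts = binTweetsByTime([0, 5, 10, 25, 39, 40, 50], 0, 40, 4);
console.log(counts);
```

Running the same function once per search term would give one series per term, which corresponds to one colored strip in the stream graph.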

Tag Cloud

To give users an overview of the current search results, we provide a word cloud view (Figure 5.11), a visual representation of text data through important keywords. In the view, tags are usually single words, and the importance of each tag is shown with font size or color. Since we do not yet have an importance summary for keywords from other teams, we use the frequency of words in the tweets to represent their importance. The color of a tag has the same meaning as the color of the keywords in the tweet list: different colors mean that the words belong to different NER (Named Entity Recognition) or POS (Part-Of-Speech tagging) categories. However, without tweet features from the tweet team, we simply display all the words that are "hash tags" in the same color, purple. With more information about each keyword, more meaningful tag cloud visualizations could be designed and implemented.

Figure 5.10: Stream graph with time series

The tag cloud view is built on d3-cloud [32], a D3.js extension library that provides a Wordle-inspired word cloud layout written in JavaScript. The tag cloud can be integrated seamlessly into our framework because it uses similar D3.js syntax. Here is an example of tag cloud code similar to the code we implemented in our project:

    var Canvas = require("canvas");
    var cloud = require("../");

    var words = ["Hello", "world", "normally", "you", "want", "more"]
      .map(function(d) {
        return {text: d, size: 10 + Math.random() * 90};
      });

    cloud().size([960, 500])
      .canvas(function() { return new Canvas(1, 1); })
      .words(words)
      .padding(5)
      .rotate(function() { return ~~(Math.random() * 2) * 90; })
      .font("Impact")
      .fontSize(function(d) { return d.size; })
      .on("end", end)
      .start();

    function end(words) { console.log(JSON.stringify(words)); }

Figure 5.11: Tweets Tag Cloud

Geo-map

The Twitter geo-location data can be important for users, so we designed a geo-map (Figures 5.12 and 5.13) for tweet visualization. The location information not only helps users get an overview of the currently searched event, but also lets them check whether they have narrowed down to a specific event. For example, a VT student who searches for the event "football game" can tell that the query has led to the desired results if the geo-map locates the returned tweets in Blacksburg, in the state of VA, instead of NY. There are two geo-map views in our project: the world map and the US map. Right now, the two views are used in combination to let users gain insights into the current tweet locations.

To make our map cooperate well with the other views and with users' explorations, we provide the map views at several levels: the countries layer, which shows the tweet distribution all over the world; the country layer, which shows the tweet distribution across the states or provinces of a country; and the state/province layer, which shows the tweet distribution within a specific state. The color depth represents the number of tweets in each region: a deeper color means more tweets, while a lighter color means fewer tweets.
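The color-depth encoding can be sketched as a linear interpolation from a light color to a dark color. The function name countToColor and the two end colors (white and a dark blue) are assumptions for illustration; the project may instead rely on a D3 scale for this mapping.

```javascript
// Hypothetical sketch: map a region's tweet count to a color between
// white (no tweets) and dark blue (the maximum count on the map).
function countToColor(count, maxCount) {
  // Normalize to [0, 1]; an empty map is drawn entirely in the lightest color.
  var t = maxCount > 0 ? Math.min(Math.max(count / maxCount, 0), 1) : 0;
  // Interpolate each RGB channel from white (255,255,255) to (8,48,107).
  var r = Math.round(255 + (8 - 255) * t);
  var g = Math.round(255 + (48 - 255) * t);
  var b = Math.round(255 + (107 - 255) * t);
  return "rgb(" + r + "," + g + "," + b + ")";
}

console.log(countToColor(0, 100));
console.log(countToColor(100, 100));
```

The returned string can be used directly as an SVG fill value for a region's path.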

The map view is mainly based on D3.js, the d3-geo layout, and TopoJSON [33]. The geographic path generator of d3-geo is similar to the shape generators in d3-shape: given a GeoJSON geometry or feature object, it generates an SVG path data string. TopoJSON is an extension of GeoJSON that encodes topology. Rather than representing geometries discretely, geometries in TopoJSON files are stitched together from shared line segments called arcs.

Here is an example of map code similar to the code we implemented in our project:

    var svg = d3.select("svg");
    var path = d3.geoPath();

    d3.json("./us.json", function(error, us) {
      if (error) throw error;

      svg.append("g")
          .attr("class", "states")
        .selectAll("path")
        .data(topojson.feature(us, us.objects.states).features)
        .enter().append("path")
          .attr("d", path);

      svg.append("path")
          .attr("class", "state-borders")
          .attr("d", path(topojson.mesh(us, us.objects.states,
            function(a, b) { return a !== b; })));
    });

5.4.5 Solr API

The Solr API is a JS file containing a set of paired methods to form and send JSON queries. The sending is routed through PHPRouter.js and then PHPRouter.php. Each method pair corresponds to one or two visualization views. For example, for the tweet card view and word cloud, two functions are defined: solr_get_new_tweet_list and solr_get_search_callback. solr_get_new_tweet_list takes the search term(s) as its parameter. The variable "query" contains the query string with the global constraint and query, the row count, and the data type. In the query string, "json.wrf=solr_get_search_callback" indicates which method will be called. When Solr returns the JSON object, "solr_get_search_callback" is called with the JSON as its parameter. The query_solr method sends query parameters to different locations depending on the Boolean value "debugMode". For debugging (when debugMode=true), we use a port tunnel and query "localhost". For deployment (when debugMode=false), we use the PHP URL to get a JSON object. PHP URLs are stored in PHPRouter.js. In the solr_get_search_callback method, we decode the JSON and extract the information needed by the visualization. "addSolrDataToCardView" and "addSolrDataToWordCloud" send the extracted data to the corresponding visualization class (D3 js).

Figure 5.12: Geo-location of Tweets: Countries and States

Figure 5.13: Geo-location of Tweets: State


    // Query Solr to retrieve search results
    // for the card view and word cloud.
    function solr_get_new_tweet_list(key) {
      var query =
        global_constrain() +
        " AND " + global_query(key) +
        "&rows=100&wt=json&indent=true&" +
        "json.wrf=solr_get_search_callback";
      query_solr(query);
    }

    // Callback for retrieving the search results list.
    function solr_get_search_callback(response) {
      var json = String(JSON.stringify(response));
      var root = JSON.parse(json);
      var rowRoot = root.response.docs;

      var tweetList = [];
      for (var i = 0; i < rowRoot.length; i++) {
        var tweet = Tweet(rowRoot[i]);
        tweetList.push(tweet);
      }

      // Count how often each keyword occurs across the tweets.
      var keyList = {};
      tweetList.forEach(function (tweet) {
        var content = tweet.full_text_ner;
        content.forEach(function (keyword) {
          if (keyList.hasOwnProperty(keyword)) {
            keyList[keyword] += 1;
          } else {
            keyList[keyword] = 1;
          }
        });
      });

      spaceTags(Object.keys(keyList));
      addSolrDataToCardView(tweetList); // return to the CardView
      addSolrDataToWordCloud(keyList);
    }

5.4.6 PHP Router

Two router files written in PHP are used in our project to communicate with the Solr database and the Jena database. PHPRouter.php queries the Solr REST API through curl. The curl approach uses the "GET" method to fetch the HTTP response from the Solr server. The "echo" statement prints the result, which is received by the "success" function in PHPRouter.js. As we specified the callback method in the Solr API, the "result" object is a string that starts with the method name, followed by the JSON object as its parameter, so that the JSON object is passed to the method.

    <?php
    // PHPRouter.php
    $param = $_GET['query'];
    $solrURL = "http://10.0.0.125:8983" .
        "/solr/getar-cs5604f17_shard1_replica1/select?q=";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $solrURL . $param);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "GET");
    curl_setopt($curl, CURLOPT_USERAGENT,
        "mozilla/5.0 (ipad; cpu os 7_0_4 like mac os x) " .
        "applewebkit/537.51.1 (khtml, like gecko) " .
        "version/7.0 mobile/11b554a safari/9537.53");

    $result = curl_exec($curl);
    curl_close($curl);

    // Print the Solr response so that the front end receives it.
    echo $result;

PHPSocialNetwork_Jena.php performs functions similar to PHPRouter.php, but it gets the JSON object from the Jena API, and the requests are executed in parallel. Instead of using curl_exec, PHPSocialNetwork_Jena.php uses the curl_multi_getcontent method to get the connections. PHPSocialNetwork_Jena.php works as follows:

1. Based on the Solr query sent by SOLAPI.js, query the Solr API to get the unique user IDs.

2. For each unique ID, form RDF query URLs to retrieve all its mentions. This step is the "map" step of the Map-Reduce structure. It creates all the URLs and stores them in an array. All URLs in the array are sent to the Jena server at the same time. This for loop does the task:

    <?php
    // PHPSocialNetwork_Jena.php
    foreach ($ids as $id => $value) {
        $q = "prefix sub: <http://example.org/> " .
            "prefix pred:<http://xmlns.com/SNR/0.1/> " .
            "SELECT ?" . $id . " WHERE {?" . $id . " " .
            "pred:mentions \"" . $id . "\"} LIMIT 20000";
        // encodeURIComponent() is not a PHP built-in; it is assumed here to be
        // a helper defined elsewhere in the project.
        array_push($urls, $jenaURL . encodeURIComponent($q));
    }

3. The multiCurlJena method sends all URLs in parallel and collects the returned JSON objects. This step is the "reduce" step, which associates all the returned results.

4. Iterate through all connections, and create the data needed by the social network graph in SocialNetworkGraph.js.

Chapter 6

Acknowledgement

Thanks go to the US National Science Foundation for supporting Global Event and Trend Archive Research (GETAR) through IIS-1619028. We are thankful to our professor, Edward Fox, and the Graduate Teaching Assistant, Liuqing Li, who provided expertise that greatly assisted the research.

We are also grateful to Abhinav K. for assistance with connecting to Solr, even during the Thanksgiving break, and to the SOLR team members and their leader, Jeff Robertson, who showed great communication skills and good teamwork with us.

We also express our appreciation to the Twitter team for sharing their pearls of wisdom with us during the course of this research. We are also immensely grateful to the FE team of Spring 2017 for their research, paper, and comments on earlier versions of the FE team's work.

Finally, we would like to thank all of the team members of CS5604 in Fall 2017 for their efforts. They supported our work throughout the semester and helped us get results of better quality.

Appendix A

Installation Links

A.1 Install UbuntuOS

https://tutorials.ubuntu.com/tutorial/tutorial-install-ubuntu-desktop?_ga=2.22244671.582714899.1506203717-2080495813.1498138177#11

A.2 Install Ubuntu 16.04 LTS on VirtualBox/Windows 10

https://www.youtube.com/watch?v=DPIPC25xzUM

A.3 Set Up for Ruby On Rails on Ubuntu 16.04

https://gorails.com/setup/ubuntu/16.04

A.4 Vim Installation

https://www.youtube.com/watch?v=iIS8wNNNCPY

A.5 Vim Commands

http://www.radford.edu/~mhtay/CPSC120/VIM_Editor_Commands.htm


A.6 Vim for Editing Files in Linux

https://www.youtube.com/watch?v=ImK_dHPOTIE

A.7 Download Solr sample data

https://gist.Githubusercontent.com/mejackreed/84abc598927c43af665b/raw/geoblacklight-documents.json


Appendix B

Error Solutions

B.1 SQL Error Solutions

B.1.1 Access denied for user ‘root’@‘localhost’

For this error, you can solve this by using the following commands. [23]

    $ sudo mysql -u root

    mysql> USE mysql;
    mysql> FLUSH PRIVILEGES;
    mysql> exit;

    $ sudo service mysql restart

B.2 GeoBlacklight Error Solutions

B.2.1 core already exists

For this error, clean Solr:

    $ rake solr:clean

Appendix C

Other Resources

C.1 GeoBlacklight Schema GitHub

https://Github.com/geoblacklight/geoblacklight/wiki/Schema


Bibliography

[1] Pablo Aragón, Karolin Eva Kappler, Andreas Kaltenbrunner, David Laniado, and Yana Volkovich. 2013. Communication dynamics in Twitter during political campaigns: The case of the 2011 Spanish national election. Policy & Internet 5, 2: 183–206. doi: https://doi.org/10.1002/1944-2866.POI327

[2] Sandjai Bhulai, Peter Kampstra, Lidewij Kooiman, Ger Koole, Marijn Deurloo, and Bert Kok. 2012. Trend Visualization on Twitter: What’s Hot and What’s Not? In 1st International Conference on Data Analytics, 43–48.

[3] David Carmel, Erel Uziel, Ido Guy, Yosi Mass, and Haggai Roitman. 2012. Folksonomy-Based Term Extraction for Word Cloud Generation. ACM Trans. Intell. Syst. Technol. 3, 4: 60:1–60:20. doi: https://doi.org/10.1145/2337542.2337545

[4] Adrien Guille and Cécile Favre. 2015. Event detection, tracking, and visualization in Twitter: a mention-anomaly-based approach. Social Network Analysis and Mining 5, 1: 18.

[5] Lars Lischke, Jan Hoffmann, Robert Krüger, Patrick Bader, Paweł W. Wozniak, and Albrecht Schmidt. 2017. Towards Interaction Techniques for Social Media Data Exploration on Large High-Resolution Displays. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’17), 2752–2759. doi: https://doi.org/10.1145/3027063.3053229

[6] Brett Meyer, Kevin Bryan, Yamara Santos, and Beomjin Kim. 2011. TwitterReporter: Breaking News Detection and Visualization through the Geo-Tagged Twitter Network. In CATA, 84–89.

[7] Teresa Onorati, Petra Isenberg, Anastasia Bezerianos, Emmanuel Pietriga, Paloma Diaz. WallTweet: A Knowledge Ecosystem for Supporting Situation Awareness. ITS Workshop on Data Exploration for Interactive Surfaces (DEXIS). 2015.

[8] Broder, Andrei Z., Nadav Eiron, Marcus Fontoura, Michael Herscovici, Ronny Lempel, John McPherson, Runping Qi, Springer, and Eugene Shekita. 2006. “Indexing Shared Content in Information Retrieval Systems.” Lecture Notes in Computer Science, Springer, no. 3896: 313.

[9] Fox, Edward A. 1980. “Lexical Relations: Enhancing Effectiveness of Information Retrieval Systems.” ACM SIGIR Forum 15 (3): 5–36. doi:10.1145/1095403.1095404.

[10] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York: Cambridge University Press. https://nlp.stanford.edu/IR-book/.

[11] Michael K. Buckland, and Christian Plaunt. 1994. “On the Construction of Selection Systems.” Library Hi Tech 12 (4): 15–28. doi:10.1108/eb047934.

[12] Goulding, Christina. Grounded theory: A practical guide for management, business and market researchers. Sage, 2002.

[13] Tamara Munzner. Visualization Analysis and Design. A K Peters Visualization Series, CRC Press, 2014.

[14] Ben Shneiderman. “The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations.” In Proceedings of the IEEE Conference on Visual Languages, pp. 336–343. IEEE Computer Society, 1996.

[15] Bostock M, Ogievetsky V, Heer J. D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 2011, 17(12): 2301-2309.

[16] Adam Marcus, Michael S. Bernstein, Osama Badar, David R. Karger, Samuel Madden, and Robert C. Miller. 2011. Twitinfo: aggregating and visualizing microblogs for event exploration. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11). ACM, New York, NY, USA, 227-236. DOI: https://doi.org/10.1145/1978942.1978975

[17] Marcos André Gonçalves, Edward A. Fox, Layne T. Watson, and Neill A. Kipp. 2004. Streams, structures, spaces, scenarios, societies (5S): A formal model for digital libraries. ACM Trans. Inf. Syst. 22, 2 (April 2004), 270-312. DOI=http://dx.doi.org/10.1145/984321.984325

[18] Marcia J. Bates, (1989) "The design of browsing and berrypicking techniques for the online search interface", Online Review, Vol. 13 Issue: 5, pp. 407-424, https://doi.org/10.1108/eb024320

[19] Zhou, Yuchao et al. “Search Techniques for the Web of Things: A Taxonomy and Survey.” Ed. Yunchuan Sun, Antonio Jara, and Shengling Wang. Sensors (Basel, Switzerland) 16.5 (2016): 600. PMC. Web. 20 Dec. 2017.

[20] Nadia Boukhelifa, Jonathan C. Roberts and Peter J. Rodgers, "A coordination model for exploratory multiview visualization," Proceedings International Conference on Coordinated and Multiple Views in Exploratory Visualization - CMV 2003 -, 2003, pp. 76-85. doi: 10.1109/CMV.2003.1215005

[21] David Carmel, Erel Uziel, Ido Guy, Yosi Mass, and Haggai Roitman. 2012. Folksonomy-Based Term Extraction for Word Cloud Generation. ACM Trans. Intell. Syst. Technol. 3, 4, Article 60 (September 2012), 20 pages. DOI=http://dx.doi.org/10.1145/2337542.2337545

[22] Osbourn, Toby. “What Is a Gemfile.” Tosbourn Ltd. – London Based Ruby, JavaScript, and Elixir Developers, 18 July 2015, tosbourn.com/what-is-the-gemfile/. Accessed 15 Oct. 2017.

[23] "ERROR 1698 (28000): Access Denied for User ’root’@’localhost’." Mysql - ERROR 1698 (28000): Access Denied for User ’root’@’localhost’ - Stack Overflow. https://stackoverflow.com, 13 Mar. 2017. Accessed 15 Oct. 2017.

[24] Reed, Jack. Index Solr Documents - Part 4 - GeoBlacklight Workshop. http://geoblacklight.org, 9 Feb. 2015. Accessed 15 Oct. 2017.

[25] Events Archive Invitation, Funding. (2017). Retrieved December 20, 2017, from http://www.eventsarchive.org/

[26] Solr. http://lucene.apache.org/solr/, Accessed 12 Dec. 2017

[27] Jena. https://jena.apache.org/, Accessed 12 Dec. 2017

[28] DLRL Hadoop cluster. http://hadoop.dlib.vt.edu/, Accessed 4 Dec. 2017

[29] Bostock, M. (n.d.). Data-Driven Documents. Retrieved December 20, 2017, from http://d3js.org/

[30] Mark Otto, Jacob Thornton, and Bootstrap contributors. (n.d.). Bootstrap. Retrieved December 20, 2017, from https://getbootstrap.com/

[31] Jquery.org, J. F. (n.d.). Retrieved December 20, 2017, from https://jquery.com/

[32] J. (2017, January 12). Jasondavies/d3-cloud. Retrieved December 20, 2017, from https://Github.com/jasondavies/d3-cloud

[33] TopoJSON. (n.d.). Retrieved December 20, 2017, from https://Github.com/topojson

[34] Resource Description Framework (RDF): Concepts and Abstract Syntax (n.d.). Retrieved December 23, 2017, from https://www.w3.org/TR/rdf-concepts/
