a. Identification and Significance of the Problem or Opportunity

Given a large set of documents such as grant applications or scientific publications, how does one quickly gain an understanding of the key information they contain? This problem is faced by medical researchers, lawmakers and administrators alike. Portfolio analysis tools at NIH can reveal patterns in grant proposals and support statistical analyses of trends and other analytical metrics. We propose to construct a system that allows analysts, administrators and policy-makers to access and view the entire research portfolio as a large-scale, simple, intuitive map where they may zoom in to the level of individual grants and then zoom out to the level of whole institutes.
During the 2007 fiscal year, NIH’s budget was $29.128 billion (see http://report.nih.gov/). For 2007 alone, over 60,000 individual project grants are listed in the CRISP database (which includes government funding agencies in addition to NIH) (http://crisp.cit.nih.gov/), ranging from basic research into the mechanisms of cancer to new teaching courses. The ability to understand where public funding is being spent is a vital element of oversight within the government, but is complicated by the fact that each grant must be classified based on a hand-crafted scheme. This has led to the development of the Research, Condition and Disease Categorization (RCDC, http://rcdc.nih.gov/), a comprehensive breakdown of the most important fields being funded. RCDC is both important and valuable, and is strictly defined as 215 high-level categories to be used to classify NIH spending. Although some subjects (like ‘cancer’) command billions of dollars of funding and others receive far less, each high-level topic label conceals a wealth of complexity about underlying research trends and topics in the field. We will build tools to analyze and visualize this complexity in a way that is intuitive, accessible, quantitative and scalable.

Figure 1: Screenshot mockup of proposed application. This web-based system provides a navigable map of a document collection.

Figure 1 shows a screenshot from the application we will build: a web-based system that provides a simple navigable map of a document collection. This mockup is based on an existing prototype (see §c) and is only one possible product of our technology. To let users interact with this application through an ordinary web browser, the application will generate parts of the visualization on the server as static images and augment them in the browser with additional metadata and quantitative metrics.
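The server-side tiling just described can be sketched in a few lines. The 256-pixel tile size and the (zoom, column, row) addressing below are illustrative assumptions, not details taken from the proposal:

```python
# Sketch of a server-side tiling scheme: the rendered map image is cut into
# fixed-size square tiles at several zoom levels, and the browser requests
# only the tiles covering the current viewport.

TILE_SIZE = 256  # pixels per tile edge (a common convention, assumed here)

def tiles_for_viewport(x, y, width, height, zoom):
    """Return the (zoom, col, row) keys of the tiles covering a viewport.

    (x, y) is the top-left corner of the viewport in map pixels at this zoom.
    """
    first_col = x // TILE_SIZE
    first_row = y // TILE_SIZE
    last_col = (x + width - 1) // TILE_SIZE
    last_row = (y + height - 1) // TILE_SIZE
    return [(zoom, c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]

# A 512x512 viewport at a tile-aligned origin needs a 2x2 block of tiles.
print(tiles_for_viewport(0, 0, 512, 512, zoom=3))
```

The browser then fetches only those static images from the server, which is what keeps the interface usable in an ordinary browser even for very large maps.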
b. Technical Objectives

We will build a prototype web-application that provides definitive proof-of-concept for the vision described above. This system will be able to ingest thousands to millions of documents and generate scalable maps that may be explored using any browser (see Figure 2). Our application will permit users to explore and understand the structure, content, and hidden relationships in many different types of documents: in-process applications, previously funded grants, review papers, primary research articles, medical records, MEDLINE abstracts, patents, web pages, and more.
Our specific objectives are:
a. Research and develop a robust, intuitive user interface that enables a user to generate meaningful visualizations
b. Research and characterize effective algorithms and data processing frameworks for visualizing biomedical research data in network format
c. Develop a visualization prototype web-application and Application Programming Interface (API)
d. Review the strengths and weaknesses of the prototype needing improvement in Phase II
Figure 2: Diagram showing the high-level technical architecture of the proposed application.
Phase I of this project will conclude with a fully functional prototype web application for visualizing and analyzing biomedical research data, with a robust user interface designed around information gathered from NIH researchers and portfolio analysts. In addition, we will deliver a relational database and XML data model, an Application Programming Interface (API), and a report summarizing the feedback of these NIH analysts and researchers on the strengths and deficiencies of the prototype.
c. Work/Research Plan

1. Research the design of a User Interface (UI) that allows a user to select and view data of interest using derived parametric data. In the first stage, we will gather information about users and their tasks in order to design an effective user interface to support exploration of biomedical research data. Procedure:
1. Interview four to six NIH researchers and policy analysts whose job responsibilities involve the 'areas of interest' in the topic announcement in order to gain a deeper understanding of their problems and needs.
2. Generate Use Cases from these interviews that describe specific ways of using the visualization tool to understand biomedical research or grant portfolios.
3. Survey current literature dealing with human-computer interaction and graphic design to identify specific guidelines that will be incorporated into the design of the user interface for this tool.
4. Design user interface (UI) mock-ups that encompass the needs found during the interview process and refine them iteratively in cooperation with said researchers.
The outcome of this stage will be: (1) a set of Use Cases defining the user requirements; (2) a set of UI mock-ups as a result of step 4; and (3) technology recommendations for implementing such a UI. The benchmark needed to move on to the next stage will be agreement between NIH researchers and project personnel that the generated UI mock-ups sufficiently demonstrate how the user interface will fulfill the Use Cases generated in step 2. The ChalkLabs team consisting of Shashikant Penumarthy, Bruce Herr, and Gavin LaRowe, in conjunction with Dr. Burns, will be responsible for executing this task. Work will be performed at ChalkLabs’ offices in Indiana using the internal development infrastructure as well as on site at the offices of the NIH researchers. ChalkLabs personnel will travel to the offices of NIH researchers and observe them at work to understand the environment in which they work, the tools they use and other specific issues relevant to the design of the software application.

2. Design and develop a data model that can be used to store, query, and transform biomedical research data in order to generate visualizations (Weeks 5-8, Location: ChalkLabs, Bloomington, IN). Since the data model of this system will be network-based, it must be both general enough to handle variations in structure and efficient enough to support the kinds of analysis needed for visualization. Procedure:
1. Survey biomedical research data sets such as grant applications to gain an understanding of commonly used data formats and attributes of those documents that may be important for exploring and understanding how they are related to each other.
2. Survey commonly used analysis algorithms that build networks from unstructured data. Although Phase I will only use the Topic Model, this step is important as it will provide knowledge needed to encompass other algorithms for analysis in the future.
3. Create a normalized conceptual data model that can represent the different types of networks and their attributes. This data model will subsequently be used to determine an initial Application Programming Interface (API) for Phase II.
4. Test the expressiveness and efficiency of the data model by loading test data into a database and performing frequently used network analysis and visualization operations on it.
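The kind of relational model and "frequently used network operation" exercised in step 4 might look something like the following sketch. The table and column names are invented for illustration and are not the schema this stage will produce:

```python
# Minimal sketch of a relational data model for networks: nodes, weighted
# edges, and a key/value attribute table so the schema tolerates variation
# in document metadata. All names here are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id    INTEGER PRIMARY KEY,
    label TEXT NOT NULL
);
CREATE TABLE edge (
    source INTEGER NOT NULL REFERENCES node(id),
    target INTEGER NOT NULL REFERENCES node(id),
    weight REAL    NOT NULL DEFAULT 1.0
);
CREATE TABLE node_attr (
    node_id INTEGER NOT NULL REFERENCES node(id),
    key     TEXT NOT NULL,
    value   TEXT
);
""")

conn.executemany("INSERT INTO node (id, label) VALUES (?, ?)",
                 [(1, "grant A"), (2, "grant B"), (3, "grant C")])
conn.executemany("INSERT INTO edge (source, target, weight) VALUES (?, ?, ?)",
                 [(1, 2, 0.9), (2, 3, 0.4)])

# A typical network operation: the weighted degree of each node, expressed
# directly against the schema (no application-side graph traversal needed).
rows = conn.execute("""
    SELECT n.label, COALESCE(SUM(e.weight), 0) AS degree
    FROM node n
    LEFT JOIN edge e ON e.source = n.id OR e.target = n.id
    GROUP BY n.id ORDER BY n.id
""").fetchall()
print(rows)
```

The benchmark for this stage is exactly that such operations stay this easy to express as the networks and their attributes vary.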
The outcome of this stage will be both an XML and a relational database model for representing biomedical research networks. The benchmark will be that common network operations performed in step 4 are easily expressed in terms of the data model developed. The ChalkLabs team consisting of Shashikant Penumarthy, Bruce Herr, and Gavin LaRowe will be responsible for executing this task. All work will be performed at ChalkLabs’ offices in Indiana using the internal development infrastructure.

3. Research and develop a visualization and interaction model. The ‘visualization model’ defines the visualization design space by determining the available primitives, such as shapes, and the visual properties acting on those shapes (such as color and size). It is independent of the data model, but is designed so that the data model’s most important features can be visualized without resorting to complex transformations. The interaction model is then layered on top of the visualization model and defines the application’s response to user actions. Both the visualization and interaction models evolve to meet the changing needs of the data and users. Procedure:
1. Survey existing designs of network visualizations in scientific publications, visualization applications and visualization weblogs.
2. Identify visual primitives required to visualize networks based on the database schema developed in stage 2 and the UI research performed in stage 1.
3. Design a visualization model to accommodate variations in network structure and visual attributes while incorporating the recommendations formulated as a result of UI research performed in stage 1.
4. Design an interaction model that incorporates ideas from existing interaction metaphors as well as introduces new ones based on their perceived ease of use and feasibility of implementation.
The outcome of this stage will be a description of objects in the visualization system and their interactions that can be translated into source code by programmers. The benchmark needed to move on to the next stage will be that the visualization and interaction models integrate well with the user interface designed in stage 1. The ChalkLabs team consisting of Shashikant Penumarthy, Bruce Herr, and Gavin LaRowe will be responsible for executing this task. All work will be performed at ChalkLabs’ offices in Indiana using the internal development infrastructure.

4. Integrate the Topic Model algorithm to analyze and generate similarity networks for visualization. Topic Models are unsupervised models for generating topics from document collections (Griffiths and Steyvers, 2004), which may then be processed to provide visualizations. topicSeek LLC already has an efficient implementation of the Topic Model that will be used as part of the visualization tool. Procedure:
1. Identify the requirements for running Topic Model on a database and modify the programming interfaces appropriately.
2. Implement the necessary data converters to feed data to the algorithm from the database and store results of the analysis back into the database.
3. Test and optimize the implementation using data sets of varying sizes to determine deficiencies and strengths.
The outcome of this stage will be a working implementation of the Topic Model that can use the efficient data model developed during stage 2 to perform text analysis. Dr. Newman will be responsible for developing the code for the Topic Modeling algorithm as well as the topics and similarity networks to be used for the subsequent visualizations. The benchmark needed to move on to the next stage will be that the model correctly analyzes thousands to millions of documents to produce topics that make sense. The ChalkLabs team consisting of Shashikant Penumarthy, Bruce Herr, and Gavin LaRowe will be responsible for integrating Dr. Newman’s work product into the visualization platform. Dr. Newman will perform his work in an internal development environment at topicSeek’s offices in California. Work by the ChalkLabs team will be performed at ChalkLabs’ offices in Indiana using the internal development infrastructure.
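To illustrate what a similarity network derived from topic-model output might be, here is a minimal sketch that turns per-document topic mixtures p(t|d) into a weighted document-similarity network. Cosine similarity and the 0.5 edge threshold are illustrative choices, not decisions made in this proposal, and the mixture values are invented:

```python
# Sketch: build a similarity network from per-document topic mixtures.
# Documents whose topic mixtures point in similar directions get an edge.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_edges(mixtures, threshold=0.5):
    """mixtures: {doc_id: [p(t=1|d), ..., p(t=T|d)]} -> weighted edge list."""
    docs = sorted(mixtures)
    return [(d1, d2, round(cosine(mixtures[d1], mixtures[d2]), 3))
            for i, d1 in enumerate(docs)
            for d2 in docs[i + 1:]
            if cosine(mixtures[d1], mixtures[d2]) >= threshold]

mixtures = {
    "grant1": [0.8, 0.1, 0.1],   # mostly topic 1
    "grant2": [0.7, 0.2, 0.1],   # also mostly topic 1 -> linked to grant1
    "grant3": [0.0, 0.1, 0.9],   # mostly topic 3 -> no edge to the others
}
print(similarity_edges(mixtures))
```

The resulting weighted edge list is exactly the kind of network the data model of stage 2 stores and the visualization pipeline lays out as a map.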
5. Research and develop the server-side analysis and rendering pipeline. Here, we will automate and streamline the process of generating new visualizations. It will become possible to pipe many different data sets through the system and to analyze and visualize them without the need for human intervention. Such automation will enable the creation of a “live” dashboard showing a map of, for example, the entire space of cancer research, updated periodically as new data becomes available. Procedure:
1. Identify the data format of inputs and outputs expected by each stage of the pipeline shown in Figure 2.
2. Design programmatic interfaces between stages of the pipeline for two-way transmission of data, command parameters and status messages, and develop test cases to determine conformance of the implementation to these interfaces and to the data formats identified in step 1.
3. Create programs that implement the interfaces designed in step 2 and create tools to manage the flow of data through the pipeline.
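The fully automated pipeline described above can be sketched as a chain of stages sharing one programmatic interface, with status reported between stages. The stage names and bodies below are stand-ins, not the actual analysis or rendering code:

```python
# Sketch of a staged pipeline: each stage consumes the previous stage's
# output and the driver reports status, so the whole chain runs unattended.

def ingest(state):
    # Stand-in for data ingestion from flat files or a database.
    state["documents"] = ["doc about cancer", "doc about stroke"]
    return state

def analyze(state):
    # Stand-in for the Topic Model analysis stage.
    state["topics"] = [doc.split()[-1] for doc in state["documents"]]
    return state

def render(state):
    # Stand-in for the stage that renders static tiled images.
    state["tiles"] = [f"tile_{t}.png" for t in state["topics"]]
    return state

def run_pipeline(stages, state=None):
    """Run the stages in order, logging status between them (the two-way
    status reporting of step 2, reduced here to a print)."""
    state = state or {}
    for stage in stages:
        state = stage(state)
        print(f"finished {stage.__name__}: keys now {sorted(state)}")
    return state

result = run_pipeline([ingest, analyze, render])
```

Because every stage exposes the same interface, new data sets (or new analysis stages) can be piped through without human intervention, which is what makes the periodically updated "live" dashboard possible.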
The outcome of this stage will be a software implementation of a server-side rendering pipeline that accepts data (in flat files or databases) and a set of parameters as input, and generates a set of static tiled images corresponding to the visualization as output. The benchmark needed to move on to the next stage will be that the pipeline can run from start to finish in a completely automated manner. The ChalkLabs team consisting of Shashikant Penumarthy, Bruce Herr, and Gavin LaRowe will be responsible for executing this task. All work will be performed at ChalkLabs’ offices in Indiana using the internal development infrastructure.

6. Develop the prototype web-application and Application Programming Interface (API). At this stage, a proof-of-concept web-application that runs in the browser will be built to demonstrate the desired components and behavior as specified during stages 2 and 4. Procedure:
1. Identify life-cycle events that orchestrate the creation, update and deletion of objects that constitute a running prototypical web-based application.
2. Architect interfaces between client and server with respect to data, meta-data, command parameters, results and error responses while adhering to established standards in web-application development.
3. Develop test cases that will be used to determine the conformance of the implementation to the architecture developed in step 2.
4. Iteratively develop and test the client-side prototype.
5. Develop an Application Programming Interface (API) that can be used for future development.
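The client-server exchanges architected in step 2 could use a uniform message envelope along these lines; the JSON field names are assumptions for illustration, not a specification from the proposal:

```python
# Sketch of a client/server message envelope: every response carries the
# command parameters that produced it plus either a result or an error,
# so the client can correlate responses with requests and handle failures
# uniformly. Field names are illustrative assumptions.
import json

def make_response(result=None, error=None, params=None):
    """Wrap a result (or error) together with its originating parameters."""
    return json.dumps({
        "params": params or {},
        "result": result,
        "error": error,
    })

ok = make_response(result={"tiles": ["3/0/0.png"]}, params={"zoom": 3})
bad = make_response(error="unknown dataset", params={"dataset": "x"})

print(json.loads(ok)["result"])
print(json.loads(bad)["error"])
```

Keeping data, metadata, parameters, results and error responses in one well-defined envelope is what lets the conformance test cases of step 3 be written against the interface rather than against the implementation.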
The outcome of this stage will be a fully functional web-based prototype that can be used to explore and understand large document collections. The working software itself will be the benchmark that must be met in order to move on to the next stage. The ChalkLabs team consisting of Shashikant Penumarthy, Bruce Herr, and Gavin LaRowe will be responsible for executing this task. All work will be performed at ChalkLabs’ offices in Indiana using the internal development infrastructure and Amazon’s EC2 & S3 cloud computing infrastructure.

7. Evaluate the usability of the prototype in the context of the Use Cases developed during stage 1. At this stage, ChalkLabs will test the application with the researchers interviewed during stage 1 to evaluate how well the software satisfies the Use Cases they helped to develop. This serves several purposes: (1) it allows the researchers to revisit the application and make recommendations regarding whether certain features they deemed important were, in fact, useful or whether a different approach is needed; (2) it provides valuable feedback about the software to the builders of the tool; and (3) it may suggest other topics needing further exploration during Phase II of the development of this tool. Procedure:
1. Interview a few researchers and prepare an informal usability study consisting of tasks that researchers should be able to perform with this tool.
2. Allow researchers to use the tool for the specified tasks using pre-loaded data, and record observations on all aspects of their usage of the tool.
The outcome of this stage will be a report describing what worked well, what didn’t work, and recommendations for improvements to be addressed during Phase II. Dr. Burns, Dr. Newman, and the
ChalkLabs team consisting of Shashikant Penumarthy, Bruce Herr, and Gavin LaRowe will be responsible for executing this task. All work will be performed at the respective offices of all parties.

As an Indiana-based company, ChalkLabs is eligible for a dollar-for-dollar match on all SBIR Phase I awards from the Indiana Economic Development Corporation’s 21st Century Fund. These funds, up to $100,000, will be used to further the objectives outlined in the above work plan. A letter of support is attached.
Objective | Responsible Party | Benchmark
R&D for intuitive User Interface (step 1) | Dr. Burns, ChalkLabs | UI mockups sufficiently capture user requirements.
R&D for algorithms and data frameworks (steps 2-4) | Dr. Newman, ChalkLabs | Topic Model integrates with the data model; visualization and interaction models integrate well with the UI design.
Prototype and API development (steps 5-6) | ChalkLabs | Software works as per specification and produces meaningful visualizations.
Review strengths & weaknesses (step 7) | Dr. Burns, ChalkLabs | Specific problems and possible solutions are identified for Phase II.

Table 1: Timeline (months 1-6) and responsibility of actions related to objectives.
d. Related Research

Topic Modeling

In the so-called ‘topic model’, a topic is a multinomial probability distribution over the W unique words in the vocabulary: in essence, a W-sided die that we roll to generate words. Thus, each topic t = 1:T is a multinomial probability vector p(w|t), and there are T topics in total. A document is represented as a finite mixture of the T topics. Each document d = 1:D is assumed to have its own set of mixture coefficients p(t|d), another multinomial probability vector, but over topics. Thus, a randomly selected word from document d has a conditional distribution p(w|d) that is a mixture over topics, where each topic is a multinomial over words:

p(w|d) = Sum_t p(w|t) p(t|d)

The model therefore treats each document simply as a probabilistic collection of topics, where each topic is simply a probabilistic collection of words. This simple approach means that the time required to compute the topic model is linear in the number of documents, D, and the number of topics, T. David Newman has demonstrated the scalability of topic modeling by showing how the learning algorithm can be distributed onto a parallel computer. Using this distributed algorithm, the team computed topic models for the entire PubMed collection (containing D = 8.2 million abstracts) in only a few days.
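The mixture formula above can be checked numerically with a toy example; the two topics, three-word vocabulary, and all probability values below are invented for illustration:

```python
# Toy numerical check of the topic-model mixture p(w|d) = sum_t p(w|t) p(t|d).

vocab = ["tumor", "neuron", "cell"]

# p(w|t): each topic is a multinomial over the vocabulary (each row sums to 1).
p_w_given_t = [
    [0.7, 0.0, 0.3],   # topic 0: oncology-flavored
    [0.1, 0.6, 0.3],   # topic 1: neuroscience-flavored
]

# p(t|d): this document's mixture over the T topics (sums to 1).
p_t_given_d = [0.5, 0.5]

# p(w|d) = sum_t p(w|t) * p(t|d), one value per vocabulary word.
p_w_given_d = [
    sum(p_w_given_t[t][w] * p_t_given_d[t] for t in range(len(p_t_given_d)))
    for w in range(len(vocab))
]

# The mixture is itself a probability distribution over the vocabulary.
print({w: round(p, 2) for w, p in zip(vocab, p_w_given_d)})
```

Because evaluating the mixture touches each document and each topic once, the linear scaling in D and T claimed above is visible directly in the nested loop.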
Large-Scale Network Visualization

Recent research on large-scale network visualization includes efforts to enhance information density (Bederson, Shneiderman & Wattenberg, 2002), improve space utilization (van Ham & van Wijk, 2003) and improve exploration (Plaisant, Grosjean & Bederson, 2002). Several ways of laying out networks have been developed (Di Battista, Eades, Tamassia & Tollis, 1998), but force-directed placement methods like the Fruchterman-Reingold algorithm (Fruchterman & Reingold, 1991) are used most often. An extensive review of network analysis and visualization can be found in Börner, Sanyal & Vespignani (2007). Many tools offer the capability to visualize networks, but only a few are designed to handle networks with more than a few thousand nodes. Some examples are: Pajek (Batagelj & Mrvar, 1998); Large Graph Layout, or LGL (Adai, Date, Wieland & Marcotte, 2004); the Boost Graph Library (Siek, Lee & Lumsdaine, 2002); GraphViz (Ellson, Gansner, Koutsofios, North & Woodhull, 2003); and LaNet-vi (Alvarez-Hamelin, Dall'Asta, Barrat & Vespignani, 2005). This is only a small sample of the tools that offer network visualization; it is impossible to list them exhaustively. Several online resources (Moere, 2007; Lima, 2007) are dedicated to tracking, categorizing and critiquing new visualizations as they emerge. vxInsight (Davidson, Hendrickson, Johnson, Meyers & Wylie, 1998) and its successor, the DrL system (Martin et al., 2007), can visualize network relations in large collections of unstructured documents; DrL is the system we used to generate the views shown in our preliminary proofs of concept.
Existing Proofs-of-Concept

The proposed work is partially based on existing prototypes that provide a proof-of-concept of both the output functionality and the feasibility of the underlying computational techniques. At present, our academic partners have generated three separate topic-map web applications, currently hosted at SciMaps.org, the ‘Places and Spaces: Mapping Science’ website, courtesy of Katy Börner’s InfoVis lab at Indiana University. These initial systems were constructed as an academic collaboration between Dr. Burns, Dr. Newman, and Mr. Herr and now provide a functional demonstration of the utility of this approach. Each application was developed using approximately 33% of each participant’s time over three months (initially under internal funding from ISI and then under a renewable contract from NINDS). These systems currently provide a proof-of-concept application for members of the scientific community as well as a practical demonstration of our end-goals for the purposes of this proposal. The systems are:
http://scimaps.org/maps/neurovis/: The first application provided a single view of the Society for Neuroscience (SfN) annual meeting in Atlanta. We mapped 12,000 abstracts from the 2006 meeting and presented the map at the 2007 conference [Burns et al., 2007].
http://scimaps.org/maps/ninds/: We were approached by Dr. Edmund Talley of NINDS to construct a similar interface designed to provide a simple map of CRISP data for public use. We analyzed NINDS’s 2006 portfolio in relation to CRISP data from 14 neuroscience-related institutes.
http://scimaps.org/maps/2007/nih/: We continued the analysis of CRISP data to tackle a more challenging problem: to develop a topic-map application for all NIH proposals for 2007.
The utility of these tools is shown in Figure 3: by zooming into our current pre-release prototype, we reveal the relationships between a set of clusters labeled by terms relating to blood: ‘Hemodynamics’, ‘NOS signalling’, ‘Cardiac Failure’, etc. A simple visual inspection shows that clusters of grants relating to neurological implications of blood-related disorders (ischemia and stroke) are funded by NINDS, while the surrounding grants are funded by NHLBI. Even though this functionality is currently available from these applications, it still requires the local expertise of our team to run and debug the procedure to analyze the text, render the maps, place the labels, etc.

Figure 3: Screenshot of the development prototype to demonstrate navigation of grant abstracts, showing the relationship between NHLBI and NINDS grants relating to blood.

The degree of automation in our current infrastructure is not sufficient to allow other people to build their own mapping applications. Our SBIR application is specifically geared to generating an environment where NIH staff members can generate maps of grants that have not yet been funded and are therefore highly confidential documents. This requires a system that can be deployed directly behind NIH’s firewall without the need for specialized analysts to run the system. To make this a viable reality, we must develop our toolset from its current status as a set of research prototypes into a fully engineered software application. This vital development work would be best undertaken within a commercial environment.
e. Relationship with Future R&D

The outcome of Phase I will be a functional web-based prototype that utilizes text analysis, an innovative tile-based visualization system and hybrid client-server rendering to provide interactive visualizations of large-scale networks. Such a tool does not exist today and, if successful, this prototype is expected to be a major step towards democratizing large-scale visualization. The prototype tool developed in Phase I will be limited to a single analysis algorithm (Topic Modeling), one type of layout and pre-defined sets of visual attribute mappings. The prototype will require significant development work before new kinds of analyses, transformations, mappings and visualizations can be incorporated into the tool. Such a collection of tools and methods cannot be developed by a single entity, but must be the outcome of involvement of the research community. Therefore, the tool must provide the ability to plug different components into the system and access them in a unified manner. Deploying such a tool in a production environment will also require testing for performance and resilience. The Phase I plan will touch upon every aspect of the research, design, development and deployment process, thus providing the tools, knowledge and experience needed to tackle the much larger problem of commoditizing a large-scale visualization tool for the use of the general public.

In Phase II, the ChalkLabs team will take advantage of distributed and cloud-computing technologies and emerging web technologies to create a highly interactive, user-centric web application that will enable end users to upload their own data, perform many kinds of analyses, experiment with many kinds of layout, customize their visualizations to an unprecedented degree and collaborate with others over the web. ChalkLabs’ personnel have a unique set of skills that is extremely well suited to building such an application. The experience gained during Phase I will therefore be invaluable towards building a rich, interactive large-scale visualization tool.
f. Potential Commercial Applications

Being able to build intuitive maps of large-scale textual corpora is a universal application extending far beyond the niche market provided by NIH program officers and members of the public interested in funded NIH grants. We will approach companies that house and distribute large corpora of scientific text and provide them with a novel way of allowing their users to view, navigate and analyze their assets using our system. Potential customers include:
a) NIH and NSF administrators, who need assistance managing their portfolios. This is a challenge that goes beyond the current constraints of our system due to the size and turnaround of unfunded proposals (usually at least five times more numerous than funded proposals).
b) Scientific publishers, such as Elsevier Science, Wiley, Academic Press and other large-scale scientific publishers, who may examine the possibility of providing maps of their internal inventories.
c) Organizers of scientific conferences must usually categorize and organize large numbers of presentations in a short space of time between the submissions and the preparation of a program. Our system could alleviate the burden on reviewers.
d) Digital scientific libraries of specific universities may need to catalogue and organize their local holdings.
e) Other text repositories, involving email, Web pages or miscellaneous business documents would also benefit from using the tool to generate rapid visual indexes of their collections.
Two main barriers to entry exist for this market. First, the use of network analysis and network visualization is not widespread in the commercial sector because it is a relatively nascent field. Second, the scope of technical processes and expertise needed to take advantage of network analytic techniques is vast, spanning network analysis, statistics, and information visualization. We believe the web-based nature of the tool will greatly reduce resistance to its adoption because no cumbersome download and installation will be required. Other strategies for overcoming these barriers include:
a. Creating embeddable interactive views of large text corpora from existing websites such as Wikipedia, or developing plug-ins for popular sites such as YouTube or Google Maps.
b. Offering free interactive dashboards that periodically update their views as new data becomes available.
c. Providing free licenses within the academic community to spread awareness of large-scale network visualization as a means for analyzing document collections.
g. Key Personnel and Bibliography of Directly Related Work.
NAME: Shashikant Penumarthy
POSITION TITLE: Information Visualization Research Scientist, ChalkLabs, Bloomington, Indiana
eRA COMMONS USER NAME: shashikantp

EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)

INSTITUTION AND LOCATION | DEGREE (if applicable) | YEAR(s) | FIELD OF STUDY
University of Mumbai | B.E. | 2002 | Electrical Engineering
Indiana University | M.S. | 2004 | Computer Science
Indiana University | Ph.D. | 2009 (antic.) | Information Visualization
A. Positions and Honors
Fall 2008-present: Research Scientist, ChalkLabs, Bloomington, IN
Summer 2007-Fall 2008: Visualization Consultant, Mind Alliance Systems, Roseland, NJ
Fall 2006-Summer 2007: Research Assistant, Professor Susan Herring, Indiana University
Summer 2006-Fall 2006: Visualization Research Intern, Microsoft Research, Redmond, WA
Fall 2003-Spring 2006: Research Assistant, Prof. Katy Börner (InfoVis Lab), Indiana University
B. Relevant publications
1. Börner, K., Penumarthy, S., Meiss, M., & Ke, W. (2006). Mapping the diffusion of scholarly knowledge among major US research institutions. Scientometrics, 68(3), 415-426.
2. Herr, B. W., Huang, W., Penumarthy, S., & Börner, K. (2008). Designing highly flexible and usable cyberinfrastructures for convergence. Annals of the New York Academy of Sciences, 1093(1, Progress in Convergence: Technologies for Human Wellbeing), 161-179.
3. Penumarthy, S., & Börner, K. (2003). The ActiveWorld Toolkit: Analyzing and visualizing social diffusion patterns in 3D virtual worlds. Workshop on Virtual Worlds: Design and Research Directions, MIT, Boston, MA.
4. Börner, K., & Penumarthy, S. (2004). Information Visualization Cyberinfrastructure. Position paper at the Workshop on Information Visualization Software Infrastructures, InfoVis 2004, Austin, TX.
5. Huang, W., Herr, B., Penumarthy, S., Markines, B., & Börner, K. (2006). CIShell: A plug-in based software architecture and its usage to design an easy to use, easy to extend cyberinfrastructure for network scientists. Network Science Conference.
6. Börner, K., & Penumarthy, S. (2007). Spatio-temporal information production and consumption of major US research institutions. Proceedings of ISSI, 1, 635-641.
7. Herr II, B. W., Duhon, R. J., Börner, K., Hardy, E. F., & Penumarthy, S. (2008) 113 years of physical review: Using flow maps to show temporal and topical citation patterns. Proceedings of the 12th International Conference on Information Visualization, Oct 19-24 Columbus OH.
C. Research Support
NSF IIS-0238261, Börner (PI), 10/01/03 - 09/30/08
National Science Foundation CAREER: Visualizing Knowledge Domains
This project aims to bring the power of knowledge domain visualizations to any desktop connected to the Internet.
Role: Research Assistant and Technical Lead
Microsoft Research, Susan Herring (PI), 09/01/2006 - 05/10/2007
9-Month Unrestricted Cash Gift
The project analyzed the effect of spam on the vitality of newsgroups in multiple languages over time.
Role: Research Assistant
NAME: Gully Alexander Peter Carey Burns
POSITION TITLE: Neuroinformatics Research Scientist, Information Sciences Institute, University of Southern California
eRA COMMONS USER NAME: GullyBurns

EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)
INSTITUTION AND LOCATION | DEGREE (if applicable) | YEAR(s) | FIELD OF STUDY
Imperial College, London, England | B.Sc. | 1992 | Physics
Oxford University, Oxford, England | D.Phil. | 1997 | Physiology
A. Positions and Honors
1992-1995 Research Assistant, Laboratory of Physiology, Oxford University, England.
1995-1997 Research Assistant, Neural Systems Group, Newcastle University, England.
1997-1999 Research Associate, Swanson Laboratory, Department of Neurobiology, USC.
1999-2006 Research Assistant Professor, Department of Neurobiology, USC.
2006-present Neuroinformatics Research Scientist, Information Sciences Institute, USC.
B. Relevant publications (selected from 33 publications)
1. Burns, G.A.P.C., D. Feng, and E.H. Hovy (2008), "Intelligent Approaches to Mining the Primary Research Literature: Techniques, Systems, and Examples", in "Computational Intelligence in Biomedicine". In press; to appear in Series in Studies in Computational Intelligence, Springer-Verlag, Germany.
2. Burns, G., D. Feng, T. Ingulfsen, and E. Hovy (2007), "Infrastructure for Annotation-Driven Information Extraction from the Primary Scientific Literature: Principles and Practice". In 1st IEEE International Workshop on Service Oriented Technologies for Biological Databases and Tools (SOBDAT 2007), Salt Lake City.
3. Burns, G., W.-C. Cheng, R.F. Thompson, and L. Swanson (2006), "The NeuARt II system: a viewing tool for neuroanatomical data based on published neuroanatomical atlases". BMC Bioinformatics, 7:531.
4. Burns, G.A. and W.C. Cheng (2006), "Tools for Knowledge Acquisition within the NeuroScholar system and their application to anatomical tract-tracing data". J Biomed Discov Collab, 1(1): p. 10.
5. Khan, A., J. Hahn, W.-C. Cheng, A. Watts, and G. Burns (2006), "NeuroScholar's Electronic Laboratory Notebook and its Application to Neuroendocrinology". Neuroinformatics, 4(2): p. 139-160.
6. Burns, G.A., A.M. Khan, S. Ghandeharizadeh, M.A. O'Neill, and Y.S. Chen (2003), "Tools and approaches for the construction of knowledge models from the neuroscientific literature". Neuroinformatics, 1(1): p. 81-109.
7. Burns, G., F. Bian, W.-C. Cheng, S. Kapadia, C. Shahabi, and S. Ghandeharizadeh (2002), "Software engineering tools and approaches for neuroinformatics: the design and implementation of the View-Primitive Data Model framework (VPDMf)". Neurocomputing, 44-46: p. 1049-1056.
8. Stephan, K.E., L. Kamper, A. Bozkurt, G.A. Burns, M.P. Young, and R. Kötter (2001), "Advanced database methodology for the Collation of Connectivity data on the Macaque brain (CoCoMac)". Philos Trans R Soc Lond B Biol Sci, 356(1412): p. 1159-86.
9. Burns, G., K. Stephan, B. Ludäscher, A. Gupta, and R. Kötter (2001), "Towards a federated neuroscientific knowledge management system using brain atlases". Neurocomputing, 38-40: p. 1633-1641.
10. Burns, G.A. (2001), "Knowledge management of the neuroscientific literature: the data model and underlying strategy of the NeuroScholar system". Philos Trans R Soc Lond B Biol Sci, 356(1412): p. 1187-208.
11. Stephan, K.E., C.C. Hilgetag, G.A. Burns, M.A. O'Neill, M.P. Young, and R. Kötter (2000), "Computational analysis of functional connectivity between areas of primate cerebral cortex". Philos Trans R Soc Lond B Biol Sci, 355(1393): p. 111-26.
12. Hilgetag, C.C., G.A. Burns, M.A. O'Neill, J.W. Scannell, and M.P. Young (2000), "Anatomical connectivity defines the organization of clusters of cortical areas in the macaque monkey and the cat". Philos Trans R Soc Lond B Biol Sci, 355(1393): p. 91-110.
13. Burns, G.A. and M.P. Young (2000), "Analysis of the connectional organization of neural systems associated with the hippocampus in rats". Philos Trans R Soc Lond B Biol Sci, 355(1393): p. 55-70.
C. Research Support
R01 GM 083871-1 (Burns) 4/1/2007 - 3/31/2012 4.80 calendar
BioScholar: a Biomedical Knowledge Engineering framework based on the published literature
This work is a continuation of the NeuroScholar project funded by NLM. The major goal of this project is to create a deployable knowledge management / engineering system for bench scientists that may be constructed, curated and maintained within a single laboratory.

HHSN271200800426P (Burns) 5/6/2008 - 8/5/2008 0.75 calendar
Topic Maps for CRISP
The major goal of this project is to build tools that permit users to browse online 'topic maps' for the CRISP database. This is the forerunner to the current SBIR application.

n/a (Kesselman) 6/11/2008 - 6/10/2013 4.80 calendar
St. John's Health Center, Center for Health Informatics
The Center for Health Informatics is a large-scale, multidisciplinary center (incorporating intelligent systems, high-throughput networking and grid computing) with the mission to deliver turnkey information processing and delivery solutions to the clinical community. Dr. Burns plays a leadership role within the center's approach to biomedical ontologies.
1 R01 LM07061-04 (Burns) 5/01/01-04/30/07 Knowledge Management of the Neuroscientific Literature
This project involves the construction of a knowledge management system for neuroscientific information contained in the literature. It incorporates ontological work, visualization and analysis development and a study of the neural circuits underlying defensive behavior in the rat. This system called NeuroScholar is complete as a functional prototype and has been released as an open source project to the community.
1-year E-Sciences unrestricted cash gift, Ghandeharizadeh (PI) 01/01/05 - 12/31/06, Microsoft
"Sangam, a system for integrating data to solve stress-circuitry-gene coupling"
This research is a spin-off from work on the NeuroScholar system, involving the Proteus software from the database laboratory of the Computer Science Department at the University of Southern California. Funded by a cash gift from Microsoft Research, the project is concerned with developing an 'E-Science' application built on integrating multiple sources of information into a single representation.
POSITION TITLE: Research Scientist, Department of Computer Science, University of California, Irvine
eRA COMMONS USER NAME: DAVID NEWMAN
EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)
INSTITUTION AND LOCATION DEGREE(if applicable) YEAR(s) FIELD OF STUDY
University of Melbourne, Australia B.S. 1986 Engineering
Princeton University M.S. 1992 Engineering
Princeton University Ph.D. 1996 Engineering
A. Positions and Honors
1990-1994 Research Assistant, Princeton University, NJ
1995-1996 Research Assistant, Brown University, RI
1997-1998 Postdoctoral Fellow, California Institute of Technology, CA
2001-2005 Research Scientist, Dept. of Earth System Science, University of California, Irvine, CA
2005-present Research Scientist, Dept. of Computer Science, University of California, Irvine, CA
Honors and Awards
2007 TeraGrid award to investigate large-scale topic modeling using TeraGrid resources.
1996 Massachusetts Institute of Technology Young Investigator Award.
B. Relevant publications (selected from 17 publications)
1. Porteous, Newman, Ihler, Asuncion, Welling, Smyth. Fast Gibbs Sampling for Latent Dirichlet Allocation. In ACM SIGKDD Knowledge Discovery and Data Mining 2008.
2. Newman, Asuncion, Welling, Smyth. Distributed Inference for Latent Dirichlet Allocation. In Neural Information Processing Systems 2007.
3. Newman, Hage, Chemudugunta, Smyth. Subject Metadata Enrichment using Statistical Topic Models. In Joint Conference on Digital Libraries 2007.
4. Hage, Chapman, Newman. Enhancing Search and Browse Using Automated Clustering of Subject Metadata. In D-Lib Magazine, July/August 2007.
5. Newman, Chemudugunta, Smyth, Steyvers. Analyzing Entities and Topics in News Articles Using Statistical Topic Models. In Intelligence and Security Informatics 2006.
6. Newman, Chemudugunta, Smyth, Steyvers. Statistical Entity-Topic Models. In ACM SIGKDD Knowledge Discovery and Data Mining 2006.
7. Newman, Smyth, Steyvers. Scalable Parallel Topic Models. In Journal of Intelligence Community Research and Development 2006.
8. Newman and Block. Probabilistic Topic Decomposition of an Eighteenth Century Newspaper. In Journal of the American Society for Information Science and Technology, 2006.
9. Teh, Newman, Welling. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In Neural Information Processing Systems 2006.
C. Research Support
NIH NINDS Contract ($80,000) 6/2009 - 10/2009
Topic Maps for CRISP (NIH database of funded grants).
Role: PI

IMLS Research Award ($750,000) 10/2009 - 9/2011
Improving search and browse in digital libraries using topic modeling.
Role: PI
POSITION TITLE: Information Visualization Engineer, ChalkLabs, Bloomington, Indiana
eRA COMMONS USER NAME: bherr2
EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)
INSTITUTION AND LOCATION DEGREE(if applicable) YEAR(s) FIELD OF STUDY
Indiana University B.S. 2004 Computer Science
A. Positions and Honors
Summer 2008-present ChalkLabs, Bloomington, IN
2004-Summer 2008 InfoVis Lab, Indiana University
B. Relevant publications
1. Katy Börner, Elisha F. Hardy, Bruce W. Herr II, Todd M. Holloway, W. Bradford Paley. (2007). Taxonomy Visualization in Support of the Semi-Automatic Validation and Optimization of Organizational Schemas. Journal of Informetrics, 1(3), 214-225, Elsevier.
2. Bruce W. Herr II, Weixia Huang, Shashikant Penumarthy, Katy Börner. (2007). Designing Highly Flexible and Usable Cyberinfrastructures for Convergence. In Bainbridge, William S. & Roco, Mihail C. (Eds.), Progress in Convergence - Technologies for Human Wellbeing (Vol. 1093, pp. 161-179), Annals of the New York Academy of Sciences, Boston, MA.
3. Bruce W. Herr II, Weimao Ke, Elisha F. Hardy, Katy Börner. (2007). Movies and Actors: Mapping the Internet Movie Database. Conference Proceedings of the 11th Annual Information Visualization International Conference (IV 2007), Zürich, Switzerland, July 4-6, IEEE Computer Society Conference Publishing Services, pp. 465-469.
C. Research Support
NSF IIS-0238261, Börner (PI), 10/01/03 - 09/30/08, National Science Foundation
CAREER: Visualizing Knowledge Domains
This project aims to bring the power of knowledge domain visualizations to any desktop connected to the Internet.
Role: software developer

NSF IIS-0513650, Börner (PI), 09/01/05 - 08/31/08, National Science Foundation
SEI: NetWorkBench: A Large-Scale Network Analysis, Modeling and Visualization Toolkit for Biomedical, Social Science and Physics Research
This project will design, evaluate, and operate a unique distributed, shared-resources environment for large-scale network analysis, modeling, and visualization, named Network Workbench (NWB).
Role: software developer
POSITION TITLE: Chief Technologist & CEO, ChalkLabs, Bloomington, Indiana
eRA COMMONS USER NAME: GL-CHALKLABS
EDUCATION/TRAINING (Begin with baccalaureate or other initial professional education, such as nursing, and include postdoctoral training.)
INSTITUTION AND LOCATION DEGREE(if applicable) YEAR(s) FIELD OF STUDY
University of Puget Sound B.A. 1995 Foreign Literatures & Computer Science
Indiana University M.S. 2006 Information Science
A. Positions and Honors
Summer 2008-present ChalkLabs, Bloomington, IN
Fall 2007-Summer 2008 Mind Alliance Systems, Roseland, NJ
Spring 2004-Fall 2007 InfoVis Lab, Indiana University
Honors and Awards
Fall 2004 Fellow, Swedish Collegium for Advanced Study, Uppsala, Sweden
Spring 2006 Fellow, Swedish Collegium for Advanced Study, Uppsala, Sweden
B. Relevant publications
1. Gavin LaRowe, Sumeet Ambre, Weimao Ke & Katy Börner (2008). The Scholarly Database and Its Utility for Scientometrics Research. To appear in the special issue of Scientometrics on the 11th ISSI.
2. LaRowe, Gavin, Ambre, Sumeet Adinath, Burgoon, John W., Ke, Weimao & Börner, Katy. (2007). The Scholarly Database and Its Utility for Scientometrics Research. In Torres-Salinas, D. & Moed, H. F. (Eds.), Proceedings of the 11th International Conference on Scientometrics and Informetrics, Madrid, Spain, June 25-27, pp. 457-462.
3. LaRowe, Gavin, Ichise, Ryutaro & Börner, Katy. (2007). Visualizing Japanese Co-Authorship Data. Proceedings of the 11th Annual Information Visualization International Conference, Zürich, Switzerland, July 4-6, IEEE Computer Society Conference Publishing Services, pp. 459-464.
C. Research Support
NSF IIS-0534909, Börner (PI), 3/15/2006 - 8/01/2007, National Science Foundation
COLLABORATIVE SYSTEMS: Social Networking Tools to Enable Collaboration in the Tobacco Surveillance, Epidemiology, and Evaluation Network (TSEEN)
The project is a pioneering effort at incorporating social network referral tools as an integral part of collaborative systems within the context of digital government. First, the proposed project will extend theoretical understanding of the emergence of collaboration network structures involving multidimensional networks.
Role: DBA & Software Lead
h. Subcontractors/Consultants
ChalkLabs will subcontract work to topicSeek LLC and ISI. All work for topicSeek LLC will be performed solely by Dr. Newman, who will perform the topic-modeling work described in Section c.4, "Integrate the Topic Model algorithm to analyze and generate similarity networks for visualization". ChalkLabs will assign specific data analysis and modeling tasks to topicSeek. Newman will be responsible for generating similarity networks of documents and for assisting in the design of the data model to be used in the software application. Newman's role is also indicated in an accompanying letter of support from topicSeek. All work for ISI will be performed by Dr. Burns, who is a subject-matter expert in the area of biomedical research. ChalkLabs will use Burns' expertise to collect requirements from NIH researchers, design the user interface, perform usability testing and create the final report as described in Sections c.1 and c.6. Burns' role is also indicated in an accompanying letter of support from ISI.
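The topic-modeling workflow described above (fit a topic model to the document collection, then connect documents whose inferred topic mixtures are similar) can be sketched in a few lines. This is only an illustration of the general technique, not the deliverable: it uses scikit-learn's off-the-shelf LDA implementation rather than Newman's own inference code, and the toy corpus, two-topic setting, and 0.9 similarity threshold are all hypothetical placeholders.

```python
# Sketch: build a document-similarity network from LDA topic mixtures.
# Corpus, topic count, and threshold below are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "tumor suppressor gene expression in cancer cells",
    "gene expression profiling of tumor samples",
    "neural circuits underlying defensive behavior in the rat",
    "hippocampus neural connectivity in rats",
]

# Bag-of-words counts, then per-document topic mixtures via LDA.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

# Similarity network: an edge between documents whose mixtures are close.
sim = cosine_similarity(doc_topics)
edges = [(i, j, sim[i, j])
         for i in range(len(docs))
         for j in range(i + 1, len(docs))
         if sim[i, j] > 0.9]
for i, j, w in edges:
    print(f"doc{i} -- doc{j} (similarity {w:.2f})")
```

In the proposed system the corpus would be CRISP grant abstracts at a much larger scale, and the resulting weighted edge list would feed the layout and visualization stages rather than being printed.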
i. Facilities and Equipment
Name: ChalkLabs, Bloomington, IN (offeror organization)
Primary Contact: Gavin LaRowe, Owner & CEO
Desc: ChalkLabs is a small business entity that focuses on advanced research and web application development and services, with strong emphases in network science, information visualization, and data mining.
Office: ChalkLabs is located in the IU Research and Technology Park in the Showers complex in Bloomington, IN. The primary mission of the Research and Technology Park is to 'establish a first-class research park at Bloomington that will be a focal point for future partnerships between university researchers and industry.' With over 52,000 square feet of leasable office space, the park houses many cutting-edge IU-related technology and research organizations, such as the Pervasive Technologies Lab, the Advanced Network Management Lab, the Open Systems Lab, and the Internet2 research offices. In addition, six other IT-related businesses, including Information In Place, Inc. and RightRez, occupy this space. ChalkLabs is also part of the Inventure technology incubator run by the Small Business Development Corporation (SBDC) in Bloomington, IN, providing access to many local businesses, venture capital firms, and finance organizations affiliated with the SBDC.
Staff: As of 9/02/08, ChalkLabs has 5 FTEs. Based on current projections, this number will double by December 2008.
Computer: ChalkLabs administers a robust network of Linux-, OS X-, and Windows-based servers and desktop machines dedicated to both research and development. All of these computers are protected by an uninterruptible power supply and backup generators for 24x7x365 operation. Research project staff have an average of over one workstation per staff member, connected to a high-performance switched 10Gbps ethernet LAN backbone with 10Gbps connectivity to external research networks.

Aside from internal computing resources, various staff members have access to research grid-computing and cluster-computing infrastructures at Indiana University, such as 'Big Red' and 'AVIDD'. The Big Red cluster, IU's latest high-performance computing system, is a 512-node distributed shared-memory cluster designed around IBM's BladeCenter JS21. Each JS21 node contains two dual-core 2.5GHz PowerPC 970MP processors, 8GB ECC SDRAM, a 72GB SATA hard disk for local scratch space, and a PCI-X Myrinet 2000 adapter for high-bandwidth, low-latency MPI applications. In its initial configuration, the cluster runs SuSE Linux Enterprise Server 9, with IBM's LoadLeveler and the Moab Workload Manager for batch job management. Big Red users have access to a 266TB GPFS filesystem for analysis and temporary storage of large datasets, as well as native access via the Lustre client to the 535TB Data Capacitor. IBM's PowerPC 970MP processor contains two double-precision floating point units per core. A single node contains four cores, each capable of four floating point operations per cycle. The Myrinet 2000 interconnect provides a 2+2Gb/s, low-latency (2.6-3 microseconds) network for MPI communication. Each JS21 is equipped with a Myricom M3S-PCIXD-2-I adapter connected directly to one of two Myricom M3-CLOS-ENCL 256-port switches.
The AVIDD (Analysis and Visualization of Instrument-Driven Data) facility is a distributed 2.2 TeraFLOPS Linux cluster, including more than 10 TB of disk space. AVIDD is frequently used to parse and analyze very large data sets. IU also operates a 1.02 TFLOPS IBM SP, which includes a large-memory Regatta node (96 GB RAM for this node alone). The Regatta node of the SP is often used for analysis of very large data sets, and for importing external public data sources into Oracle databases. _______
Name: Information Sciences Institute (ISI), Marina del Rey, CA
Primary Contact: Gully Burns, ISI
Desc.: ISI-USC is one of the premier institutes for advanced artificial intelligence research and new media applications to education and training, with more than 300 researchers working on innovative technology applications sponsored by DARPA, ARDA, NSF, NSA, and other agencies.
Office: ISI includes office space in a 12 story office building located 20 minutes from the main USC campus; each floor of the building includes multiple conference rooms and video conference units. All team members have separate near-adjoining offices on the north side of the building on the fourth floor.
Computer: The computer center has been an integral part of ISI since its founding in 1972. Today's Information Processing Center (IPC) maintains a state-of-the-art computing environment and staff to provide the technical effort required to support the performance of research. Resources include client platform and server hardware support, distributed print services, network and remote access support, operating systems and application software support, computer center operations, and help desk coverage. The IPC also acts as a technical liaison to the ISI community on issues of acquisition and integration of computing equipment and software.
The Center's servers are protected by an uninterruptible power supply and backup generator to ensure availability 24 hours a day, 365 days a year. A rich mix of computer and network equipment, along with modern software tools for the research community's use, provides a broad selection of capabilities, including Unix-based Sun servers and Windows-based Dell servers used for electronic mail and group calendaring, web services, and file and mixed application serving. File servers utilize high-performance RAID and automated backup to facilitate performance and data protection. Computer room space is also available to researchers for hosting project-related servers. In addition, research staff have access to grid-enabled cluster computing, and to USC's 5,400-CPU compute cluster with low-latency Myrinet interconnect, the largest academic supercomputing resource in Southern California. All printers (color and b/w) are networked and available for unrestricted use. This includes one color photocopier per floor.
Research project staff have an average of over one workstation per staff member, connected to a high performance switched 10Gbps ethernet LAN backbone with 10Gbps connectivity to research networks such as Internet 2, as well as additional network resources such as IP multicast, 802.11b and 802.11g wireless, H323 point-to-point and multipoint videoconferencing, webcasting and streaming media.
Bibliography
Adai, A. T., Date, S. V., Wieland, S., & Marcotte, E. M. (2004). LGL: Creating a map of protein function with an algorithm for visualizing very large biological networks. Journal of Molecular Biology, 340(1), 179-190.
Alvarez-Hamelin, J. I., Dall'Asta, L., Barrat, A., & Vespignani, A. (2005). K-core decomposition: A tool for the visualization of large scale networks. arXiv preprint cs.NI/0504107.
Batagelj, V., & Mrvar, A. (1998). Pajek: Program for large network analysis. Connections, 21(2), 47-57.
Bederson, B. B., Shneiderman, B., & Wattenberg, M. (2002). Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies. ACM Transactions on Graphics (TOG), 21(4), 833-854.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
Börner, K., Sanyal, S., & Vespignani, A. (2007). Network science. Annual Review of Information Science and Technology, 41, 537.
Burns, G., Herr, B., Newman, D., Ingulfsen, T., Pantel, P., & Smyth, P. (2007). A snapshot of neuroscience: Unsupervised natural language processing of abstracts from the Society for Neuroscience 2006 annual meeting. Annual Meeting of the Society for Neuroscience, San Diego. http://scimaps.org/maps/neurovis. p. 100.6 / XX26.
Davidson, G. S., Hendrickson, B., Johnson, D. K., Meyers, C. E., & Wylie, B. N. (1998). Knowledge mining with VxInsight: Discovery through interaction. Journal of Intelligent Information Systems, 11(3), 259-285.
Di Battista, G., Eades, P., Tamassia, R., & Tollis, I. G. (1998). Graph drawing: Algorithms for the visualization of graphs. Prentice Hall PTR, Upper Saddle River, NJ, USA.
Ellson, J., Gansner, E. R., Koutsofios, E., North, S. C., & Woodhull, G. (2003). Graphviz and Dynagraph: Static and dynamic graph drawing tools. Graph Drawing Software, 127-148.
Fruchterman, T. M. J., & Reingold, E. M. (1991). Graph drawing by force-directed placement. Software: Practice and Experience, 21(11), 1129-1164.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proc Natl Acad Sci U S A, 101(Suppl 1), 5228-5235.
Lima, M. (2007). Visual Complexity. Online at: http://www.visualcomplexity.com. Last accessed: Oct 13, 2008.
Martin, S., Brown, W. M., Klavans, R., & Boyack, K. W. (2007). DrL: Distributed Recursive (Graph) Layout. Journal of Graph Algorithms and Applications, 1(1), 1.
Moere, A. V. (2007). Information Aesthetics. Online at: http://infosthetics.com. Last accessed: Oct 13, 2008.
Newman, D., Chemudugunta, C., Smyth, P., & Steyvers, M. (2006). Analyzing entities and topics in news articles using statistical topic models. LNCS, IEEE ISI, San Diego.
Palantir Technologies. (n.d.). Palantir Technologies. Online at: http://www.palantirtech.com/. Last accessed: Oct 13, 2008.
Plaisant, C., Grosjean, J., & Bederson, B. B. (2002). SpaceTree: Supporting exploration in large node link trees, design evolution and empirical evaluation. University of Maryland College Park, Human-Computer Interaction Lab.
Schroeder, W., Martin, K. M., & Lorensen, W. E. (1998). The Visualization Toolkit: An object-oriented approach to 3D graphics. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Siek, J., Lee, L. Q., & Lumsdaine, A. (2002). The Boost Graph Library: User guide and reference manual. Addison-Wesley.
Tong, A. H., G. Lesage, G. D. Bader, H. Ding, H. Xu, X. Xin, J. Young, G. F. Berriz, R. L. Brost, M. Chang, Y. Chen, X. Cheng, G. Chua, H. Friesen, D. S. Goldberg, J. Haynes, C. Humphries, G. He, S. Hussein, L. Ke, N. Krogan, Z. Li, J. N. Levinson, H. Lu, P. Menard, C. Munyana, A. B. Parsons, O. Ryan, R. Tonikian, T. Roberts, A. M. Sdicu, J. Shapiro, B. Sheikh, B. Suter, S. L. Wong, L. V. Zhang, H. Zhu, C. G. Burd, S. Munro, C. Sander, J. Rine, J. Greenblatt, M. Peter, A. Bretscher, G. Bell, F. P. Roth, G. W. Brown, B. Andrews, H. Bussey, & C. Boone (2004). Global mapping of the yeast genetic interaction network. Science, 303(5659), 808-813.
TouchGraph, LLC. TouchGraph. Online at: http://www.touchgraph.com. Last accessed: Oct 13, 2008.
van Ham, F., & van Wijk, J. J. (2003). Beamtrees: Compact visualization of large hierarchies. Information Visualization, 2(1), 31-39.
Current Awards and Pending Proposals/Applications
None