ALA 2010 -- Jeremy York

HATHI TRUST

A Shared Digital Repository

Delivering Data for New Generations of Research

Strategies and Challenges

Jeremy York

NISO/BISG Forum, ALA 2010

Introduction

• Digital Repository

– Initial focus on digitized book and journal content

– “Light” archive

• Collections and Collaboration

– Comprehensive collection

– Shared strategies

– Local services

– Public Good

Content Distribution

In Copyright: 81%

Public Domain: 19%

6,173,575 – Total

1,177,667 – Public Domain

* As of June 15, 2010

Language Distribution (1)

The top 10 languages make up ~86% of all content

English 48%
German 8%
French 7%
Russian 5%
Spanish 4%
Chinese 4%
Japanese 4%
Italian 3%
Arabic 2%
Polish 1%
Remaining Languages 14%

* As of June 15, 2010

Language Distribution (2)

The next 40 languages make up ~13% of total

Chart labels:
Hindi 6%, Portuguese 6%, Hebrew 6%, Indonesian 6%
Dutch 5%
Urdu 4%, Swedish 4%, Turkish 4%, Unknown 4%
Czech 3%, Thai 3%, Persian 3%, Undetermined 3%
Bengali 2%, Norwegian 2%, Hungarian 2%, Ukrainian 2%, Vietnamese 2%
Armenian 1%, Greek 1%, Panjabi 1%, Malay 1%, Catalan 1%, Malayalam 1%, Finnish 1%
Also labeled (values not legible in the source): Latin, Korean, Danish, Tamil, Croatian, Sanskrit, Bulgarian, Slovak, Serbian, Romanian, Ancient Greek, Yiddish, Slovenian, Multiple

* As of June 15, 2010

Originating Institution

University of Michigan 65%

Other chart labels: University of California, University of Wisconsin, Indiana University, Penn State University, University of Minnesota

Other chart values (pairing with labels not legible in the source): 25%, 6%, 3%, 1%, 0%

* As of June 15, 2010

Content over time

Chart: share of content by originating institution (0% to 100%) over time. Series: Michigan, Wisconsin, Indiana, California, Penn State, Minnesota. Time axis labels run in two-month steps starting at Sep-04 (Sep-04, Nov-04, Jan-05, Mar-05, May-05, Jul-05, Sep-05, Nov-05, Jan-06, Mar-06, May-06, ...).

* As of June 15, 2010

Content Growth

Data Distribution & APIs

• OAI-PMH

• Metadata files

• Bibliographic API

• Data API
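
As an illustration of the OAI-PMH option, a harvester pages through records by issuing `ListRecords` requests and following resumption tokens. The sketch below only builds the request URLs. The base URL `https://example.org/oai` is a placeholder and not an actual HathiTrust endpoint, and `oai_dc` is the Dublin Core prefix defined by the protocol, assumed here to be the one wanted.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, only the verb and the token are sent.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder base URL (an assumption, not a real repository endpoint).
BASE = "https://example.org/oai"
first_page = list_records_url(BASE)
next_page = list_records_url(BASE, resumption_token="abc123")
```

The first request names the metadata format; each later request carries only the token returned with the previous page, until the repository returns an empty token.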

Extended Services

• Community Development Environment

• Non-Google Ingest

• Non-Book/Non-Journal Ingest

• Computational Research

Strategies for Computational Research

• Data distribution

• Protocol-based access

• Research Center

SEASR Architecture

Diagram labels: User Interfaces; Visualizations; Apps; Services; Plugins; Web Apps; Components; Meandre Data-Intensive Flows; Developer Tools; Repositories; Meandre Workbench; Meandre Infrastructure; Visualization; Analytics; Data; Component Repository; Component Discovery; Data Analysis; Flows; Virtualization Infrastructure; Cloud Computing

SEASR @ Work – Tag Cloud

• Count tokens

• Filter options supported

• Stem words
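
The three tag cloud steps can be sketched in a few lines of Python. This is not the SEASR flow itself: the stop-word list and the suffix-stripping `naive_stem` function are toy stand-ins chosen for illustration, where a real flow would use a proper stemmer such as Porter's.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not SEASR's filter set).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def naive_stem(token):
    """Strip one common English suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tag_cloud_counts(text, stop_words=STOP_WORDS, stem=True):
    """Tokenize, filter stop words, optionally stem, then count tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return Counter(tokens)


counts = tag_cloud_counts("The archive archives books and the book scanning")
# counts maps each stem to its frequency; a tag cloud sizes words by it.
```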

SEASR @ Work – Entity Mash-up

• Entity Extraction with OpenNLP or Stanford NER

• Locations viewed on Google Map

• Dates viewed on Simile Timeline

SEASR @ Work – Entities To Network

• Identify entities

• Define relationships between entities within same sentence
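
A minimal sketch of the same-sentence relationship rule, assuming entity extraction has already produced one list of entity names per sentence. The function name `cooccurrence_edges` and the sample data are illustrative and not part of SEASR.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_per_sentence):
    """Count an undirected edge for every pair of distinct entities
    that appear together in the same sentence."""
    edges = Counter()
    for entities in entities_per_sentence:
        # Sort so each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical extractor output: one entity list per sentence.
sentences = [
    ["Waverley", "Edinburgh"],
    ["Edinburgh", "Waverley", "Ivanhoe"],
    ["Ivanhoe"],
]
edges = cooccurrence_edges(sentences)
```

The edge weights can then be handed to a graph layout or ranking tool; a sentence with a single entity contributes no edges.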

SEASR @ Work – Text Clustering

• Clustering of Text by token counts

• Filtering options for stop words, Part of Speech

• Dendrogram Visualization
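
To make clustering by token counts concrete, the sketch below represents each text as a token-count vector and merges the closest clusters step by step. The resulting merge history is the information a dendrogram draws. Cosine distance and single linkage are assumptions made for this example; the slides do not say which distance or linkage SEASR uses.

```python
import math
import re
from collections import Counter


def token_counts(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def agglomerate(docs):
    """Single-linkage agglomerative clustering.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of document indices.
    """
    vectors = [token_counts(d) for d in docs]
    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cosine_distance(vectors[x], vectors[y])
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


docs = ["castle castle knight", "castle knight knight", "ship sea storm"]
history = agglomerate(docs)
```

Stop-word or part-of-speech filtering would be applied inside `token_counts` before the vectors are compared.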

SEASR @ Work – Audio Analysis

• NEMA: Executes a SEASR flow for each run

– Loads audio data

– Extracts features for every 10 sec moving window of audio

– Loads and applies the models

– Sends results back to the WebUI

• NESTER: Annotation of Audio via Spectral Analysis
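
The 10 sec moving-window step in the NEMA flow can be pictured with a short sketch. The 5 sec hop between windows and the mean-absolute-amplitude feature are assumptions made for illustration; the slides state only the window length.

```python
def moving_windows(samples, sample_rate, window_sec=10.0, hop_sec=5.0):
    """Yield (start_time_sec, window) pairs over a sequence of samples.

    hop_sec is an assumed step size; only full-length windows are produced.
    """
    window = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one sample")
    for start in range(0, len(samples) - window + 1, hop):
        yield start / sample_rate, samples[start:start + window]


def mean_abs_amplitude(window):
    """Toy feature standing in for real audio features."""
    return sum(abs(x) for x in window) / len(window)


def extract_features(samples, sample_rate, **kwargs):
    """Compute one feature value per moving window."""
    return [(t, mean_abs_amplitude(w))
            for t, w in moving_windows(samples, sample_rate, **kwargs)]


# 30 seconds of synthetic audio at 1 sample per second, for illustration.
features = extract_features([1, -1] * 15, sample_rate=1)
```

Each (time, feature) pair would then be passed to the loaded models, and the per-window results sent back to the web interface.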

SEASR @ Work – Zotero

• Plugin to Firefox

• Zotero manages the collection

• Launch SEASR Analytics

– Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR

– Zotero Export to Fedora through SEASR

– Saves results from SEASR Analytics to a Collection

• Launch MONK Processing

– MONK DB Ingestion Workflow

SEASR @ Work – Emotion Tracking

Goal is to have this type of Visualization to track emotions across a text document (Leveraging flare.prefuse.org)

Sentiment Analysis: Visualization

Person Extraction: Scott's Waverley, Ivanhoe, and The Heart of Midlothian.

Location Extraction: Top: Walter Scott's Waverley; Bottom: Maria Edgeworth's Castle Rackrent

Thank you!

hathitrust‐[email protected]@umich.edu
