HATHI TRUST
A Shared Digital Repository
Delivering Data for New Generations of Research
Strategies and Challenges
Jeremy York
NISO/BISG Forum
ALA 2010
Introduction
• Digital Repository
– Initial focus on digitized book and journal content
– “Light” archive
• Collections and Collaboration
– Comprehensive collection
– Shared strategies
– Local services
– Public Good
Content Distribution
• In Copyright: 81%
• Public Domain: 19%
• 6,173,575 – Total
• 1,177,667 – Public Domain
* As of June 15, 2010
Language Distribution (1)
The top 10 languages make up ~86% of all content
• English: 48%
• German: 8%
• French: 7%
• Russian: 5%
• Spanish: 4%
• Chinese: 4%
• Japanese: 4%
• Italian: 3%
• Arabic: 2%
• Polish: 1%
• Remaining Languages: 14%
* As of June 15, 2010
Language Distribution (2)
The next 40 languages make up ~13% of total
[Pie chart. Slices with legible values: Hindi 6%, Portuguese 6%, Hebrew 6%, Indonesian 6%, Dutch 5%, Urdu 4%, Swedish 4%, Turkish 4%, Unknown 4%, Czech 3%, Thai 3%, Undetermined 3%, Persian 3%, Vietnamese 2%, Hungarian 2%, Ukrainian 2%, Bengali 2%, Norwegian 2%, Korean 2%, Armenian 1%, Greek 1%, Panjabi 1%, Malay 1%, Catalan 1%, Malayalam 1%, Finnish 1%. Further labeled slices: Latin, Danish, Croatian, Tamil, Sanskrit, Bulgarian, Slovak, Serbian, Romanian, Ancient‐Greek, Yiddish, Slovenian, Multiple.]
* As of June 15, 2010
Originating Institution
• University of Michigan: 65%
• University of California: 25%
• University of Wisconsin, Indiana University, Penn State University, University of Minnesota: slices of 6%, 3%, 1%, and 0% (the chart does not show which value belongs to which institution)
* As of June 15, 2010
Content over time
[Chart: share of content by originating institution, 0% to 100%, by month beginning Sep‐04. Series: Michigan, Wisconsin, Indiana, California, Penn State, Minnesota.]
* As of June 15, 2010
Content Growth
Data Distribution & APIs
• OAI‐PMH
• Metadata files
• Bibliographic API
• Data API
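A minimal sketch of the OAI‐PMH route listed above. The protocol passes a verb such as `ListRecords` and its arguments as query parameters to a repository's base URL and returns XML. The base URL and the sample response here are placeholders for illustration, not HathiTrust's actual endpoint or data.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical base URL; substitute the repository's real OAI-PMH endpoint.
BASE_URL = "https://repository.example.org/oai"

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Invented response, shaped like an OAI-PMH ListRecords reply in oai_dc.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build a ListRecords request URL from standard OAI-PMH arguments."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return BASE_URL + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        title = record.find(".//" + DC_NS + "title")
        pairs.append((identifier, title.text if title is not None else None))
    return pairs


if __name__ == "__main__":
    print(list_records_url(from_date="2010-06-15"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would also follow the `resumptionToken` element that OAI‐PMH uses to page through large result sets; that step is left out of this sketch.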
Extended Services
• Community Development Environment
• Non‐Google Ingest
• Non‐Book/Non‐Journal Ingest
• Computational Research
Strategies for Computational Research
• Data distribution
• Protocol‐based access
• Research Center
SEASR Architecture
[Architecture diagram; labeled elements:]
• Visualizations, Apps, Services, Plugins, Web Apps
• User Interfaces
• Components
• Meandre Data‐Intensive Flows
• Developer Tools
• Repositories
• Meandre Workbench
• Components
• Meandre Infrastructure
• Visualization
• Component Repository, Component Discovery
• Analytics, Data
• Data, Analysis, Components, Flows
• Virtualization Infrastructure
• Cloud Computing
SEASR @ Work – Tag Cloud
• Count tokens
• Filter options supported
• Stem words
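The three tag cloud steps (count tokens, filter, stem) might be sketched as follows. This is an illustration only, not the SEASR flow itself: the stop-word list and the suffix-stripping stemmer are toy stand-ins chosen for this example.

```python
from collections import Counter
import re

# Illustrative stop-word list; a real flow would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def naive_stem(token):
    """Toy suffix stripper for illustration only (not the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tag_cloud_counts(text, stop_words=STOP_WORDS, stem=True, min_length=2):
    """Tokenize, filter, optionally stem, and count the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_length]
    if stem:
        kept = [naive_stem(t) for t in kept]
    return Counter(kept)


if __name__ == "__main__":
    counts = tag_cloud_counts("The books and the book; reading reads")
    # In a tag cloud, each count would set the display size of its word.
    for word, count in counts.most_common():
        print(word, count)
```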
SEASR @ Work – Entity Mash-up
• Entity Extraction with OpenNLP or Stanford NER
• Locations viewed on Google Map
• Dates viewed on Simile Timeline
SEASR @ Work – Entities To Network
• Identify entities
• Define relationships between entities within same sentence
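A rough sketch of the same-sentence relationship rule. It assumes the entities have already been identified (here they are simply passed in as a list) and uses a naive sentence splitter and substring matching; the sample text and names are invented.

```python
from collections import Counter
from itertools import combinations
import re


def cooccurrence_edges(text, entities):
    """Count how often each pair of known entities shares a sentence."""
    edges = Counter()
    # Naive sentence split: ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Substring matching is a simplification of real entity extraction.
        present = sorted(e for e in entities if e in sentence)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    sample = "Alice met Bob in Paris. Bob left Paris. Alice stayed."
    for (a, b), weight in cooccurrence_edges(sample, ["Alice", "Bob", "Paris"]).items():
        print(a, b, weight)
```

Each counted pair becomes a weighted edge in the network, so entities that appear together often are linked more strongly.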
SEASR @ Work – Text Clustering
• Clustering of Text by token counts
• Filtering options for stop words, Part of Speech
• Dendrogram Visualization
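One way to picture clustering by token counts is the sketch below. It assumes cosine distance and single-linkage agglomerative clustering, neither of which the slides specify, and it returns the merge order as nested tuples that a dendrogram could be drawn from. Part-of-speech filtering is omitted.

```python
from collections import Counter
import math
import re

# Illustrative stop-word list.
STOP_WORDS = {"the", "a", "of", "and", "in", "on"}


def token_counts(text):
    """Token-count vector for one document, with stop words removed."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_distance(a, b):
    """1 minus cosine similarity of two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def cluster(docs):
    """Single-linkage agglomerative clustering; returns a nested-tuple tree."""
    vectors = {name: token_counts(text) for name, text in docs.items()}
    # Each cluster is (tree, member_names).
    clusters = [(name, [name]) for name in docs]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    cosine_distance(vectors[x], vectors[y])
                    for x in clusters[i][1]
                    for y in clusters[j][1]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


if __name__ == "__main__":
    docs = {
        "a": "cats chase mice",
        "b": "cats chase birds",
        "c": "stocks fell sharply",
    }
    print(cluster(docs))
```

Documents that share the most vocabulary merge first, so they end up nested most deeply in the tree.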
SEASR @ Work – Audio Analysis
• NEMA: Executes a SEASR flow for each run
– Loads audio data
– Extracts features for every 10 sec moving window of audio
– Loads and applies the models
– Sends results back to the WebUI
• NESTER: Annotation of Audio via Spectral Analysis
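The 10 sec moving window step can be illustrated as below. The 5 second hop between windows and the RMS energy feature are assumptions made for this sketch; the slides do not say how far the window advances or which features NEMA extracts.

```python
import math


def moving_windows(samples, sample_rate, window_seconds=10.0, hop_seconds=5.0):
    """Yield (start_time_seconds, window_samples) for each full window."""
    size = int(window_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)
    if size <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one sample")
    for start in range(0, len(samples) - size + 1, hop):
        yield start / sample_rate, samples[start:start + size]


def rms(window):
    """Root-mean-square energy; a stand-in for the real feature set."""
    return math.sqrt(sum(x * x for x in window) / len(window))


def extract_features(samples, sample_rate, **window_options):
    """Return one (start_time, feature) pair per moving window."""
    return [
        (start, rms(window))
        for start, window in moving_windows(samples, sample_rate, **window_options)
    ]


if __name__ == "__main__":
    # 30 seconds of a constant signal at 1 sample per second.
    for start, energy in extract_features([1.0] * 30, sample_rate=1):
        print(start, energy)
```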
SEASR @ Work – Zotero
• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics
– Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
– Zotero Export to Fedora through SEASR
– Saves results from SEASR Analytics to a Collection
• Launch MONK Processing
– MONK DB Ingestion Workflow
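To give a sense of how authors in a citation network can be ranked, here is a small PageRank sketch. PageRank is used only as one well-known network importance measure; the slides do not say which JUNG algorithm the flow uses, and the author names and edges are invented.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given (citing, cited) edge pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # Nodes that cite no one spread their rank evenly over all nodes.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return sorted(rank.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical citation edges: (citing author, cited author).
    citations = [("B", "A"), ("C", "A"), ("C", "B")]
    for author, score in pagerank(citations):
        print(author, round(score, 3))
```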
SEASR @ Work – Emotion Tracking
Goal: a visualization of this type that tracks emotions across a text document (leveraging flare.prefuse.org)
Sentiment Analysis: Visualization
Person Extraction: Scott's Waverley, Ivanhoe, and The Heart of Midlothian.
Location Extraction: Top: Walter Scott's Waverley; Bottom: Maria Edgeworth's Castle Rackrent