AGREEMENT No LC-00921441 CEF-TC-2018-15 eArchiving
Preparing Digital Collections for Big Data Analysis
Sven Schlarb, Austrian Institute of Technology
e-Archiving, Cordoba, Spain
05th October 2018
Digital Transformation
Copyright Doc Searls, https://flic.kr/p/9o5AEY
Digital Transformation
Copyright (network diagram) https://www.wikidata.org/wiki/User:Thepwnco, CC BY-SA 4.0
Archiving at internet scale
2003
2018
https://web.archive.org/web/*/https://www.cordoba.es/
Is big data still a hype? 2014
BIG DATA
Jeremykemp at English Wikipedia [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons
Is big data still a hype? 2015
Jeremykemp at English Wikipedia [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons
Is big data still a hype? 2018
BIG DATA
Jeremykemp at English Wikipedia [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons
To SQL or to NoSQL?
• Relational databases
• NoSQL databases
Different NoSQL database types

Key-Value
K1 AAA,BBB,CCC
K2 AAA,BBB
K3 AAA,DDD
K4 AAA,2,01/01/2018
K5 3,ZZZ,5623

Wide Column
Key  Participant      Conference
ID   Name    City     Name     Address     City
1    John    London   PVC2018  Townroad 2  Manchester
2    Linda   Palme    TFC2018  Market 2    Berlin

Document
{
"name": "Sven Schlarb",
"email": "sven.schlarb@ait.ac.at",
"events": [
{
"name": "Kulturhackathon openGLAM.at",
"date": "2018-09-22T00:00:00.000Z"
},
{
"name": "e-Archiving Cordoba",
"date": "2018-10-05T00:00:00.000Z"
}
]
}

Graph
Person, Event, Person
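To make the four data models on this slide more concrete, the sketch below expresses the slide's sample data with plain Python structures. It is only an illustration under the assumption that a dict or list can stand in for each kind of store; no real NoSQL database is involved, and the edge label `attended` is hypothetical.

```python
import json

# Key-value: opaque values looked up by key (values copied from the slide).
key_value = {
    "K1": "AAA,BBB,CCC",
    "K2": "AAA,BBB",
    "K3": "AAA,DDD",
}

# Wide column: each row key maps to column families, each with its own columns.
wide_column = {
    1: {"Participant": {"Name": "John", "City": "London"},
        "Conference": {"Name": "PVC2018", "Address": "Townroad 2", "City": "Manchester"}},
    2: {"Participant": {"Name": "Linda", "City": "Palme"},
        "Conference": {"Name": "TFC2018", "Address": "Market 2", "City": "Berlin"}},
}

# Document: a nested, self-describing record (parsed from JSON text).
document = json.loads("""
{
  "name": "Sven Schlarb",
  "events": [
    {"name": "Kulturhackathon openGLAM.at", "date": "2018-09-22T00:00:00.000Z"},
    {"name": "e-Archiving Cordoba", "date": "2018-10-05T00:00:00.000Z"}
  ]
}
""")

# Graph: nodes plus edges; the "attended" edge label is a hypothetical example.
nodes = {"p1": "Person", "p2": "Person", "e1": "Event"}
edges = [("p1", "attended", "e1"), ("p2", "attended", "e1")]


def attendees(event_id):
    """Return the ids of all nodes with an 'attended' edge to the given event."""
    return [src for src, label, dst in edges if dst == event_id and label == "attended"]
```

The same facts look different in each model: the key-value store cannot be queried by value, the document keeps a person's events nested inside the person, and the graph makes the shared event a node of its own.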
E-ARK Experimental Cluster
Job Tracker
Task Trackers
Data Nodes
Name Node

CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading cores)
RAM: 16GB
DISK: 2 x 1TB DISKs configured as RAID0 (performance) – 2 TB effective
• Of 16 HT cores: 5 for Map; 2 for Reduce; 1 for OS.
25 processing cores for Map tasks
10 processing cores for Reduce tasks

CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores)
RAM: 24GB
DISK: 3 x 1TB DISKs configured as RAID5 (redundancy) – 2 TB effective
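The Map and Reduce totals on this slide follow from the per-node split. As a small worked check, assuming five worker nodes (a number the slide does not state; it is inferred here because 25 / 5 and 10 / 2 both give 5):

```python
# Per-node slot allocation as stated on the slide.
MAP_SLOTS_PER_NODE = 5
REDUCE_SLOTS_PER_NODE = 2

# ASSUMPTION: the number of worker nodes is not given on the slide;
# 5 is inferred because 25 / 5 == 10 / 2 == 5.
WORKER_NODES = 5


def cluster_slots(nodes, map_per_node, reduce_per_node):
    """Return the total (map, reduce) slots available across all worker nodes."""
    return nodes * map_per_node, nodes * reduce_per_node


total_map, total_reduce = cluster_slots(WORKER_NODES, MAP_SLOTS_PER_NODE, REDUCE_SLOTS_PER_NODE)
print(total_map, total_reduce)
```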
Reference Implementation
Package transformation and Ingest
• Modular package transformation workflows & metadata creation
Full-text indexing & search
• Parallelize full-text indexing
Access
• Fast random access to individual files
Faceted Search & Data Mining
• Aggregating data using facet queries
• Data mining (Classification, NER)
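A facet query groups search hits by the values of a field and returns a count per value. The sketch below imitates that aggregation in plain Python over a few made-up index records; the field names `format` and `year` are hypothetical and no search server is involved.

```python
from collections import Counter

# Made-up index records, for illustration only.
records = [
    {"path": "pkg1/doc1.pdf", "format": "pdf", "year": 2003},
    {"path": "pkg1/doc2.xml", "format": "xml", "year": 2003},
    {"path": "pkg2/doc3.pdf", "format": "pdf", "year": 2018},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


print(facet_counts(records, "format").most_common())
```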
SIP
E-ARK Information Package (simplified)
representations
metadata
[schemas/documentation]
Structural metadata
Provenance metadata
Technical metadata
Descriptive metadata
SIP
DIP
DIP
Metadata edits
Migrations
Add emulation info
earkweb
• earkweb is based on Python and the Celery task execution system.
– Create archival workflows from predefined tasks which can be executed in parallel on a computer cluster.
– Examples are data validation, format migration, content extraction, database transformation, packaging, and interfacing with storage systems.
– earkweb provides a graphical interface and can be used interactively as well as in batch mode.
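The workflow idea can be sketched without the real system. The example below is not earkweb code and does not use Celery; it chains a few placeholder tasks and runs the chain for several packages in parallel with Python's standard `concurrent.futures`, which stands in here for the Celery workers. All task and package names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder tasks: each takes a package state (dict) and returns an updated copy.
def validate(pkg):
    # A package counts as valid here if it contains at least one file.
    return {**pkg, "valid": bool(pkg["files"])}


def migrate(pkg):
    # Pretend format migration: rename .doc files to .pdf.
    files = [f[:-4] + ".pdf" if f.endswith(".doc") else f for f in pkg["files"]]
    return {**pkg, "files": files}


def package(pkg):
    return {**pkg, "status": "packaged" if pkg["valid"] else "rejected"}


WORKFLOW = [validate, migrate, package]


def run_workflow(pkg, tasks=WORKFLOW):
    """Apply the tasks to one package, in order."""
    for task in tasks:
        pkg = task(pkg)
    return pkg


def run_batch(packages, workers=4):
    """Run the workflow for many packages in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_workflow, packages))


if __name__ == "__main__":
    batch = [{"id": "pkg1", "files": ["a.doc", "b.xml"]}, {"id": "pkg2", "files": []}]
    for result in run_batch(batch):
        print(result["id"], result["status"], result["files"])
```

Keeping each task a small function over a package state is what lets the same tasks be recombined into different workflows and distributed across workers.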
Cluster Deployment Stack
[Diagram: four Workers; Staging/Storage Area (NAS); <<package transfer>>; decoupled; <<notification>>; <<search and retrieval>>; Information package status; Task results]
Standalone Deployment Stack
[Diagram: four Workers; Staging/Storage Area (NAS); <<indexing>>; <<search and retrieval>>; Information package status; Task results]
Data Mining/NLP
• Purpose: Analyse digital resources of collections
• Selected use cases:
– Location names occurring in texts: named entity recognition and incorporation of geo-information
– Text classification
Location names occurring in texts
• Stanford NER for named entity recognition (NER)
• Nominatim (geocoding search engine used by openstreetmap.org) for georeferencing
• Peripleo for visualization
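As a hedged sketch of the georeferencing step: once NER has produced a place name, it can be looked up via Nominatim's HTTP search endpoint. The code below builds the request URL and reads coordinates from a JSON response. The NER step is not shown, the sample response is made up for illustration, and no network request is sent.

```python
import json
from urllib.parse import urlencode

# Public Nominatim instance; a self-hosted instance would use a different base URL.
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(place_name, limit=1):
    """Build a Nominatim search URL asking for JSON results for one place name."""
    query = urlencode({"q": place_name, "format": "json", "limit": limit})
    return f"{NOMINATIM_SEARCH}?{query}"


def first_coordinates(response_text):
    """Return (lat, lon) of the first hit as floats, or None if there are no hits."""
    hits = json.loads(response_text)
    if not hits:
        return None
    return float(hits[0]["lat"]), float(hits[0]["lon"])


# Made-up response in the shape Nominatim returns (lat/lon arrive as strings).
sample = '[{"display_name": "Cordoba, Spain", "lat": "37.88", "lon": "-4.78"}]'
print(build_search_url("Cordoba"))
print(first_coordinates(sample))
```

When many names are looked up against the public instance, its usage policy (rate limits, identifying user agent) applies; a batch job over a whole collection would more plausibly use a local installation.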
Location names occurring in texts
Peripleo - PELAGIOS Project
Geographical/timeline search
Peripleo - PELAGIOS Project
• Provided: GML data and TIFF images of maps with metadata (coordinate system, time, etc.)
• Convert GML data to Peripleo RDF
• Translate coordinate system if necessary
• Use Peripleo to search for and visualize regions and filter by time
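A first step of such a conversion is reading the coordinates out of the GML. The sketch below uses only the standard library and assumes GML 3.2 geometry with two-dimensional `gml:posList` elements; the sample polygon is invented, and neither the Peripleo RDF output nor the coordinate system translation is shown.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: the data uses the GML 3.2 namespace; older GML uses a different URI.
GML_NS = {"gml": "http://www.opengis.net/gml/3.2"}

# Invented sample geometry, for illustration only.
SAMPLE_GML = """
<gml:Polygon xmlns:gml="http://www.opengis.net/gml/3.2" srsName="EPSG:4326">
  <gml:exterior>
    <gml:LinearRing>
      <gml:posList>37.8 -4.9 37.8 -4.7 38.0 -4.7 38.0 -4.9 37.8 -4.9</gml:posList>
    </gml:LinearRing>
  </gml:exterior>
</gml:Polygon>
"""


def read_positions(gml_text):
    """Return the coordinate pairs of every gml:posList (2D coordinates assumed)."""
    root = ET.fromstring(gml_text.strip())
    pairs = []
    for pos_list in root.iterfind(".//gml:posList", GML_NS):
        values = [float(v) for v in pos_list.text.split()]
        pairs.extend(zip(values[0::2], values[1::2]))
    return pairs


def bounding_box(pairs):
    """Return (min_first, min_second, max_first, max_second) over all pairs."""
    firsts, seconds = zip(*pairs)
    return min(firsts), min(seconds), max(firsts), max(seconds)


print(bounding_box(read_positions(SAMPLE_GML)))
```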
Geographical/timeline search
Peripleo - PELAGIOS Project
Text classification using scikit-learn
• Prepare data to train SVM classifier
• Dump full-texts of the repository into re-usable packages
• Apply text classification and update Solr records accordingly
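A minimal sketch of the training and classification steps with scikit-learn, assuming TF-IDF features and a linear SVM (the slide names only "SVM", so this particular combination is an assumption). The training texts and the labels `finance` and `maps` are invented, and the Solr update is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented training texts and labels, for illustration only.
train_texts = [
    "invoice payment tax budget",
    "annual budget report and tax statement",
    "city map with river and road network",
    "cadastral map of parcels and roads",
]
train_labels = ["finance", "finance", "maps", "maps"]

# TF-IDF features feeding a linear support vector classifier.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
classifier.fit(train_texts, train_labels)


def classify(texts):
    """Predict one label per full-text."""
    return list(classifier.predict(texts))


print(classify(["tax invoice for the budget year", "road map of the city"]))
```

In a repository setting the predicted label would be written back to the corresponding index record as an additional field, so that it becomes available for search and faceting.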
Database archiving, rebuilding and analysis
source: wikipedia
SIARD
RDBMS data (up to 80TB)
e.g. Postgres, e.g. Oracle
Submit ... Archive ... Reconstruct ... Analyse
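The steps named on this slide can be illustrated in miniature. The sketch below does not produce or read real SIARD files and uses neither Postgres nor Oracle; under that stated simplification it dumps a table from one in-memory SQLite database into a DBMS-neutral JSON structure and rebuilds it in a second one. Table and column names are hypothetical.

```python
import json
import sqlite3


def export_table(conn, table):
    """Dump column names and rows of one table into a DBMS-neutral JSON string."""
    cur = conn.execute(f"SELECT * FROM {table}")  # table name is trusted in this sketch
    columns = [d[0] for d in cur.description]
    return json.dumps({"table": table, "columns": columns, "rows": cur.fetchall()})


def rebuild_table(conn, archive_text):
    """Recreate the archived table (with untyped columns) and load its rows."""
    archive = json.loads(archive_text)
    cols = ", ".join(archive["columns"])
    marks = ", ".join("?" for _ in archive["columns"])
    conn.execute(f"CREATE TABLE {archive['table']} ({cols})")
    conn.executemany(f"INSERT INTO {archive['table']} VALUES ({marks})", archive["rows"])
    return conn


# Source database with a small hypothetical table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE participant (id INTEGER, name TEXT, city TEXT)")
source.executemany("INSERT INTO participant VALUES (?, ?, ?)",
                   [(1, "John", "London"), (2, "Linda", "Palme")])

archived = export_table(source, "participant")                 # submit / archive
target = rebuild_table(sqlite3.connect(":memory:"), archived)  # reconstruct
# Analyse: run a query against the rebuilt database.
print(target.execute("SELECT COUNT(*) FROM participant").fetchone()[0])
```

A real archival format also has to carry column types, keys and other schema information, which this sketch deliberately drops.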