A Case Study on Content Curation for Improving the Effectiveness of Research Reports
2018.12.04
20th GreyNet International Conference
Seokjong Lim
Content Curation Center, [email protected]
CONTENTS
I. Content Curation?
II. The KISTI Curation Model
III. The KISTI Curation Cases
I. Content Curation
1.1 Definition of Content Curation (1/3)
Digital Curation?
A set of activities to systematically store data sets for current and future users, to encourage the reuse of content, and to valorize produced content.
- Examples: research results, research data, public (government) data, and cultural heritage data.
[Source] Digital Curation Centre. What is digital curation?
Data Curation?
A set of active and ongoing data management actions that keep data useful for science and education throughout its lifecycle.
- Activities: data discovery and search, quality management, valorization, and reuse.
- Related areas: authentication, archiving, management, preservation, retrieval, and representation.
[Source] CLIR. What is data curation? https://www.clir.org/initiatives-partnerships/data-curation/
• A set of activities that systematically collect science and technology content according to a standardized protocol and build a database from it to promote reuse. It valorizes results to strengthen the research impact of Korean researchers.
1.1 Definition of Content Curation (2/3)
Starts with a new concept, such as a specific subject or interest.
Creates a new category according to the defined concept.
Results in new, constructive content.
[Figure] Content Curation: User; Discovery; Search; Reuse; Quality Maintenance; Give Value; Sustainable data management
1.1 Definition of Content Curation (3/3)
1.2. The Needs for Curation
General need for curation
The amount of data is increasing exponentially due to intense research competition, the growing number of researchers, and the rapid advancement of information technology.
The range, forms, and amount of data content vary widely. Thus, it is necessary to develop new methods to scientifically and systematically collect, manage, and store content with its future reuse in mind. - Digital Curation Centre. What is digital curation? http://www.dcc.ac.uk/digital-curation/what-digital-curation
“Providing optimized content for users in the age of information overload.” - Duyeong Heo. Contents Curation, 2016.
It is necessary to establish a long-term plan for data file extension, system operation, and media conversion to ensure the continued use of data despite the rapid advancement of technologies. - Ross Harvey. Digital curation: A how-to-do-it manual, 2010.
It is necessary to prepare for the use and reuse of data by current and future users.
1.2. The Needs for Curation
KISTI’s need for curation
A content curation model that reflects KISTI’s mission is required to research, develop, and establish a service framework for the knowledge information infrastructure of science and technology.
A content management policy for the age of big data is necessary to properly collect,
analyze, and provide science and technology information that meets the needs of
researchers.
It is important to continue to develop relevant technologies and policies for science
and technology information, as well as to standardize its management and distribution,
in order to support the development of science and technology and key industries of
Korea.
A content curation model is required to establish a national high-value-added information infrastructure.
II. The KISTI Curation Model
2.1 Methods
Literature analysis
Literature review on curation models and analysis of related internal KISTI
manuals and documents.
In-depth interviews with KISTI staff
Focused on 1:1 interviews
Conducting in-depth interviews with field staff to identify the current agenda of KISTI content establishment and the switch to the content curation system.
Benchmark analysis
Benchmarking outstanding curation lifecycle models in other countries, including DCC, DCC&U, and UC3.
KISTI Curation Model testing
KISTI Curation Lifecycle Model testing in consultation with the Digital Curation Centre (a globally renowned British research institute specializing in digital curation) and Korean digital curation experts.
2.2 Benchmark
[Figure] The range of content curation (Content Curation Center): Creation; Acquisition; Database; Service
2.3 The Range of Content Curation
2.4 The KISTI Curation Lifecycle Model (Draft)
* This tentative model is currently under development and will be finalized in consultation with the Digital Curation Centre (UK) and Korean experts (estimated to be completed by November 2018).
Emphasizes task-level performance by focusing on the roles of the department(s) responsible for curation, based on the hierarchical actions of the organization.
A model demonstrating the main curation actions performed by KISTI
*SA : Semantic Description
[Figure] KISTI Curation Lifecycle Model (Draft)
Stages shown: Conceptualize; Create/Receive; Appraise/Select; Semantic Description (SA); Ingest; Dispose
Content description (semantic description) actions:
1. Individual identification
2. Organization identification
3. Term identification
4. Subject sorting
5. Reference item identification
6. Metadata extraction
7. Non-textual (table and figure) extraction
8. Original content conversion (PDF -> XML)
9. Original content conversion (PDF -> PDF/A)
10. Research and data connection (DLI)
11. Individual name recognition
12. Personal information removal
13. DOI registration
14. Funding information connection
15. Similarity check
Annotations: connecting ISNI, ORCID, and KRI (scientist and technician registration number); automatic identification; connecting papers with NTIS; automatic sorting; automatic extraction.
System labels: KnoBaS (~2017); developed in 2018; S&T (~2017, food area); NRMS (~2017); NRMS (purchased); (2017~) KDCRPMSNRMS.
2.5 Korean Paper Collection (K-Paper) (4/9)
Sequential Actions: SA
[Figure] K-Paper semantic description (SA), developed in 2018
Items: author identification; organization identification; literature identification; funding information; subject classification code; citations; tables and figures; paper registration information; research publication information; original URL; statistics; DOI; DLI; license information; abstract
Also shown: SCOPUS; KISTI; CrossRef
*SA : Semantic Description
2.6 Korean Paper Collection (K-Paper) (5/9)
Sequential Actions: SA
2.7 Korean Paper Collection (K-Paper) (6/9)
Occasional Actions: OA
*OA : Dispose
[Figure] KISTI Curation Lifecycle Model (occasional actions view)
Stages shown: Conceptualize; Create/Receive; Appraise/Select; Semantic Description (SA); Ingest; Dispose
Semantic description actions:
1. Individual identification
2. Organization identification
3. Term identification
4. Subject classification
5. Reference item identification
6. Metadata extraction
7. Non-text (tables and figures) extraction
8. Original document conversion (PDF -> XML)
9. Original document conversion (PDF -> PDF/A)
10. Research data connection (DLI)
11. Entity name recognition
12. Erasing personal information
13. DOI registration
14. Funding information
15. Similarity check
Annotations: linking of ISNI, ORCID & KRI; reference identification; connecting papers with NTIS; subject classification; metadata extraction.
System labels: KnoBaS (~2017); developed in 2018; S&T (~2017, food area); NRMS (~2017); NRMS (purchased); (2017~) KDCRPMSNRMS.
FA: Enhancing content curation rates
Full lifecycle actions: improvement
Results planning
◦ Setting a target: KISTI paper content curation (detection rate) of 30%
Improving results
◦ Monitoring and managing the paper content curation (detection rate) target
Measuring results
◦ Measuring and checking the content curation (detection rate) target
2.8 Korean Paper Collection (K-Paper) (7/9)
Full lifecycle Actions: FA
III. The KISTI Curation Cases
3.1 Development of Elementary Technology for Curation (1/6)
Automatic metadata extraction
3.2 Development of Elementary Technology for Curation (2/6)
Automatic metadata extraction
Applying rule-based and machine-learning-based automatic metadata extraction technology
[Figure] Automatic metadata extraction pipeline. Components shown: paper in PDF; PDF structure analysis; rule-based metadata extraction; neural network input data conversion; neural network input (CoNLL); metadata extraction through neural network; metadata tagging; output results conversion; metadata (JATS-XML); XML for inspection; metadata DB.
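As a rough illustration of the rule-based half of this pipeline, the sketch below pulls a title, a DOI, and e-mail addresses out of first-page text with regular expressions, then writes the tokens in a CoNLL-style one-token-per-line layout of the kind a neural tagger could take as input. The function names, the patterns, and the "first non-empty line is the title" rule are assumptions made for illustration; they are not KISTI's actual rules.

```python
import re

# Hypothetical patterns; a real rule set would be far more extensive.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_metadata(first_page_text):
    """Rule-based extraction from the plain text of a paper's first page."""
    lines = [ln.strip() for ln in first_page_text.splitlines() if ln.strip()]
    doi = DOI_RE.search(first_page_text)
    return {
        # Assumed rule: the first non-empty line is the title.
        "title": lines[0] if lines else None,
        # Strip sentence punctuation that the pattern may have swallowed.
        "doi": doi.group(0).rstrip(".,;") if doi else None,
        "emails": EMAIL_RE.findall(first_page_text),
    }


def to_conll(text, label="O"):
    """One 'token<TAB>label' row per token, blank row after each text line."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        rows.extend(f"{tok}\t{label}" for tok in tokens)
        rows.append("")  # boundary between lines
    return "\n".join(rows)


sample = """A Study of Content Curation
Gil-dong Hong (hong@example.org)
doi: 10.1234/abcd.5678.
"""
meta = extract_metadata(sample)
conll = to_conll(sample)
```

In a combined setup, fields that the rules cannot find with confidence would be left to the learned tagger, which is where the CoNLL-style rows come in.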
3.3 Development of Elementary Technology for Curation (3/6)
Automatic subject classification
Purpose: to develop a subject classification technology for academic papers and apply it to the curation service model.
Subject classification method: metadata (title, abstract, and keywords) is used as input. Noun clusters are created and keywords are classified.
Stages of subject classification: PDF -> noun extraction -> noun vectorization -> application of the cluster model -> application of deep learning -> classification
[Figure] Subject classification pipeline. Components shown: PDF paper; text extraction; noun extraction; embedding; vector grouping; subject selection; CoNLL formatting; deep learning multi-encoder (CNN+RNN) based subject classification model; results of subject metadata mapping.
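The stages above (term extraction, vectorization, grouping, classification) can be sketched in a few lines. This is a deliberately simplified stand-in: it uses stopword filtering instead of real noun extraction, bag-of-words counts instead of learned embeddings, and a nearest-centroid rule instead of the CNN+RNN model named on the slide. All names and the tiny training set are hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with", "using"}


def extract_terms(text):
    """Crude stand-in for noun extraction: lowercase words minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def vectorize(text):
    """Bag-of-words vector as a term -> count mapping."""
    return Counter(extract_terms(text))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def build_centroids(labeled_texts):
    """Sum the vectors of the example texts for each subject."""
    centroids = {}
    for subject, text in labeled_texts:
        centroids.setdefault(subject, Counter()).update(vectorize(text))
    return centroids


def classify(text, centroids):
    """Return the subject whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda s: cosine(vec, centroids[s]))


# Made-up training examples for two made-up subject codes.
training = [
    ("biology", "stem cell differentiation and gene expression in tissue"),
    ("biology", "protein folding and cell membrane transport"),
    ("environment", "particulate matter concentration and air pollution monitoring"),
    ("environment", "emission sources of fine dust in urban air"),
]
centroids = build_centroids(training)
predicted = classify("Measuring particulate matter in urban air", centroids)
```

Swapping `vectorize` for an embedding lookup and `classify` for a trained model would keep the same overall shape: text in, subject code out.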
3.4 Development of Elementary Technology for Curation (4/6)
Automatic reference identification
[Figure] Automatic reference identification: reference information -> CoNLL formatting -> Bi-RNN+CRF based automatic reference identifier -> information of predicted results -> extraction results
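To make the CoNLL formatting step concrete, the sketch below tokenizes one reference string and writes it as tab-separated rows (token, a coarse shape feature, BIO label). The tagger itself (Bi-RNN+CRF) is not shown. The feature set, the label names, and the hand-assigned labels are assumptions for illustration only.

```python
import re


def tokenize(reference):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", reference)


def token_shape(token):
    """Coarse shape feature of the kind often supplied alongside each token."""
    if token.isdigit():
        return "YEAR" if len(token) == 4 else "NUM"
    if token[0].isupper():
        return "CAP"
    if token.isalpha():
        return "LOWER"
    return "OTHER"


def to_conll(tokens, labels):
    """Emit 'token<TAB>shape<TAB>label' rows, one token per line."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must be the same length")
    return "\n".join(f"{t}\t{token_shape(t)}\t{l}" for t, l in zip(tokens, labels))


reference = "Harvey, R. (2010). Digital curation."
tokens = tokenize(reference)
# Hand-assigned BIO labels for this single example; training data would
# consist of many references labeled in this way.
labels = ["B-AUTHOR", "I-AUTHOR", "I-AUTHOR", "I-AUTHOR", "O", "B-YEAR", "O", "O",
          "B-TITLE", "I-TITLE", "I-TITLE"]
print(to_conll(tokens, labels))
```

At prediction time the label column would be left out, and the tagger's output would fill it in before the fields are grouped into author, year, and title.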
3.5 Development of Elementary Technology for Curation (5/6)
Generating author/organization identification data
3.6 Development of Elementary Technology for Curation (6/6)
Personal information detector
Detects personal information in electronic documents and removes only parts containing personal information
Initial screen of the automatic personal information detector (client version)
Select electronic document for detection
Personal information detection and view detection results
Removing personal information
Personal information detection and removal report (in EXCEL)
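A minimal sketch of the detect-then-remove flow described on this slide, assuming plain text has already been extracted from the electronic document. Only two kinds of personal information are covered, and the regular expressions are illustrative assumptions; the actual detector's rules are not described in these slides.

```python
import re

# Assumed patterns for illustration; a production detector would cover many
# more kinds of personal information and locale-specific formats.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b"),
}


def detect(text):
    """Return (kind, matched_text, start, end) for every hit, in document order."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((kind, m.group(0), m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])


def remove(text, hits, mask="[REMOVED]"):
    """Replace only the detected spans (assumed non-overlapping); keep the rest."""
    out, cursor = [], 0
    for _kind, _value, start, end in hits:
        out.append(text[cursor:start])
        out.append(mask)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


document = "Contact: hong@example.org, tel. 042-123-4567. Budget: 2018-12-04."
hits = detect(document)
cleaned = remove(document, hits)
# Rows for a detection/removal report (could be written out as a spreadsheet).
report = [(kind, value) for kind, value, _s, _e in hits]
```

Keeping detection and removal as separate steps mirrors the workflow on the slide: the hits can be reviewed first, and only then applied to the document.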
3.7 National R&D Reports (1/8)
Specialized search
Searches the full text of original reports and returns the required information.
Non-textual search
Searches entire reports and returns non-textual results (such as tables and images).
Reference search
Searches the complete references of an original report.
Keyword summary graph
Analyzes the keywords of the target digital report and shows the entire content of the original report in the form of a graph.
Automatic personal information detector
Detects and removes personal information contained in digital reports.
Advanced search result (Keyword: stem cell)
3.7 National R&D Reports (2/8)
Advanced search service for original reports [Search > general search, advanced search]
Advanced search results
Search sections (title, abstract, background, introduction, discussion, and conclusion)
Search keywords
Downloading individual search result reports
A comprehensive report of the search results
Detailed results of advanced search
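Section-restricted search of the kind shown on this slide can be sketched as follows, assuming each report has already been split into named sections. The record layout, the registration numbers, and the sample text are all hypothetical; the real service's index and ranking are not described here.

```python
SECTIONS = ("title", "abstract", "background", "introduction", "discussion", "conclusion")


def search_reports(reports, keyword, sections=SECTIONS):
    """Return (report_id, section) pairs whose section text contains the keyword."""
    needle = keyword.lower()
    results = []
    for report_id, body in reports.items():
        for section in sections:
            # Sections missing from a report are treated as empty text.
            if needle in body.get(section, "").lower():
                results.append((report_id, section))
    return results


# Hypothetical report records keyed by a made-up registration number.
reports = {
    "R-0001": {
        "title": "Stem cell therapy for cartilage repair",
        "abstract": "We review stem cell sources and delivery methods.",
        "conclusion": "Further clinical trials are needed.",
    },
    "R-0002": {
        "title": "Urban air quality monitoring",
        "abstract": "Particulate matter levels were measured at 12 sites.",
        "discussion": "Stem cell assays were out of scope.",
    },
}

everywhere = search_reports(reports, "stem cell")
titles_only = search_reports(reports, "stem cell", sections=("title",))
```

Restricting the `sections` argument is what separates a hit in a title from a passing mention in a discussion section.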
3.7 National R&D Reports (3/8)
Non-textual search in the original text of reports [Search > Non-textual search]
Non-textual search result (keyword: particulate matter)
Viewing the original report containing images
Integrated download function for chosen images
3.7 National R&D Reports (4/8)
References in original reports [Search > Reference search]
Reference search result (keyword: particulate matter)
Selecting reference types
A report registration number is required to search individual reports
How to use the keyword summary graph
3.7 National R&D Reports (5/8)
Report content analysis and keyword summary [Search > Advanced search]
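One plausible reading of a keyword summary graph is a set of frequent keywords (nodes) linked by how often they appear in the same sentence (weighted edges). The sketch below builds that structure from plain text. The stopword list, the sentence splitting, and the co-occurrence rule are assumptions; the slides do not state how the actual graph is computed.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on", "with"}


def sentence_terms(text):
    """Yield the set of non-stopword terms for each sentence."""
    for sentence in re.split(r"[.!?]+", text.lower()):
        terms = {w for w in re.findall(r"[a-z]+", sentence) if w not in STOPWORDS}
        if terms:
            yield terms


def keyword_graph(text, top_n=5):
    """Return (top keywords with counts, weighted co-occurrence edges among them)."""
    sentences = list(sentence_terms(text))
    # Count the number of sentences in which each term appears.
    counts = Counter(term for terms in sentences for term in terms)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    nodes = {term for term, _ in ranked}
    edges = Counter()
    for terms in sentences:
        present = sorted(t for t in terms if t in nodes)
        edges.update(combinations(present, 2))
    return ranked, dict(edges)


text = (
    "Particulate matter affects air quality. "
    "Air quality sensors measure particulate matter. "
    "Sensors report air quality daily."
)
keywords, edges = keyword_graph(text, top_n=3)
```

The returned pair can be handed to any graph-drawing tool, with node size taken from the keyword count and edge thickness from the co-occurrence weight.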
3.7 National R&D Reports (6/8)
① Run the automatic personal information detector and click ‘detect personal information’
② Click ‘search’, then select and upload the electronic document
③ Click ‘detect personal information’ and then ‘remove personal information’ to remove personal information
④ Download the file with personal information removed and check the removal result
Detect personal information
Remove personal information
Search
Personal information removal report
Download individual personal information removal files
Screenshot of an electronic document with personal information removed
3.7 National R&D Reports (7/8)
Digitalized original reports are available to the general public on NDSL.
Digitalized reports available on NDSL
3.7 National R&D Reports (8/8)
Applied functions are available, including report content search, text search, and image search in connection with project information.
Digitalized reports available on NTIS
Original content available through the connection to NDSL
Non-textual search function
Advanced search function