JATS-CON 2012
October 16, 2012
Faye Krawitz, Jennifer McAndrews, Richard O’Keeffe
Content Technology Group, AIP
How Well Do You Know Your Data? Converting an Archive of Proprietary Markup Schemes to JATS: A Case Study
AIP at a glance
• Founded in 1931
• Umbrella organization for 10 physical science societies. Combined membership totals 165,500 scientists, engineers and educators (with some overlap)
• One of the world's largest non-profit publishers of scientific information in physics
• Home of the Physics Resources Center
• Publish 24+ AIP, member, partner journals/magazines, three of which are co-published with other organizations, and one conference proceedings series
• Mission: To inspire every Physical and Applied Scientist in the world to turn to AIP for the information and help that they need
The AIP Content Ecosystem
• The AIP Content Collection: 800,000 SGML/XML records encoded in
  – AIP ISO 12083 “header” SGML DTD (1995-present)
  – AIP ISO 12083 “full-text” SGML DTD (1995-2005)
  – AIP “ISO-12083-informed” full-text XML DTD (2005-present)
• How was it used?
  – XML the source for print/online PDFs
  – The source for HTML rendered on the AIP online platform
And it worked well… but the times they were a changing
What’s the problem… Why change?
• AIP-centric!
  – XML overly specialized for specific AIP products
  – Required proprietary systems and support
  – Too many intermediary data transformations
  – Limited the adoption of new technology and standards
  – Too costly to maintain
  – Not the XML format of choice for data recipients
Redefining AIP’s future content strategy: If you could have anything you want…
• Recognition that the intellectual property is the premium asset
  – Mark up the data to maximize its value and enrichment potential
• Keep current with industry standards
  – Better meet client expectations!
• Plan for success
  – Streamlined production workflow
  – Reorganize units to execute a unified content strategy
  – It is not enough to realize the need to change; you must follow through and execute
C’mon… everybody does it!
• Standardization 1: adopt industry standard XML
  – Eliminate multiple formats and associated transformations
  – Enhanced data portability
• Standardization 2: adopt XML technologies such as XSLT and Schematron
  – Minimize dependence on specialized applications and skill sets
• Speak the same language as the STM Community
(Not so) Big Surprise!
Journal and Archiving Interchange Tag Set
JATS
XSLT
Schematron
Build for Success: Communication
• Make the plan known
  – Keep everyone informed and updated
• Get “buy-in”
  – Ensure the whole organization understands the change in approach
  – Ensure the whole organization understands the end goal
  – Ensure the staff understands the important role they play in the success
Build for Success: Ownership
• Organize to succeed
  – Rethink and deploy an organization that most effectively achieves the goal
• For AIP this meant…
  – Create a unified team following the overall strategy
  – Foster a definitive sense of ownership for the content as the “intellectual asset”
  – Develop a clear chain of content responsibility
  – Designate formal content “gatekeepers”
Build for Success: Infrastructure
• Invest in an up-to-date content management system
  – Efficiently manage content, not have the product(s) manage the systems
  – Avoid unneeded workflow duplication
  – Avoid unwanted “end-around” content manipulation
  – Extensibility to adapt to future needs
  – Excellent versioning capabilities
  – Effective reporting tools
Now What? Transform Decisions
• Use XSLT
• Create “mapping specification” for the following:
  – Transform AIP ISO 12083 “header” SGML DTD
  – Transform AIP “ISO-12083-informed” full-text XML DTD
  – On hold: AIP ISO 12083 “full-text” SGML DTD
• Test and adapt based on results
• Quality Control including Schematron
• Document
• Train staff and production partners
The Process
• Document Analysis
• Helpful aids
  – Existing documentation
  – Institutional memory
• Devise tagging principles
• Correct known ambiguities
Document Analysis
• Identify:
  – Consistencies
  – Inconsistencies
  – Surprises
• Evaluate tagging requirements
• Create
  – Document Map (or “specification”)
  – Sample XML files as needed
Devised Tagging Principles
• Strictly delineated element v. attribute
• Defined AIP-specific usage of JATS
• Treated <article-meta> as database-like
• Avoided customized content models; reserved for later use
• Reserved <x> markup for future use; use at transform as debugging tool
• Reserved <named-content> for semantic enrichment markup
Creating the Document Map
Tagging Principles x (Existing documentation + Institutional Memory) = JATS
Resulting Map (“spec”)
ELEMENT: metanote, metanote/edcode
AIP TAGGING:
<metanote> Contributed by the Bioengineering Division of ASME for publication in the J<emph type="smallcap">OURNAL OF</emph> B<emph type="smallcap">IOMECHANICAL</emph> E<emph type="smallcap">NGINEERING</emph>. Manuscript received July 20, 2009; final manuscript received February 18, 2010; accepted manuscript posted March 1, 2010; published online June 18, 2010. Assoc. Editor: <techeditor status="associate">Ellen M. Arruda</techeditor>.</metanote>....</metanote
JATS:
</article-meta><notes notes-type="metadata-note"><p>Contributed by the Bioengineering Division of ASME for publication in the J<sc>OURNAL OF</sc> B<sc>IOMECHANICAL</sc> E<sc>NGINEERING</sc>. Manuscript received July 20, 2009; final manuscript received February 18, 2010; accepted manuscript posted March 1, 2010; published online June 18, 2010. Assoc. Editor: J. Shah.</p></notes>
ACTION:
1. Convert as <notes> with @notes-type="metadata-note"
2. <notes> tag is placed after </article-meta>
3. Suppress tag, keep contents of: metanote/edcode, metanote/symposium, metanote/contribgrp
4. UPDATE: 02/21 – wrap contents in <p> - this will not be in the source.
Info: Okay tags below are suppressed: meta-received | meta-accepted | meta-revised | meta-presented | meta-submit | meta-published | meta-posted.
***N/A Future JATS***
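The mapping actions above can be illustrated with a minimal sketch. It is written in Python with the standard library rather than in the XSLT the team actually used, and the shortened input fragment and the function names convert_metanote and _append_text are hypothetical. It covers actions 1, 3 and 4 plus the <emph type="smallcap"> to <sc> mapping visible in the example; placing <notes> after </article-meta> (action 2) is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened source fragment in the AIP tagging shown above.
SOURCE = (
    '<metanote>Contributed to the J<emph type="smallcap">OURNAL</emph>. '
    'Assoc. Editor: <techeditor status="associate">Ellen M. Arruda</techeditor>.'
    '</metanote>'
)


def _append_text(parent, text):
    """Append text after the last child of parent, or to parent.text if it has none."""
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def convert_metanote(metanote):
    """Build a JATS <notes> element from an AIP <metanote> element."""
    # Action 1: convert as <notes> with @notes-type="metadata-note".
    notes = ET.Element("notes", {"notes-type": "metadata-note"})
    # Action 4: wrap the contents in <p>, which is not present in the source.
    p = ET.SubElement(notes, "p")
    p.text = metanote.text
    for child in metanote:
        if child.tag == "emph" and child.get("type") == "smallcap":
            # Small-cap emphasis becomes <sc>, as in the example output.
            sc = ET.SubElement(p, "sc")
            sc.text = child.text
            sc.tail = child.tail
        else:
            # Action 3 (simplified): suppress the tag, keep its text content.
            _append_text(p, "".join(child.itertext()) + (child.tail or ""))
    return notes


result = ET.tostring(convert_metanote(ET.fromstring(SOURCE)), encoding="unicode")
print(result)
```

In this sketch every child other than small-cap emphasis is flattened to text, which is broader than the three suppressed tags named in the map; a production transform would list them explicitly.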
Corrected Known Ambiguities
Before      After
<extra1>    <suffix>
<extra2>    <role>
<extra3>    <degree>
Expected Trouble Spots
• Generated text
• Style variation issues
• Multi-purpose tags
• Multimedia
• Time
Generated Text
The ability to take a tag like <ack> and output the title “ACKNOWLEDGMENTS” is the closest thing we have to magic.
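One possible way to deal with generated text during conversion is to write the title explicitly into the output, so that it no longer depends on the rendering system. The sketch below assumes that approach; the slides do not say whether AIP did this. The lookup table GENERATED_TITLES and the function add_generated_titles are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup of headings that the old rendering system generated from a tag.
GENERATED_TITLES = {"ack": "ACKNOWLEDGMENTS"}


def add_generated_titles(root):
    """Insert an explicit <title> into elements whose heading used to be generated."""
    # Iterate over a snapshot because the tree is modified inside the loop.
    for elem in list(root.iter()):
        wanted = GENERATED_TITLES.get(elem.tag)
        if wanted is not None and elem.find("title") is None:
            title = ET.Element("title")
            title.text = wanted
            elem.insert(0, title)
    return root


doc = ET.fromstring("<back><ack><p>We thank the staff.</p></ack></back>")
print(ET.tostring(add_generated_titles(doc), encoding="unicode"))
```

Because the function skips elements that already carry a <title>, running it twice on the same file does not duplicate the heading.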
Style Variation Issues
INTRODUCTION
INTRODUCTION
I. INTRODUCTION
1. Introduction
Introduction
Multipurpose tags
Three distinct rules for handling one SGML element, all within References:
1. When <othinfo> is a sibling of <refitem>:
   a. Remove the <othinfo> tag, retain PCDATA
   b. Retain content/punctuation and trailing space
   c. MOVE retained PCDATA to before </mixed-citation> of preceding <mixed-citation>
2. When back/citation/ref/othinfo: strip <othinfo>, retain PCDATA
3. NOTE: nesting of <othinfo> requires:
   <citation id="r#"><ref><biother><othinfo>…<othinfo><dformula>
   <ref><label>#. </label><note><p>….<disp-formula>…
Multimedia
1. <epaps>See supplementary material at <url href="http://dx.doi.org/10.1063/1.3475476">http://dx.doi.org/10.1063/1.3475476</url> <epapsid display="no" type="multimedia">E-JAPIAU-108-032016</epapsid> for essential multimedia.</epaps>
2. <media id="v1" status="essential"><media-object doi="10.1063/1.3674301.1" file-name="006029jcpv1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref>
3. <media id="v1" status="essential"><media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref>
4. <media id="v1" status="essential"><media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref></media-object></media> <media id="v2" status="essential"><media-object doi="10.1063/1.3674301.2" file-name="v2.mpg" id="mm2" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v2" show-link="yes"></mediaref></media-object></media> <media id="v3" status="essential"><media-object doi="10.1063/1.3674301.3" file-name="v3.mpg" id="mm3" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v3" show-link="yes"></mediaref></media-object></media>.
Time
Unexpected Trouble Spots: Language
Language
Deceptively simple example:
• Before: pacs
• After: front/spin/docanal/pacs
Unexpected Trouble Spots: Nasty Surprises
Nasty Surprises
Expected tagging:
<p content-type="leadpara">Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p>
Displays online as:
Lead Paragraph
Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system
Actual tagging:
<p>Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p>
No online display
QUALITY CONTROL AND TESTING
• Prerequisite training
• Content and tagging checks
• Incorporating Schematron
• Online displays
QUALITY CONTROL AND TESTING
Prerequisite
• Staff Training
  – NLM/JATS DTD
  – XPATH
  – XSLT
  – Schematron
QUALITY CONTROL AND TESTING
Content and tagging checks
Step 1 – Preliminary Testing:
• Performed while XSLT was in progress
• Analyst checked completed blocks of XSLT code and confirmed programmers' understanding of instructions
• Daily meetings held to discuss new findings or clarifications of instructions
Trouble spot detected: specification document needed to be rewritten using XPATH terminology.
Step 2 – Batch Processing
• Performed when XSLT was complete
• Converted and parsed approximately 200 files
• Investigated hidden problems and determined whether an XSLT modification or a manual fix was the best course of action
Step 3 – Group Testing
• Performed when converted files were valid
• Ran approximately 200 files from various journals with assorted article types
• Entire group checked same sample of files
• Checked for dropped text
• Ran Schematron
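A dropped-text check like the one in Step 3 could be automated along the following lines. This is only a sketch under stated assumptions: both files are well-formed XML (the SGML sources would need an SGML-aware parser first), and the function names words and dropped_words are hypothetical. It compares word counts across all text nodes, so text that merely moved is not reported, while text that was lost is.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def words(xml_string):
    """Count whitespace-separated words across all text nodes of a document."""
    root = ET.fromstring(xml_string)
    return Counter(re.findall(r"\S+", "".join(root.itertext())))


def dropped_words(source_xml, converted_xml):
    """Return words that occur more often in the source than in the conversion."""
    # Counter subtraction keeps only positive differences.
    return words(source_xml) - words(converted_xml)


source = "<metanote>Manuscript received <emph>July 20</emph> 2009</metanote>"
converted = "<notes><p>Manuscript received July 20</p></notes>"
print(dropped_words(source, converted))  # Counter({'2009': 1})
```

Text that the conversion adds on purpose, such as generated titles, does not trigger a report, because only words missing from the converted file are counted.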
Step 4 – Bulk Processing
• Performed when all files were approved from the group testing
• Entire corpus of content run, with remaining errors resulting from bad source outliers
• XSLT transformed with an accuracy rate of over 99%; with 800,000 records, that still left a large number of files to be inspected (1% of 800,000 is 8,000)
• Where applicable, source or XSLT was fixed and files rerun
Step 5 – Final Cleanup
• Analyze flagged data
• Investigated tags mapped in the XSLT to <x> or <strike> because the source tags had known problems
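Finding the flagged data for this cleanup step can be as simple as listing every <x> and <strike> element in a converted file. The sketch below assumes exactly that; the function name flagged_elements and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements the transform used as flags for source tags with known problems.
FLAG_TAGS = ("x", "strike")


def flagged_elements(xml_string):
    """Return (tag, text) pairs for every flag element, in document order."""
    root = ET.fromstring(xml_string)
    return [
        (elem.tag, "".join(elem.itertext()))
        for elem in root.iter()
        if elem.tag in FLAG_TAGS
    ]


sample = (
    "<article><body><p>Text <x>??</x> and "
    "<strike>old</strike> more.</p></body></article>"
)
print(flagged_elements(sample))  # [('x', '??'), ('strike', 'old')]
```

Run over the whole corpus, a report like this gives an analyst a worklist of files and flagged strings to review.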
QUALITY CONTROL AND TESTING
Incorporating Schematron
• Central piece in our QC process, derived from our pre-existing proprietary QC programs
• List of checks or assertions written in XPATH language
• Tracks ERRORS and WARNINGS specific to our data
• Done in parallel while XSLT was being written
JATS MARKUP with SCHEMATRON ERROR DETECTED
<kwd-group kwd-group-type="pacs-codes">
  <compound-kwd>
    <compound-kwd-part content-type="code">8440-x</compound-kwd-part>
    <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
    <compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
    <compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
  </compound-kwd>
</kwd-group>
JATS MARKUP CORRECTED
<kwd-group kwd-group-type="pacs-codes">
  <compound-kwd>
    <compound-kwd-part content-type="code">8440-x</compound-kwd-part>
    <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
  </compound-kwd>
  <compound-kwd>
    <compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
    <compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
  </compound-kwd>
</kwd-group>
SCHEMATRON RULE
<rule id="ERROR_COMPOUND_KEYWORD" context="compound-kwd">
  <assert role="ERROR_COMPOUND_KEYWORD" test="count(compound-kwd-part) = 2">[ERROR] A compound-kwd must have two compound-kwd-part tags</assert>
</rule>
<rule id="ERROR_COMPOUND_KEYWORD_PART" context="compound-kwd-part">
  <assert role="ERROR_COMPOUND_KEYWORD_PART" test="@content-type='code' or @content-type='value'">[ERROR] Invalid @content-type used for compound-kwd-part - allowable values are: code and value</assert>
</rule>
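To make the logic of the two assertions concrete, the sketch below re-expresses them in Python with the standard library and runs them against the erroneous and corrected keyword markup. It is an illustration of what the rules test, not a Schematron engine and not how the team ran its checks; the function name check_compound_keywords and the sample variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

ERROR_SAMPLE = """<kwd-group kwd-group-type="pacs-codes"><compound-kwd>
<compound-kwd-part content-type="code">8440-x</compound-kwd-part>
<compound-kwd-part content-type="value">Radiowave</compound-kwd-part>
<compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
<compound-kwd-part content-type="value">Antennas</compound-kwd-part>
</compound-kwd></kwd-group>"""

CORRECTED_SAMPLE = """<kwd-group kwd-group-type="pacs-codes"><compound-kwd>
<compound-kwd-part content-type="code">8440-x</compound-kwd-part>
<compound-kwd-part content-type="value">Radiowave</compound-kwd-part>
</compound-kwd><compound-kwd>
<compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
<compound-kwd-part content-type="value">Antennas</compound-kwd-part>
</compound-kwd></kwd-group>"""


def check_compound_keywords(xml_string):
    """Apply the two assertions from the Schematron rules; return error messages."""
    root = ET.fromstring(xml_string)
    errors = []
    # Equivalent of test="count(compound-kwd-part) = 2" (direct children only).
    for compound in root.iter("compound-kwd"):
        if len(compound.findall("compound-kwd-part")) != 2:
            errors.append(
                "[ERROR] A compound-kwd must have two compound-kwd-part tags"
            )
    # Equivalent of test="@content-type='code' or @content-type='value'".
    for part in root.iter("compound-kwd-part"):
        if part.get("content-type") not in ("code", "value"):
            errors.append(
                "[ERROR] Invalid @content-type used for compound-kwd-part"
                " - allowable values are: code and value"
            )
    return errors


print(check_compound_keywords(ERROR_SAMPLE))      # one error reported
print(check_compound_keywords(CORRECTED_SAMPLE))  # []
```

The first sample fails only the count assertion, since all four parts carry an allowed content-type; splitting them into two compound-kwd elements clears the error.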
QUALITY CONTROL AND TESTING
Online Displays
• Assumptions at this point are: files are valid and Schematron runs clean
• Testing was expanded to online publishing group and random testers throughout organization
• Errors were found at this point that are more apparent when viewed online
• Great way to confirm that business rules are being followed
LESSONS LEARNED & GENERAL CONCLUSIONS
• Don’t go it alone: follow industry best practices and standards
• Set yourself up for success
• It is impossible to overstate the importance of document analysis
• Use analysis as an opportunity to correct known ambiguities
• Recognize difference between bad and incorrect data
• Create a detailed document map
• XPATH training is valuable
• Use Schematron as a central piece to QC process
• Work as a team
We chose to use pre-existing JATS DTD elements and avoid any JATS module customization. The stock NISO JATS was more than sufficient to accommodate AIP’s tagging needs. We were able to apply our tagging principles and remain true to our business rules.
We have achieved the XML quality we were aiming for.
Questions?