PDF Association Technical Conference, June 18-19, 2013
PDF and Microsoft SharePoint: Hurdles to Overcome
Neil Pitman, Aquaforest Limited, Version 1.120613
Objective: PDF as a SharePoint “First Class Citizen”
Agenda
Objectives
SharePoint Overview
PDF Capture
PDF Search: iFilters, Handling Image and Mixed-Mode PDFs
PDF Metadata: Dictionary, XMP and Entity Extraction
Configuration: SharePoint 2010, 2013
Summary
SharePoint Overview
What is SharePoint?
On-Premise and Cloud-based Collaboration & Document Management Platform
Origin: 2001
Usage: Focus on MS Office Documents; Typically Distributed Capture
Microsoft SharePoint Server: 125 million licenses sold
SharePoint to be a natural target for PDF storage
SharePoint Editions (2010, 2013): Foundation, Standard, Enterprise
Office 365 / SharePoint Online
Ecosystem: Partner Products, Office / SharePoint Marketplace
SharePoint Architecture Overview
MS Web-based (IIS)
MS Office Integration
SQL Server Storage
List or library data in a site collection is stored in a SQL Server database table, which uses queries, indexes and locks to maintain overall performance, sharing, and accuracy.
Filtered views with column indexes (and other operations) create database queries that identify a subset of columns and rows and return this subset to your computer.
Thresholds and limits help throttle operations and balance resources for many simultaneous users.
Privileged developers can use object model overrides to temporarily increase thresholds and limits for custom applications.
Administrators can specify dedicated time windows for all users to do unlimited operations during off-peak hours.
Information workers can use appropriate views, styles, and page limits to speed up the display of data on the page.
Microsoft Technology Stack: Windows Server 2008/12, Internet Information Services (IIS), .NET Framework, SQL Server, MS Office
PDF Capture for SharePoint
Options: SharePoint UI, Acrobat XI, Load Tools, Custom Code, Workflow & Event Receivers
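// Requires: using System.IO; using System.Net;
// destUrl is the target document URL in the library; fileBytes holds the PDF content.
WebRequest request = WebRequest.Create(destUrl);
request.Credentials = CredentialCache.DefaultCredentials;
request.Method = "PUT";
byte[] buffer = new byte[1024];
using (Stream stream = request.GetRequestStream())
using (MemoryStream ms = new MemoryStream(fileBytes))
{
    // Copy the file to the request stream in 1 KB blocks.
    for (int i = ms.Read(buffer, 0, buffer.Length); i > 0; i = ms.Read(buffer, 0, buffer.Length))
    {
        stream.Write(buffer, 0, i);
    }
}
// GetResponse throws a WebException if the server rejects the upload.
WebResponse response = request.GetResponse();
response.Close();
Logging.Log("Upload successful");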
Acrobat XI SharePoint Integration
http://www.adobe.com/uk/products/acrobat/pdf-version-control-sharepoint-integration.html
PDF Search in SharePoint: Overview
iFilter Architecture
iFilters scan documents for text and attributes – primarily in support of Microsoft Search technologies.
iFilter Configuration
Architecture
Code Sample
Suppliers
Issues
PDF Search in SharePoint: iFilters
iFilter Explorer
Using iFilters directly in Code
StringBuilder Buffer = new StringBuilder();
string PDFFile = @"C:\dev\PDF Conference\s.pdf";
FilterCode f = new FilterCode();
f.GetTextFromDocument(PDFFile, ref Buffer);
Console.WriteLine(Buffer);
public void GetTextFromDocument(string Path, ref StringBuilder Buffer)
{
    IFilter filter = null;
    int hresult;
    IFilterReturnCodes rtn;

    // Initialize the return buffer to 64K.
    Buffer = new StringBuilder(64 * 1024);

    // Try to load the filter for the path given.
    hresult = LoadIFilter(Path, new IntPtr(0), ref filter);
    if (hresult == 0)
    {
        IFILTER_FLAGS uflags;

        // Init the filter provider.
        rtn = filter.Init(
            IFILTER_INIT.IFILTER_INIT_CANON_PARAGRAPHS |
            IFILTER_INIT.IFILTER_INIT_CANON_HYPHENS |
            IFILTER_INIT.IFILTER_INIT_CANON_SPACES |
            IFILTER_INIT.IFILTER_INIT_APPLY_INDEX_ATTRIBUTES |
            IFILTER_INIT.IFILTER_INIT_INDEXING_ONLY,
            0, new IntPtr(0), out uflags);
        if (rtn == IFilterReturnCodes.S_OK)
        {
            STAT_CHUNK statChunk;

            // Outer loop will read chunks from the document at a
[DllImport("query.dll", SetLastError = true, CharSet = CharSet.Unicode)]
static extern int LoadIFilter(string pwcsPath,
[MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
ref IFilter ppIUnk);
https://gist.github.com/jimschubert/1473904
iFilter Test
Bookmark
PDF Attachment
Text
Image/OCR Text
Annotation
XMP Metadata
Dictionary Metadata
iFilter Test Results
Filters compared: Adobe iFilter, PDFlib iFilter, Foxit iFilter, Microsoft Format Handler
Content tested: Body Text, Annotations, Bookmarks, Dictionary Metadata, XMP Metadata *, PDF Attachment
Dealing with Image and Mixed-Mode PDFs
Classify: Image-Only; Born-Digital; Part Image-Only, Part Born-Digital; Previously OCRed
Objectives: Ensure Full Searchability; Avoid Text-to-Image Processing
Process: Capture Time? Scheduled In-Place?
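The classification above can be sketched in code. The following is a minimal illustration, not the method used in the talk: it assumes a hypothetical PageInfo summary (one per page) that some PDF library has already filled in, and it treats "every page has both an image and a text layer" as a rough sign of a previously OCRed file. That heuristic, and all type and member names, are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-page summary; a real PDF library would populate it.
public sealed class PageInfo
{
    public bool HasText { get; set; }   // page carries a text layer
    public bool HasImage { get; set; }  // page contains a raster image
}

public enum PdfKind { BornDigital, ImageOnly, Mixed, PreviouslyOcred }

public static class PdfClassifier
{
    // Classify a document from its per-page summaries.
    public static PdfKind Classify(IList<PageInfo> pages)
    {
        if (pages == null || pages.Count == 0)
            throw new ArgumentException("At least one page is required.", "pages");

        bool anyImageOnly = pages.Any(p => p.HasImage && !p.HasText);
        bool anyText = pages.Any(p => p.HasText);

        // Image-only pages exist: the file is either fully or partly unsearchable.
        if (anyImageOnly)
            return anyText ? PdfKind.Mixed : PdfKind.ImageOnly;

        // Assumption: image plus text on every page suggests an earlier OCR pass.
        bool allImageWithText = pages.All(p => p.HasImage && p.HasText);
        return allImageWithText ? PdfKind.PreviouslyOcred : PdfKind.BornDigital;
    }

    // Zero-based indexes of the pages that still need OCR.
    // Pages that already have text are left alone, so born-digital
    // content is never converted to an image.
    public static List<int> PagesNeedingOcr(IList<PageInfo> pages)
    {
        var result = new List<int>();
        for (int i = 0; i < pages.Count; i++)
        {
            if (pages[i].HasImage && !pages[i].HasText)
                result.Add(i);
        }
        return result;
    }
}
```

Selecting pages rather than whole files serves both objectives: only image-only pages go to OCR, while pages that already carry text are passed through untouched. The same routine could run at capture time or as a scheduled in-place job.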
PDF Metadata in SharePoint
Text Search vs Metadata Search
Crawled vs Managed Properties
Review Requirements: Dictionary Metadata, XMP Metadata, Entity Extraction
Consider Automation
PDF Metadata in SharePoint: Using Event Receivers
Event Receivers can enable Metadata assignment
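One part of such an event receiver is deciding which PDF metadata goes into which column. The sketch below shows only that mapping step, under stated assumptions: the keys (Title, Author, Subject, Keywords) are standard entries of the PDF document information dictionary, while the column names PdfAuthor, PdfSubject and PdfKeywords and the PdfMetadataMapper class are hypothetical. Reading the dictionary out of the PDF is left to whatever PDF library is in use.

```csharp
using System;
using System.Collections.Generic;

public static class PdfMetadataMapper
{
    // PDF information dictionary key -> SharePoint column name.
    // All column names except "Title" are hypothetical.
    private static readonly Dictionary<string, string> ColumnForKey =
        new Dictionary<string, string>(StringComparer.Ordinal)
        {
            { "Title", "Title" },
            { "Author", "PdfAuthor" },
            { "Subject", "PdfSubject" },
            { "Keywords", "PdfKeywords" }
        };

    // Returns column/value pairs for the entries that are mapped and non-empty.
    public static Dictionary<string, string> MapToColumns(IDictionary<string, string> info)
    {
        var result = new Dictionary<string, string>();
        if (info == null)
            return result;

        foreach (KeyValuePair<string, string> pair in info)
        {
            string column;
            bool mapped = ColumnForKey.TryGetValue(pair.Key, out column);
            bool hasValue = pair.Value != null && pair.Value.Trim().Length > 0;
            if (mapped && hasValue)
                result[column] = pair.Value.Trim();
        }
        return result;
    }
}
```

In a receiver derived from SPItemEventReceiver, an ItemAdded override could write each returned pair to properties.ListItem and then call SystemUpdate on the item. That wiring is omitted here because it only compiles against the SharePoint server assemblies.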
Entity Extraction
Configuration
SharePoint 2010
SharePoint 2013
SharePoint 2010 PDF Configuration
Missing icon and iFilter
http://www.adobe.com/devnet-docs/acrobatetk/tools/AdminGuide/Acrobat_Reader_IFilter_configuration.pdf
SharePoint PDF Configuration
Default for PDF: 'X-Download-Options: noopen' is added to the HTTP response header, so the browser offers a download rather than opening the PDF
SharePoint 2013 and PDF Configuration
PDF Format Handler Support
Currently no iFilter Support for PDF!?
Inline PDF Viewing in SharePoint 2013
http://stevemannspath.blogspot.co.uk/2012/10/sharepoint-2013-pdf-preview-in-search.html
http://stevemannspath.blogspot.co.uk/2013/04/sharepoint-2013-pdf-support-and.html
Summary
Microsoft SharePoint Server: 125 million licenses sold
SharePoint to be a natural target for PDF storage
PDF as a SharePoint “First Class Citizen”
Contact : [email protected]