Looking beyond plain text for document representation in the enterprise

May 31st, 2013 First SICSA MMI Information Retrieval Workshop

Arjen P. de Vries

Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.

Outline

• Motivation
• Mixed structured and unstructured sources
• Search by strategy
• Equip
• Open ends

Enterprise Information Needs

Hang Li et al. A new approach to intranet search based on information extraction. CIKM’05

Strategic and business development needs

• What funding schemes are the primary source of income?
  • E.g., can we move to Europe when Dutch funding dries up?
• Who has active relations with partner X?
  • “Valorisation”; new national funding requirements
• What industry sectors do we depend upon?
  • E.g., how many projects in smart cities? Green energy? Cloud computing? Etc.
• How are strategic decisions implemented?
  • E.g., has objective “move from Telecom toward ICT” been achieved, and how does it develop over time?

A week in the life

Date: Wed, 15 May 2013 15:14:49 +0200
From: Theme Coordinator “INFORMATION”
To: Group Leaders Information Theme
Subject: List of company relations for internal CWI distribution

Dear Information Theme Group Leaders,

The theme coordinators have been asked whether they "can make a short list of the company contacts, indicating the nature of those contacts".

Could you send me the names of Dutch companies you are currently working with, or have worked with in the recent past, by the end of Friday 17th May?

The Theme Coordinator

Date: Fri, 24 May 2013 11:33:04 +0200
From: Theme Coordinator Life Sciences
To: Group Leaders Life Sciences Team
Subject: Life Sciences: contacts with NL companies?

Dear all,

The CWI themes are currently collecting all contacts we have with Dutch industry and companies (but also hospitals and TNO etc.) in order to get an overview. I am doing this for the theme "Life Sciences". Can you please send me a list of your contacts with short description?

Life Sciences Theme Coordinator

From: Project Leader Project X
Date: Sun, 26 May 2013 17:34:15 +0200
To: Project X
Subject: [Project X: 33] @WP leaders
X-BeenThere: Project X @ Y.org

Dear WP leaders,

I received the following request from Programme Management:
> May I ask you to send me a short list of the EU research and international research running at the partners that is related to Project X (international embedding).

This is my most urgent item. Could you send me, as soon as possible, a short list covering the following points:
- list of running EU projects in which people from your WP are involved; please indicate who the partners are, the funding source, whether it is a STREP (or NoE or ...), and whether your WP provides a participant or the coordinator;
- list of EU projects applied for, with the same extras;
- list of any other international collaborations that are not covered by a formal project.

Please send me the lists as soon as possible, but no later than Tuesday 18:00. Thanks for your help.

The Project Leader

Surely, academia is not like…

The High Cost of Not Finding Info

If you employ 1000 knowledge workers:
• 50% of content unindexed: $2.5 million/year
• 6.25% of effort is spent reproducing information that already exists: $5 million/year

Knowledge workers spend 15-25% of their time on non-productive information-related activities

Feldman and Sherman. IDC Technical Report #29127, 2003

Butler Group Report: Enterprise Search and Retrieval. Oct-2006
“many organisations are frittering away up to 10% of their staff costs on wasted effort because employees simply can’t find the right information to do their jobs.”

So… “the real world”

“Real” companies (as opposed to academic institutions) attempt to address these information needs a priori, by setting up a Customer Relationship Management system (CRM)

Shan L. Pan and Jae-Nam Lee, "Using e-CRM for a unified view of the customer", Communications of the ACM 46(4) (2003): 95-99

However…

So-called “Professionals” are well known to focus on their own expertise

They do not have (or take) the time to maintain adequate descriptions of their network, skills, projects etc., nor to deal with most other types of “management overhead”

We only need to organize ourselves!!

Funding Proposals

• Proposals submitted (are supposed to) pass by the faculty’s (TUD) “contract managers” or the institute’s (CWI) “project bureau”
  • E.g., checks for liability, IPR and valid budget
• Proposal and (partial) metadata are added to a content management system (CMS)
  • The CMS used at my faculty at TUD is DECOS; a few other faculties plan to use Microsoft SharePoint; CWI deploys BSCW

Step 1

Index all the proposals submitted with your favourite IR system

Incompleteness

• The DECOS metadata entered is usually incomplete from the start
  • For many projects, for example, only the coordinator is entered as partner
• Also, a proposal’s metadata does not reflect subsequent change; e.g., as in PuppyIR:
  • People hired after funding secured
  • Partner change when key person moved job
  • Teams evolved
  • Priorities shifted
  • New tasks introduced and tasks (re-)assigned
  • …

Incompleteness

In general: a project’s proposal, or even the contract, seldom represents the project’s exact future

Inaccuracy

• Key information necessary for strategy & business development scenarios is missing
• Adding it is error-prone
  • Infer domain (big data, green energy, cloud computing, …) from keywords or content
  • Extract names automatically
  • Copy amounts manually; inconsistencies in tables in proposal text are not uncommon

Incomplete & inaccurate Data

• Ambiguity
  • When describing domain, e.g., cloud computing vs. clouds in environmental models
• Names of people and companies involved
  • Typos & OCR mistakes
  • Entity resolution
• Amounts of funding per partner, own contribution
  • Funding request may not equal funding received
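To make the name problem concrete, here is a minimal sketch of fuzzy name matching that tolerates typos and OCR mistakes. It is an illustration only, not the method used in the system described here: the suffix list, the 0.8 threshold and the function names are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"bv", "nv", "ltd", "inc", "gmbh"}

def normalise(name: str) -> str:
    """Lower-case, drop dots and commas, and strip legal suffixes."""
    name = re.sub(r"[.,]", "", name.lower())
    tokens = [t for t in name.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two mentions as the same entity if their normalised
    forms are similar enough (threshold is an assumed value)."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold

if __name__ == "__main__":
    print(same_entity("Spinque B.V.", "Spinque BV"))    # spelling variant
    print(same_entity("Spinque B.V.", "Sp1nque B.V."))  # OCR-style error
    print(same_entity("Spinque B.V.", "Philips"))       # different company
```

A fixed threshold trades false merges against missed matches; in practice the resulting pairs would be ranked and inspected rather than merged blindly.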

The real world to rescue (1)

Not much work gets done without payments…

ERP

• All large organisations deploy Enterprise Resource Planning (ERP) systems
  • Typical modules include accounting, human resources, manufacturing, and logistics
  • ERP integrates the modules, data storing/retrieving processes, and management and analysis functionalities
• Baan, Oracle, PeopleSoft, SAP, …

More complete and more accurate data from ERP

• Financial details of each project as executed
• Project leader
• People who are reimbursed from the project
• Exact duration of project activities
• ...

Step 2

Index all the ERP data with your favourite DB + IR system

• Link the ERP project identifiers to the CMS proposal identifiers
  • Surprisingly, an n:m relationship…
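Because the relationship is n:m, a single foreign key on either side is not enough; the links need their own table. A minimal sketch, assuming the links are available as (ERP id, CMS id) pairs; the identifiers below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical link pairs: (ERP project id, CMS proposal id).
links = [
    ("ERP-001", "CMS-A"),
    ("ERP-001", "CMS-B"),  # one ERP project backed by two proposals
    ("ERP-002", "CMS-B"),  # one proposal administered as two ERP projects
]

def build_link_index(pairs):
    """Index an n:m link table in both directions."""
    erp_to_cms = defaultdict(set)
    cms_to_erp = defaultdict(set)
    for erp_id, cms_id in pairs:
        erp_to_cms[erp_id].add(cms_id)
        cms_to_erp[cms_id].add(erp_id)
    return erp_to_cms, cms_to_erp

erp_to_cms, cms_to_erp = build_link_index(links)

if __name__ == "__main__":
    print(sorted(erp_to_cms["ERP-001"]))
    print(sorted(cms_to_erp["CMS-B"]))
```

In a relational database the same idea is a separate link relation with one row per pair.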

The real world to rescue (2)

Institutional Repository

• Publication metadata helps validate (and may even extend) the existing management info required:
  • Authors
  • Author affiliations
  • Projects and funding schemes (from acknowledgements)?
• Again incomplete data though…
  • Especially my faculty is notoriously bad at maintaining its part of the institutional repository

Step 3

Crawl the Institutional Repository using the Open Archives Initiative (OAI) harvesting protocol

Index all the publications data with your favourite DB + IR system

Relate projects to publications by author name, similar title, etc.
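A minimal sketch of relating projects to publications by author name and title similarity. The equal weighting of the two signals, the record layout and the example data are assumptions made for illustration; they are not taken from the system described here.

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Character-level similarity of two titles, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_score(project: dict, publication: dict) -> float:
    """Combine author overlap and title similarity (assumed 50/50 weights)."""
    people = {p.lower() for p in project["people"]}
    authors = {a.lower() for a in publication["authors"]}
    overlap = len(people & authors) / len(authors) if authors else 0.0
    return 0.5 * overlap + 0.5 * title_similarity(project["title"], publication["title"])

def rank_publications(project: dict, publications: list) -> list:
    """Return publications ordered by decreasing link score."""
    return sorted(publications, key=lambda pub: link_score(project, pub), reverse=True)

# Hypothetical records, for illustration only.
project = {"title": "Search for children", "people": ["A. Author", "B. Writer"]}
publications = [
    {"title": "Green energy forecasting", "authors": ["C. Other"]},
    {"title": "Search engines for children", "authors": ["A. Author", "D. Else"]},
]

if __name__ == "__main__":
    for pub in rank_publications(project, publications):
        print(round(link_score(project, pub), 2), pub["title"])
```

Ranking the candidates, rather than taking a hard yes/no decision, leaves the final judgement to the person inspecting the results.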

Result: Unified Access

Proposals from an XML dump of the CMS

Actual project administration from CSVs extracted from ERP

Publications crawled using OAI, from the IRP

Schema

Heterogeneous content!

• BAAN-project (ERP)
• Decos-project (CMS)
• Decos-document (CMS attachments)
• Publication (Institutional Repository)
• Publication-document (Institutional Repository PDFs)
• Person (address lists, ERP + CMS mentions)
• Company (CMS + ERP + document mentions)
• Subsidy (CMS)
• Department (address lists, CMS)
• Web addresses (extracted from documents)
• Topic (assigned to publications)
• Research programme (dependent on funding scheme)

Schema V2

How to search that graph???!

Rank (un-/semi-)structured data to deal with incompleteness & inaccuracies

Structured data representation for attributes including project revenue, people’s names, starting dates, etc.

Use cases varying from “expert search” to “data cleaning” and “visual analytics”

Search by Strategy

First, visually construct search strategies by connecting “building blocks”

Next, generate the search engine specified by that search strategy

Strategies: DB+IR query plans

• Data flow (Spinque: strategy)
• Query: strategy made operational (Spinque: PRA)
• Database (Spinque: RDBMS, MonetDB)

[Figure: a strategy built from blocks BB1(in1, in2, in3, u1, u2) and BB2(in1), each with inputs (in) and an output (out), on top of a relational DB]

CREATE VIEW a AS SELECT ..
CREATE VIEW b AS SELECT ..
CREATE VIEW c AS SELECT ..

Probabilistic Relational Algebra

[Figure: Strategy, Relational DB]

• SQL: explicit probabilities

CREATE VIEW x AS
SELECT a1, a3, 1-prod(1-prob) AS prob
FROM y
GROUP BY a1, a3;

• PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)

x = Project DISTINCT [$1,$3](y);
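The aggregate 1-prod(1-prob) in the SQL view combines the probabilities of all tuples that collapse onto the same projected key, treating them as independent events. A minimal Python sketch of that computation follows; the function name and the sample relation are invented, and columns are 0-based here whereas the PRA expression counts from $1.

```python
from collections import defaultdict

def project_distinct(rows, key_columns):
    """Probabilistic projection: for each distinct key, return
    1 - product(1 - prob) over the tuples that map to it.

    rows        -- iterable of (values, prob) pairs
    key_columns -- 0-based indexes of the columns to keep
    """
    complement = defaultdict(lambda: 1.0)  # running product of (1 - prob)
    for values, prob in rows:
        key = tuple(values[i] for i in key_columns)
        complement[key] *= 1.0 - prob
    return {key: 1.0 - c for key, c in complement.items()}

# Hypothetical relation y with attributes (a1, a2, a3) and a probability.
y = [
    (("a", "x", "p"), 0.5),
    (("a", "y", "p"), 0.5),
    (("b", "x", "q"), 0.2),
]

if __name__ == "__main__":
    # Keep a1 and a3, i.e. [$1,$3] in PRA notation.
    for key, prob in project_distinct(y, (0, 2)).items():
        print(key, prob)
```

Two tuples with probability 0.5 each that project onto the same key yield 1 - 0.5 * 0.5 = 0.75, which is the same value the GROUP BY computes.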

Rank by Text

Expert Finding

Search User Interface

Search results

Result List Interactions

• Zoom in on item using “+”:
  • Open item in left pane
  • Shows results of item as query, using a result-type specific search strategy
  • Goal to provide contextually most related nodes from underlying graph
• Marking any item red/yellow/green for later usage

Browse by facet

Strategic and business development needs

• What are our industry relations?
• Which of these partners collaborate with more than one group?
• What funding schemes support these collaborations?

Note: relations between partners and departments, edge strength represents revenue

Multi party relations

Grouping of external relations: Foreign Univ., NL Univ., Funding agency, Public NL, Public foreign, Private sector

Note: External relations with at least two departments; node size w.r.t. number of relations

Initial Findings

• The integrated search helps improve recall, reducing the effort involved and leading to higher quality analyses
• Many things that could be done even more automatically (albeit not perfectly) seem less important than expected
  • We use very simple rules to extract URIs and companies; no information extraction yet
  • Information professional will always look into results in detail
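To give an idea of what "very simple rules" could look like, here is a sketch with two regular expressions. These patterns are illustrative guesses, not the rules actually used: one picks up http(s) URIs, the other treats one to three capitalised words followed by a legal suffix as a company name.

```python
import re

# Assumed rule: anything starting with http:// or https:// up to whitespace.
URI_RULE = re.compile(r"https?://[^\s<>()]+")

# Assumed rule: 1-3 capitalised words followed by a legal suffix.
COMPANY_RULE = re.compile(
    r"\b((?:[A-Z][\w&-]*\s+){1,3})(B\.V\.|N\.V\.|GmbH|Ltd\.?|Inc\.?)"
)

def extract_uris(text: str) -> list:
    """Return URIs found in text, without trailing punctuation."""
    return [uri.rstrip(".,;") for uri in URI_RULE.findall(text)]

def extract_companies(text: str) -> list:
    """Return company mentions as 'Name Suffix' strings."""
    return [f"{name.strip()} {suffix}" for name, suffix in COMPANY_RULE.findall(text)]

if __name__ == "__main__":
    sample = ("We work with Spinque B.V. and Example Widgets GmbH; "
              "see http://www.example.org/project.")
    print(extract_uris(sample))
    print(extract_companies(sample))
```

Rules like these miss companies without a legal suffix and produce some false hits, which is acceptable when a person inspects the results anyway.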

Open issues

• Integrate visualization
  • Idea: select result list and facet
• Too many facets
  • Idea: group facets
• Result explanations
  • Idea: describe path through graph
• Entity support ++

Open issues

• What strategy is good? Why?
  • Idea: test using past usage data
• What are the right user roles?
  • Who should do the searches?
  • Who should write strategies?
    • ~ who writes the SQL queries in traditional DB?
• Human in the loop for retrieval, but not yet for indexing…

Questions?
