May 31st, 2013 First SICSA MMI Information Retrieval Workshop
Looking beyond plain text for document representation in the enterprise
Arjen P. de Vries
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.
Outline
• Motivation
• Mixed structured and unstructured sources
• Search by strategy
• Equip
• Open ends
Enterprise Information Needs
Hang Li et al. A new approach to intranet search based on information extraction. CIKM’05
Strategic and business development needs
• What funding schemes are the primary source of income?
  • E.g., can we move to Europe when Dutch funding dries up?
• Who has active relations with partner X?
  • “Valorisation”; new national funding requirements
• What industry sectors do we depend upon?
  • E.g., how many projects in smart cities? Green energy? Cloud computing? Etc.
• How are strategic decisions implemented?
  • E.g., has objective “move from Telecom toward ICT” been achieved, and how does it develop over time?
A week in the life
Date: Wed, 15 May 2013 15:14:49 +0200
From: Theme Coordinator “INFORMATION”
To: Group Leaders Information Theme
Subject: List of company relations for internal CWI distribution
Dear Information Theme Group Leaders,
The theme coordinators have been asked whether they "can make a short list of the company contacts, indicating the nature of each contact".
Could you send me the names of Dutch companies you are currently working with or have worked with in the recent past by the end of Friday 17th May.
The Theme Coordinator
Date: Fri, 24 May 2013 11:33:04 +0200
From: Theme Coordinator Life Sciences
To: Group Leaders Life Sciences Team
Subject: Life Sciences: contacts with NL companies?
Dear all,
The CWI themes are currently collecting all contacts we have with Dutch industry and companies (but also hospitals and TNO etc.) in order to get an overview. I am doing this for the theme "Life Sciences". Can you please send me a list of your contacts with short description?
Life Sciences Theme Coordinator
From: Project Leader Project X
Date: Sun, 26 May 2013 17:34:15 +0200
To: Project X
Subject: [Project X: 33] @WP leaders
X-BeenThere: Project X @ Y.org
Dear WP leaders,
I received the following request from Programme Management:
> May I ask you to send me a short list of the EU research and the international research under way at the partners that is related to Project X (international embedding).
This is my most urgent item. Could you send me, as soon as possible, a short list covering the following points:
- a list of running EU projects that involve people from your WP; please indicate who the partners are, the funding source, whether it is a STREP (or NoE or ...), and whether your WP provides a participant or the coordinator;
- a list of EU projects applied for, with the same details;
- a list of any other international collaborations that are not covered by a formal project.
Please send me the lists as soon as possible, but no later than Tuesday 18:00.
Thanks for your help.
The Project Leader
Surely, academia is not like…
The High Cost of Not Finding Info
• If you employ 1000 knowledge workers:
  • 50% of content unindexed: $2.5 million/year
  • 6.25% of effort is spent reproducing information that already exists: $5 million/year
• Knowledge workers spend 15-25% of their time on non-productive information-related activities
Feldman and Sherman. IDC Technical Report #29127, 2003
Butler Group Report: Enterprise Search and Retrieval. Oct-2006: “many organisations are frittering away up to 10% of their staff costs on wasted effort because employees simply can’t find the right information to do their jobs.”
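The $5 million/year figure for reproduced information can be checked with simple arithmetic. The sketch below assumes an average cost of $80,000 per knowledge worker per year; that amount is not stated on the slide and is chosen here only because it makes the quoted numbers consistent.

```python
def rework_cost(workers: int, cost_per_worker: float, rework_share: float) -> float:
    """Yearly cost of effort spent recreating information that already exists."""
    return workers * cost_per_worker * rework_share

# ASSUMPTION: $80,000 per worker per year (not given in the slides).
yearly_cost = rework_cost(workers=1000, cost_per_worker=80_000, rework_share=0.0625)
print(f"${yearly_cost:,.0f} per year")
```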
So… “the real world”
“Real” companies (as opposed to academic institutions) attempt to address these information needs a priori, by setting up a Customer Relationship Management system (CRM)
Shan L. Pan and Jae-Nam Lee, "Using e-CRM for a unified view of the customer", Communications of the ACM 46(4) (2003): 95-99
However…
So-called “Professionals” are well known to focus on their own expertise
They do not have (or take) the time to maintain adequate descriptions of their network, skills, projects etc., nor to handle most other types of “management overhead”
We only need to organize ourselves!!
Funding Proposals
• Proposals submitted (are supposed to) pass by the faculty’s (TUD) “contract managers” or the institute’s (CWI) “project bureau”
  • E.g., checks for liability, IPR and valid budget
• Proposal and (partial) metadata are added to a content management system (CMS)
  • The CMS used at my faculty at TUD is DECOS; a few other faculties plan to use Microsoft SharePoint; CWI deploys BSCW
Step 1
Index all the proposals submitted with your favourite IR system
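As a rough illustration of Step 1, the sketch below builds a tiny inverted index with TF-IDF scoring. It only stands in for "your favourite IR system"; the class name `TinyIndex`, the proposal identifiers and the proposal texts are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_count = 0

    def add(self, doc_id, text):
        self.doc_count += 1
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        """Return (doc_id, score) pairs, best first, using a simple TF-IDF sum."""
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical proposals, not taken from any real CMS.
index = TinyIndex()
index.add("P1", "Cloud computing platform for smart cities")
index.add("P2", "Green energy forecasting with cloud models")
index.add("P3", "Information retrieval for children")
results = index.search("cloud computing")
```

Here `P1` ranks above `P2` because it matches both query terms, and `P3` is not returned at all.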
Incompleteness
• The DECOS metadata entered is usually incomplete from the start
  • For many projects, for example, only the coordinator is entered as partner
• Also, a proposal’s metadata does not reflect subsequent change; e.g., as in PuppyIR:
  • People hired after funding secured
  • Partner change when key person moved job
  • Teams evolved
  • Priorities shifted
  • New tasks introduced and tasks (re-)assigned
  • …
Incompleteness
In general: a project’s proposal, or even the contract, seldom represents the project’s exact future
Inaccuracy
• Key information necessary for strategy & business development scenarios is missing
• Adding it is error-prone
  • Infer domain (big data, green energy, cloud computing, …) from keywords or content
  • Extract names automatically
  • Copy amounts manually; inconsistencies in tables in proposal text are not uncommon
Incomplete & inaccurate Data
• Ambiguity
  • When describing domain, e.g., cloud computing vs. clouds in environmental models
• Names of people and companies involved
  • Typos & OCR mistakes
  • Entity resolution
• Amounts of funding per partner, own contribution
  • Funding request may not equal funding received
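To make the entity resolution problem concrete, here is a minimal sketch that groups name variants caused by typos or OCR mistakes. It uses Python's standard `difflib.SequenceMatcher` with a greedy clustering rule; the threshold of 0.85 and the misspelled variants are assumptions made for the example, not a description of what the system in these slides does.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lower-case and drop punctuation and spaces, so 'B.V.' equals 'BV'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def cluster_names(names, threshold=0.85):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Invented variants: one OCR-style error and one typo.
mentions = [
    "Spinque B.V.",
    "Delft University of Technology",
    "Spinqne BV",
    "Delft Univerity of Technology",
]
clusters = cluster_names(mentions)
```

A greedy rule like this is order-dependent and will merge genuinely different but similarly named entities, which is one reason a human still inspects the results.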
The real world to rescue (1)
Not much work gets done without payments…
ERP
• All large organisations deploy Enterprise Resource Planning (ERP) systems
  • Typical modules include accounting, human resources, manufacturing, and logistics
  • ERP integrates the modules, data storing/retrieving processes, and management and analysis functionalities
• Baan, Oracle, PeopleSoft, SAP, …
More complete and more accurate data from ERP
• Financial details of each project as executed
• Project leader
• People who are reimbursed from the project
• Exact duration of project activities
• ...
Step 2
• Index all the ERP data with your favourite DB + IR system
• Link the ERP project identifiers to the CMS proposal identifiers
  • Surprisingly, an n:m relationship…
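Because the link between ERP projects and CMS proposals is n:m, a single foreign key on either side is not enough; a link table that can be queried in both directions is needed. The sketch below shows one way to hold such links in memory. All identifiers are hypothetical, and the slides do not say why the relationship is n:m.

```python
from collections import defaultdict

def build_links(pairs):
    """Build lookups in both directions from (erp_id, cms_id) link pairs."""
    erp_to_cms = defaultdict(set)
    cms_to_erp = defaultdict(set)
    for erp_id, cms_id in pairs:
        erp_to_cms[erp_id].add(cms_id)
        cms_to_erp[cms_id].add(erp_id)
    return erp_to_cms, cms_to_erp

# Hypothetical link table: ERP-100 maps to two proposals,
# and proposal CMS-7 maps to two ERP projects.
links = [("ERP-100", "CMS-7"), ("ERP-101", "CMS-7"), ("ERP-100", "CMS-9")]
erp_to_cms, cms_to_erp = build_links(links)
```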
The real world to rescue (2)
Institutional Repository
• Publication metadata helps validate (and may even extend) the existing management info required:
  • Authors
  • Author affiliations
  • Projects and funding schemes (from acknowledgements)?
• Again incomplete data though…
  • My faculty especially is notoriously bad at maintaining its part of the institutional repository
Step 3
Crawl the Institutional Repository using the Open Archives Initiative (OAI) harvesting protocol
Index all the publications data with your favourite DB + IR system
Relate projects to publications by author name, similar title, etc.
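The harvesting part of Step 3 can be sketched with the standard library alone. The example below builds an OAI-PMH `ListRecords` request URL and parses Dublin Core titles and creators from a response. The repository URL and the sample record are invented, and the response is parsed from an embedded string so that the sketch runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, resumption_token=None):
    """In OAI-PMH, resumptionToken is an exclusive argument next to the verb."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return f"{base_url}?{urlencode(params)}"

def parse_records(xml_text):
    """Return (records, next_token); next_token is None on the last page."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "id": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in record.iterfind(".//dc:creator", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)

# Invented response, shaped like an oai_dc ListRecords page.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical publication</dc:title>
          <dc:creator>Doe, J.</dc:creator>
          <dc:creator>Roe, R.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.org/oai")  # hypothetical endpoint
records, next_token = parse_records(SAMPLE)
```

A real harvester would fetch `url`, parse the page, and repeat with `list_records_url(base, next_token)` until no token is returned. The creator names collected this way are the input for relating publications to projects by author name.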
Result: Unified Access
Proposals from an XML dump of the CMS
Actual project administration from CSVs extracted from ERP
Publications crawled using OAI, from the IRP
Schema
Heterogeneous content!
• BAAN-project (ERP)
• Decos-project (CMS)
• Decos-document (CMS attachments)
• Publication (Institutional Repository)
• Publication-document (Institutional Repository PDFs)
• Person (address lists, ERP + CMS mentions)
• Company (CMS + ERP + document mentions)
• Subsidy (CMS)
• Department (address lists, CMS)
• Web addresses (extracted from documents)
• Topic (assigned to publications)
• Research programme (dependent on funding scheme)
Schema V2
How to search that graph???!
Rank (un-/semi-)structured data to deal with incompleteness & inaccuracies
Structured data representation for attributes including project revenue, people’s names, starting dates, etc.
Use cases varying from “expert search” to “data cleaning” and “visual analytics”
Search by Strategy
First, visually construct search strategies by connecting “building blocks”
Next, generate the search engine specified by that search strategy
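The idea of connecting building blocks can be illustrated with plain function composition. In the sketch below each block consumes and produces a scored set of items, and an expert-finding strategy is just a chain of blocks. The block names, the scoring rule and the toy data are all invented for illustration; this is not Spinque's API.

```python
from collections import defaultdict

# Toy data (invented): document texts and an authorship relation.
DOCS = {
    "d1": "probabilistic databases",
    "d2": "databases for retrieval",
    "d3": "green energy",
}
AUTHORED_BY = {"d1": ["alice"], "d2": ["alice", "bob"], "d3": ["carol"]}

def rank_by_text(query):
    """Block: score each document by the fraction of query terms it contains."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in DOCS.items():
        words = set(text.split())
        hits = sum(1 for term in terms if term in words)
        if hits:
            scores[doc_id] = hits / len(terms)
    return scores

def traverse(scores, edges):
    """Block: follow edges and sum the incoming scores at each target node."""
    out = defaultdict(float)
    for source, score in scores.items():
        for target in edges.get(source, []):
            out[target] += score
    return dict(out)

def top_k(scores, k):
    """Block: keep the k best items, ties broken by identifier."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

def expert_finding(query, k=2):
    """Strategy: rank documents by text, move to their authors, keep the best."""
    return top_k(traverse(rank_by_text(query), AUTHORED_BY), k)

experts = expert_finding("databases retrieval")
```

Swapping `AUTHORED_BY` for another relation, or inserting a filter block between the steps, yields a different search engine from the same parts.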
Strategies: DB+IR query plans
• Data flow. Spinque: strategy
• Query: strategy made operational. Spinque: PRA
• Database. Spinque: RDBMS (MonetDB)
[Figure: building blocks BB1(in1, in2, in3, u1, u2) and BB2(in1), each with its inputs and a single output "out"]
CREATE VIEW a AS SELECT ..
CREATE VIEW b AS SELECT ..
CREATE VIEW c AS SELECT ..
[Figure labels: Strategy; Relational DB]
Probabilistic Relational Algebra
• SQL: explicit probabilities
CREATE VIEW x AS
  SELECT a1, a3, 1-prod(1-prob) AS prob
  FROM y
  GROUP BY a1, a3;
• PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)
x = Project DISTINCT [$1,$3](y);
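The aggregate `1-prod(1-prob)` treats the duplicate tuples that collapse into one group as independent events and computes the probability that at least one of them holds. A minimal Python sketch of that projection follows; the relation `y` and its probabilities are made up, and the function name `project_distinct` is chosen here only to mirror the PRA expression.

```python
import math
from collections import defaultdict

def project_distinct(relation, columns):
    """Project onto `columns` (0-based) and merge duplicates as independent events.

    relation: list of (values, prob) pairs, where values is a tuple.
    """
    groups = defaultdict(list)
    for values, prob in relation:
        key = tuple(values[i] for i in columns)
        groups[key].append(prob)
    # P(at least one duplicate holds) = 1 - product of (1 - p_i)
    return {key: 1 - math.prod(1 - p for p in probs) for key, probs in groups.items()}

# Invented relation y with attributes (a1, a2, a3) and a probability per tuple.
y = [
    (("a", "x", 1), 0.5),
    (("a", "z", 1), 0.5),
    (("b", "x", 2), 0.2),
]
x = project_distinct(y, columns=[0, 2])  # $1 and $3 in PRA's 1-based notation
```

The two tuples that share `("a", 1)` combine to 1 - 0.5 × 0.5 = 0.75, while the single `("b", 2)` tuple keeps its probability of 0.2.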
Rank by Text
Expert Finding
Search User Interface
Search results
Result List Interactions
• Zoom in on item using “+”:
  • Opens the item in the left pane
  • Shows results of item as query, using a result-type specific search strategy
  • Goal: provide the contextually most related nodes from the underlying graph
• Marking any item red/yellow/green for later usage
Browse by facet
Strategic and business development needs
• What are our industry relations?
• Which of these partners collaborate with more than one group?
• What funding schemes support these collaborations?
Note: relations between partners and departments, edge strength represents revenue
Multi party relations
Grouping of external relations: Foreign univ.; NL univ.; Funding agency; Public NL; Public foreign; Private sector
Note: External relations with at least two departments; node size w.r.t. number of relations
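The data behind such a visualisation can be derived with two small aggregations: sum revenue per (partner, department) edge, then keep the partners connected to at least two departments. The rows below are invented, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical (partner, department, revenue) rows from the integrated data.
relations = [
    ("Company A", "Dept 1", 100_000),
    ("Company A", "Dept 2", 50_000),
    ("Company A", "Dept 1", 25_000),
    ("Company B", "Dept 1", 80_000),
]

def edge_strengths(rows):
    """Edge strength: total revenue per (partner, department) pair."""
    edges = defaultdict(int)
    for partner, department, revenue in rows:
        edges[(partner, department)] += revenue
    return dict(edges)

def multi_department_partners(edges, minimum=2):
    """Node size: number of departments, for partners with at least `minimum`."""
    departments = defaultdict(set)
    for partner, department in edges:
        departments[partner].add(department)
    return {p: len(d) for p, d in departments.items() if len(d) >= minimum}

edges = edge_strengths(relations)
shared = multi_department_partners(edges)
```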
Initial Findings
• The integrated search helps improve recall, reducing the effort involved and leading to higher quality analyses
• Many things that could be done even more automatically (albeit not perfectly) seem less important than expected
  • We use very simple rules to extract URIs and companies; no information extraction yet
  • Information professional will always look into results in detail
Open issues
• Integrate visualization
  • Idea: select result list and facet
• Too many facets
  • Idea: group facets
• Result explanations
  • Idea: describe path through graph
• Entity support ++
Open issues
• What strategy is good? Why?
  • Idea: test using past usage data
• What are the right user roles?
  • Who should do the searches?
  • Who should write strategies?
    • ~ who writes the SQL queries in traditional DB?
• Human in the loop for retrieval, but not yet for indexing…
Questions?