Enhance legal retrieval applications with an automatically induced knowledge base
Ka Kan Lo
Contents
Introduction
Practice in legal retrieval
Generation of background concepts
Combining concepts and contexts
Conclusion
Introduction
Why do we need advanced legal retrieval and e-discovery?
Document collections
Legal requirements
Efficiency
Introduction
What are the challenges?
Explosive growth in document volume
Wide range of document sources
Expanding range of document formats
Informal language
Introduction
Opportunities:
Use of background contexts
Search documents deeply for every possible piece of evidence
Example (TREC): the complaint as background information
More context information: the Web and its links
Practice in Retrieval Process
TREC legal track practice:
Defendants devise queries
Plaintiffs' turn
Final queries for the production request
Documents retrieved
Practice in Retrieval Process
What can be added to the process?
Exploit the background information – complaints
Merge with the larger background – Web and links
Proposal in this work – Use Wikipedia as an example
Modeling
Generation of Background concepts
Representation of Background concepts:
Entities & Relations
Ease the conversion from texts to concepts
Facilitate unsupervised operations
Generation of Background concepts
Concept source: Wikipedia
Page: a document
Title: the central concept described by a document
Links: a set of concepts / terms pointing to other pages
Words: the set of words in a page
Generation of Background concepts
Facilitate lexical realization from texts to concepts:
Surface concepts: Mentioned by a page
Hidden concepts: Indexed by no page but occurring within pages
Generation of Background concepts
Entities:
Basic objects – named entities, locations, organizations, …
Definitions: e ⊂ c, e ≠ r, e ∈ roles of relations
Generation of Background concepts
Relations:
Relationships between concepts
Definitions: r ⊂ c, r ≠ e, r = <role_1, role_2, role_3>, role_i = e
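The definitions of entities and relations can be read as two disjoint kinds of concept, where each of a relation's three roles is filled by an entity. The sketch below encodes that reading; the class names, the `label` field, and the example sentence are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Entity:
    """A basic object: a named entity, location, organization, and so on."""

    name: str


@dataclass(frozen=True)
class Relation:
    """A relationship between concepts; each of its three roles is an entity."""

    label: str  # hypothetical field naming the relationship
    role1: Entity
    role2: Entity
    role3: Entity

    def roles(self) -> tuple[Entity, Entity, Entity]:
        return (self.role1, self.role2, self.role3)


# Entities and relations are both concepts, but no object is both (e ≠ r).
Concept = Union[Entity, Relation]

plaintiff = Entity("plaintiff")
complaint = Entity("complaint")
court = Entity("court")
rel = Relation("file", plaintiff, complaint, court)
print([e.name for e in rel.roles()])
```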
Semantic Domain
Semantic Domain:
Group of inter-related concepts, as defined by Wikipedians
Groups can be configured and reconfigured depending on the size and nature of the domain
Represents background information of different sizes, natures, and structures
Semantic Domain
Operations:
D = {page_i} where page_i ∈ E
Overlap
Subsumed
Join
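The slide names the three domain operations without defining them. One plausible reading, assumed here rather than taken from the source, treats a domain as a set of page titles, so that overlap is set intersection, subsumption is a subset test, and join is set union. The function names and the toy domains are hypothetical.

```python
from __future__ import annotations

Domain = frozenset  # a domain D = {page_i}, represented by its page titles


def overlap(d1: frozenset[str], d2: frozenset[str]) -> frozenset[str]:
    """Pages shared by both domains (assumed reading: intersection)."""
    return d1 & d2


def subsumed(d1: frozenset[str], d2: frozenset[str]) -> bool:
    """True when every page of d1 also belongs to d2 (assumed reading: subset)."""
    return d1 <= d2


def join(d1: frozenset[str], d2: frozenset[str]) -> frozenset[str]:
    """Merge two domains into a larger one (assumed reading: union)."""
    return d1 | d2


d_law = Domain({"Complaint", "Plaintiff", "Court"})
d_civil = Domain({"Complaint", "Plaintiff"})
d_media = Domain({"Advertising", "Court"})

print(sorted(overlap(d_law, d_media)))
print(subsumed(d_civil, d_law))
print(sorted(join(d_civil, d_media)))
```

Representing domains as immutable sets keeps reconfiguration cheap: a new domain is just another combination of existing ones.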
Knowledge Extraction, Parsing
Parsing:
Conversion of syntactic parses into concept representations
Dependency parsing
Fill the entities and relations automatically
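To make the conversion from a dependency parse to a relation concrete, the sketch below fills three role slots from hand-written dependency arcs. No parser is called: the arcs, the Universal Dependencies style labels (`nsubj`, `obj`, `obl`), and the mapping of those labels onto role slots are all assumptions chosen for illustration.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class Arc(NamedTuple):
    head: str
    deprel: str
    dependent: str


class Relation(NamedTuple):
    predicate: str
    role1: Optional[str]
    role2: Optional[str]
    role3: Optional[str]


# Hypothetical mapping from dependency labels to the three role slots.
ROLE_SLOTS = {"nsubj": "role1", "obj": "role2", "obl": "role3"}


def arcs_to_relations(arcs: list[Arc]) -> list[Relation]:
    """Group role-bearing arcs by their head word and emit one relation per head."""
    filled: dict[str, dict[str, str]] = {}
    for arc in arcs:
        slot = ROLE_SLOTS.get(arc.deprel)
        if slot is None:
            continue  # determiners, case markers, etc. carry no role
        # Keep the first filler seen for each slot of this head.
        filled.setdefault(arc.head, {}).setdefault(slot, arc.dependent)
    return [
        Relation(pred, slots.get("role1"), slots.get("role2"), slots.get("role3"))
        for pred, slots in filled.items()
    ]


# Hand-written arcs for "The plaintiff filed a complaint in court."
arcs = [
    Arc("plaintiff", "det", "The"),
    Arc("filed", "nsubj", "plaintiff"),
    Arc("complaint", "det", "a"),
    Arc("filed", "obj", "complaint"),
    Arc("court", "case", "in"),
    Arc("filed", "obl", "court"),
]
relations = arcs_to_relations(arcs)
print(relations)
```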
Entities & Relations
Highlights of the process:
Syntactic parsing of sentences
Conversion from linguistic representation to concept representation
Constrain the concept spaces by different sizes and scopes
Combining the concepts and background contexts
Algorithms:
Filter the background text and request text
Match the term set to Wikipedia
Build the network of concepts and relations
Combine into a single network and filter unnecessary concepts
Extract terms and concepts and expand the query string
Fire the query to the retrieval system
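The steps above can be sketched end to end on toy data. Everything concrete here is an assumption made for illustration: the stopword list, the tiny `TOY_WIKI` link table standing in for Wikipedia, and the filtering rule (keep only concepts reachable from both the background text and the request text). The final step, sending the query to a retrieval system, is left out.

```python
from __future__ import annotations

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "all", "about", "that", "for"}

# Toy stand-in for Wikipedia: lowercased page title -> titles it links to.
TOY_WIKI = {
    "tobacco": {"nicotine", "cigarette"},
    "cigarette": {"tobacco", "nicotine", "advertising"},
    "advertising": {"marketing"},
    "nicotine": {"addiction"},
}


def filter_terms(text: str) -> set[str]:
    """Step 1: lowercase, strip punctuation, drop stopwords."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return {t for t in tokens if t and t not in STOPWORDS}


def match_concepts(terms: set[str], wiki: dict[str, set[str]]) -> set[str]:
    """Step 2: keep the terms that are page titles."""
    return {t for t in terms if t in wiki}


def build_network(concepts: set[str], wiki: dict[str, set[str]]) -> dict[str, set[str]]:
    """Step 3: link each matched concept to the concepts its page points to."""
    return {c: set(wiki[c]) for c in concepts}


def nodes(net: dict[str, set[str]]) -> set[str]:
    out = set(net)
    for neighbours in net.values():
        out |= neighbours
    return out


def expand_query(request: str, background: str, wiki: dict[str, set[str]]) -> str:
    req_terms = filter_terms(request)
    bg_terms = filter_terms(background)
    req_net = build_network(match_concepts(req_terms, wiki), wiki)
    bg_net = build_network(match_concepts(bg_terms, wiki), wiki)
    # Step 4 (assumed rule): keep concepts present in both networks.
    kept = nodes(bg_net) & nodes(req_net)
    # Step 5: append the kept concepts not already in the request.
    expansion = sorted(kept - req_terms)
    return " ".join([request] + expansion)


query = expand_query(
    "All documents about tobacco marketing",
    "The complaint alleges that cigarette advertising targeted minors.",
    TOY_WIKI,
)
print(query)
```

In this toy run the request mentions only tobacco, while the complaint mentions cigarette advertising; the shared neighbourhood contributes "cigarette" and "nicotine" to the expanded query string.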
Conclusion
Challenges in legal retrieval
Background contexts
Generation of background concepts
Project the context to concepts
Expand the queries for retrieval
Conclusion
Current work:
Integration of language learning (not only parsing) and the concept generation process
Large-scale construction of networks with the full document set in 3 languages on Grid:
English: 1.7 million
Spanish: 300 thousand
Chinese: 200 thousand
Conclusion
Current work:
Experiments running on a 20M web page corpus for expanded links
Generated language and concept spaces used in other Natural Language Technologies (NLT)
TREC-Legal: Testing the integration of the knowledge base with the complaint text for queries
TREC-Legal: Building a new matching mechanism (from KB induction) on a small, concise set of documents
Thank you
Q&A