Text Classification & Summarization
Kornél Markó, Florian Schmedding Averbis GmbH
Germany
FP7-ICT-2013-SME-DCA
Background
• WP 3: Provision of a Natural Language Processing Toolkit for processing legal documents in
– Bulgarian (IICT-BAS)
– English, French, German (Averbis)
– Italian (UNITO)
• Levels of linguistic pre-processing
– Sentence Splitting, Tokenization, POS-Tagging, Decompounding, Named Entity Recognition, Concept Mapping, Link Detection
• Text Classification and Text Summarization
Multilingual EUCases NLP Pipeline
• Example input (German): "1. Am 17. August 2011 erhob der Beschwerdeführer, ein mit einer …"
• Linguistic pre-processing: Sentence detection, Tokenizer, Stopword Tagger, Stemmer, Morpho-Semantic Segmenter, Part-of-Speech Tagger, Chunker
• Regular Expression Annotator
• Concept Mappers, one per multilingual terminology: Syllabus, EuroVoc, Geonames
• Textrank Descriptor Extraction
• Text Summary
Examples
• What are these documents about?
• Are they related to each other?
Keyword Extraction
• Keywords of the first example document: Employer, Employment, Labour relations, Wage earner, Governance, Work
• Keywords of the second example document: Work, Work contract, Labour tribunal, Employer, Contract, Wage earner
Text summary
His previous contract of employment was not varied, because the second contract was made between different parties. The dispute relates to the interpretation and application of article 3(1) of Council Directive 77/187/EEC on the approximation of the laws of the Member States relating to the safeguarding of employees' rights in the event of transfers of undertakings, businesses or parts of businesses ("the Acquired Rights Directive"), which has now been consolidated with subsequent amendments and repealed by Council Directive 2001/23/EC. The TECs were to plan and deliver training and to promote and support the development of small businesses and self-employment within their area under contracts with the government. And it went further still when it ruled in para 2, drawing on its previous case law, that contracts of employment existing on the date of the transfer between the transferor and the workers assigned to the undertaking transferred are deemed to be handed over on that date from the transferor to the transferee regardless of what has been agreed between the parties in that respect.
NLP Backend
Text Rank (Principles)
• Graph-based algorithm inspired by Google's PageRank
– Nodes are words (concepts)
– Edges represent relations to other words (concepts)
• Co-occurrences within sentences in the document
• Iterate a graph-based ranking algorithm to give each node a weight (in the simplest case, by counting its neighbouring vertices)
• Sort the nodes by their final score
• Sentence 1: "If an employer transferred his business undertaking to another party, the position at common law of an employee who worked for the first employer before the transfer and for the new employer after it was in principle clear."
– New concepts: Employer, Employee, Common Law
• Sentence 2: "His previous contract of employment was not varied, because the second contract was made between different parties. But the first contract was the subject of an express or implied novation, involving the termination of the first contract and its replacement by a new contract."
– New concepts: Contract, Employment, Replacement
• Sentence 3: "But it could work disadvantageously to the employee in any situation where his rights depended on showing that his employment had been continuous for a given period, since a novation necessarily involved a discontinuity. It was this disadvantage which the legislation now under consideration was intended to obviate."
– New concepts: Rights, Legislation
• Sentence 4: "But its effect is, inevitably, to introduce a fictional element into this tripartite relationship, since (where the legislative conditions are satisfied) the employee is treated as having been employed by the new employer all along and ex hypothesi such is not the case. The European Court of Justice [2005] IRLR 647 acknowledges this in para 43 of its judgment."
– New concepts: Relationship, Conditions, Justice, Judgment
Text Rank
• Concepts in the graph: Employer, Employee, Common Law, Contract, Employment, Replacement, Rights, Legislation, Relationship, Conditions, Justice, Judgment
• [Figure: co-occurrence graph over these concepts with node weights 4, 3, 1, 1, 6, 2, 3, 0, 2, 3, 1, 2; the assignment of weights to concepts is not preserved in the text]
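The ranking idea from the Text Rank slides can be sketched in a few lines. This is a minimal illustration, assuming the usual PageRank-style update with a damping factor of 0.85 on an undirected, unweighted co-occurrence graph. The function name `textrank`, its parameters, and the toy sentences are assumptions made for this example; they are not the EUCases implementation and do not reproduce the graph from the slides.

```python
from itertools import combinations


def textrank(sentences, damping=0.85, iterations=50):
    """Rank concepts by a PageRank-style score on a co-occurrence graph.

    sentences: list of lists of concepts, one inner list per sentence.
    Returns (concept, score) pairs sorted by descending score.
    """
    # Nodes are concepts; an undirected edge links two concepts that
    # co-occur in at least one sentence.
    neighbours = {}
    for concepts in sentences:
        for concept in concepts:
            neighbours.setdefault(concept, set())
        for a, b in combinations(sorted(set(concepts)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Iterate the ranking update; every node starts with the same weight.
    scores = {concept: 1.0 for concept in neighbours}
    for _ in range(iterations):
        scores = {
            concept: (1 - damping) + damping * sum(
                scores[n] / len(neighbours[n]) for n in neighbours[concept]
            )
            for concept in neighbours
        }

    # Sort by the final score, highest first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input (hypothetical, not the graph from the slides).
toy_sentences = [
    ["employer", "employee", "contract"],
    ["employee", "contract", "rights"],
    ["employee", "legislation"],
]
ranking = textrank(toy_sentences)
for concept, score in ranking:
    print(f"{concept}: {score:.2f}")
```

On this toy input, `employee` ranks first because it co-occurs with every other concept, and `legislation` ranks last because it has a single neighbour.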
Text Summaries
• Based on the same principle…
– Recognize sentences
– Detect the most important words (concepts) in the document by using Text Rank
– Select the n sentences containing these top terms and sort them by their position in the document
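The three steps above can be sketched as follows. The slides do not say how sentences are scored against the top terms, so this sketch assumes the simplest option: count how many top terms a sentence contains. The function name `summarize`, the input format, and the sample data are all hypothetical.

```python
def summarize(sentences, top_terms, n=2):
    """Pick the n sentences with the most top terms, in document order.

    sentences: list of (text, concepts) pairs in document order.
    top_terms: the highest-ranked concepts of the document.
    """
    top = set(top_terms)
    # Score each sentence by the number of top terms it contains
    # (assumption: the source only says "containing these top terms").
    scored = [
        (len(top & set(concepts)), index)
        for index, (_, concepts) in enumerate(sentences)
    ]
    # Best sentences first; ties go to the earlier sentence.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]
    # Restore document order and skip sentences without any top term.
    chosen = sorted(index for hits, index in best if hits > 0)
    return [sentences[index][0] for index in chosen]


# Hypothetical document: sentence text plus the concepts detected in it.
document = [
    ("The employer transferred the business.", ["employer", "employee"]),
    ("The hearing took place in spring.", ["hearing"]),
    ("The employee kept the contract and all rights.",
     ["employee", "contract", "rights"]),
    ("A new contract was signed.", ["contract"]),
]
summary = summarize(document, top_terms=["employee", "contract"], n=2)
print(" ".join(summary))
```

Sorting the selected sentences by their original index is what keeps the summary readable: the sentences appear in the order the author wrote them, not in order of score.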
Evaluation
Preliminary Results: Summaries
Preliminary Results: Keywords
Conclusion & Discussion
• Text Rank is an effective and elegant way to compute the "importance" of terms.
– Language independent
– Unsupervised
• In addition, a machine-learning based approach for text classification is provided by UNITO
• Preliminary results are very encouraging
– Text Summaries: rated "very useful" or "useful"
• > 80% for de, en, fr
• > 60% for it, bg
– Keyword Extraction (ongoing work):
• Further improvement necessary for all languages
• Different coverage for the languages (e.g. EuroVoc)
Thanks!
• Questions?
• Contact
D3.1 NLP Toolkit
Kornél Markó, Averbis
FP7-ICT-2013-SME-DCA
D3.1 NLP Toolkit
• Selection of Apache Unstructured Information Management Architecture (UIMA) as a framework for processing legal documents in Bulgarian, English, French, German, and Italian
• Provision of …
– language-specific analysis engines (modules) by partners: Sentence Splitting, Tokenization, POS-Tagging, NER, concept mapping, …
– wrappers for non-UIMA components
– a common type system and mappings to a universal tagset
NLP Toolkit: UIMA Framework
• CAS basics
– Input: input text, e.g. "Hello world!"
– Analysis Engine: text analysis task, e.g. sentence detection, POS tagging, etc.
– Annotations: annotations about the text, e.g. Noun: Pos: 6,11 ("world")
– CAS: Common Analysis Structure containing text and annotations using a unique type system
• Pipeline
– Example input (German): "1. Am 17. August 2011 erhob der Beschwerdeführer, ein mit einer …"
– Sentence detection, Tokenizer, Part-of-Speech Tagger
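The CAS idea (one text plus stand-off annotations that every analysis engine reads and extends) can be illustrated without the UIMA framework itself. The sketch below is plain Python, not the UIMA API: the classes `Cas` and `Annotation`, the two toy engines, and the noun lexicon are hypothetical names chosen for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str   # annotation type name, e.g. "Token" or "Noun"
    begin: int  # character offset where the annotation starts
    end: int    # character offset just past the annotated text


@dataclass
class Cas:
    """Toy stand-in for a Common Analysis Structure: text plus annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def tokenizer(cas):
    # Analysis engine 1: add one Token annotation per word or punctuation mark.
    for match in re.finditer(r"\w+|[^\w\s]", cas.text):
        cas.annotations.append(Annotation("Token", match.start(), match.end()))


TOY_NOUNS = {"world"}  # hypothetical lexicon, for illustration only


def noun_tagger(cas):
    # Analysis engine 2: reads Tokens written by the tokenizer, adds Nouns.
    for token in cas.select("Token"):
        if cas.covered_text(token).lower() in TOY_NOUNS:
            cas.annotations.append(Annotation("Noun", token.begin, token.end))


# A pipeline is a sequence of engines applied to the same CAS.
cas = Cas("Hello world!")
for engine in (tokenizer, noun_tagger):
    engine(cas)

nouns = cas.select("Noun")
print(nouns, cas.covered_text(nouns[0]))
```

The resulting `Noun` annotation covers offsets 6 to 11, matching the "Hello world!" example on the slide. Because annotations only store offsets, later engines can build on earlier ones without modifying the text.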
NLP Toolkit: EUCases Pipeline
• Example input (German): "1. Am 17. August 2011 erhob der Beschwerdeführer, ein mit einer …"
• Linguistic pre-processing: Sentence detection, Tokenizer, Stopword Tagger, Stemmer, Morpho-Semantic Segmenter, Part-of-Speech Tagger, Chunker
• Regular Expression Annotator
• Concept Mappers, one per terminology: LT Syllabus, EuroVoc, Geonames
• Textrank Descriptor Extraction
• Text Summary
NLP Toolkit: Challenges & Solutions
• Averbis supports German, English, and French (UIMA compliant)
• UNITO has superior parsers for Italian
• IICT-BAS supports Bulgarian
Approach:
• Integrating UNITO's and IICT-BAS's components into the UIMA pipeline…
– … by wrapping them as UIMA analysis engines
– … makes concept mapping, descriptor extraction, and text summarization (all language-independent) available for Italian and Bulgarian documents
NLP Toolkit: Wrappers
• Send text to wrapped parser
• Lift custom annotations to common type system
– Example: Unito Sentence Detection returns a custom annotation, which the wrapper converts into a CAS annotation of the common type system
• Wrapped components
– Unito Sentence Detection
– Unito POS tagger + tokenizer
– IICT-BAS combined parser
– Resulting annotations: Sentences, Tokens, POS tags, lemmas
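The wrapper pattern (send the text to the wrapped parser, then lift its custom annotations to the common type system) might look roughly like this. Everything here is a stand-in: `external_sentence_splitter` and its output format are invented for illustration and do not describe the actual UNITO or IICT-BAS components, and `wrap` is plain Python rather than a UIMA analysis engine.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str   # type name from the common type system
    begin: int
    end: int


def external_sentence_splitter(text):
    """Stand-in for a non-UIMA parser with its own output format.

    Returns dicts with a custom kind, a start offset, and a length.
    """
    results = []
    for match in re.finditer(r"[^.!?]+[.!?]", text):
        chunk = match.group()
        skipped = len(chunk) - len(chunk.lstrip())  # leading whitespace
        results.append({
            "kind": "sent",
            "from": match.start() + skipped,
            "len": len(chunk) - skipped,
        })
    return results


# Mapping from the parser's custom kinds to common type names (assumption).
TYPE_MAP = {"sent": "Sentence"}


def wrap(parser, type_map):
    """Turn a custom parser into an engine that emits common annotations."""
    def engine(text):
        lifted = []
        for item in parser(text):  # 1. send text to the wrapped parser
            lifted.append(Annotation(  # 2. lift to the common type system
                type_map[item["kind"]],
                item["from"],
                item["from"] + item["len"],
            ))
        return lifted
    return engine


sample = "First sentence. Second one!"
sentence_engine = wrap(external_sentence_splitter, TYPE_MAP)
sentences = sentence_engine(sample)
print([sample[a.begin:a.end] for a in sentences])
```

Once the output is expressed in common types and offsets, downstream steps such as concept mapping do not need to know which parser produced the annotations.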
NLP Toolkit: EUCases Pipeline
• Example input (German): "1. Am 17. August 2011 erhob der Beschwerdeführer, ein mit einer …"
• Language detection, then one branch per language:
– DE, EN: Sentence detection, Tokenizer, Stopword Tagger, Stemmer, Morpho-Semantic Segmenter, Part-of-Speech Tagger, Chunker
– FR: Sentence detection, Tokenizer, Stopword Tagger, Stemmer, Morpho-Semantic Segmenter, Part-of-Speech Tagger
– IT: Unito Sentence Detection, Unito POS tagger + tokenizer
– BG: IICT-BAS combined parser (Sentences, Tokens, POS tags, lemmas)
– Morpho-Semantic Segmenter after the wrapped IT and BG components
• Shared steps for all languages: Regular Expression Annotator, Concept Mapper (LT Syllabus, EuroVoc, Geonames), Textrank Descriptor Extraction, Text Summary
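The routing in this pipeline (detect the language, run the language-specific branch, then run the shared steps) can be sketched as a simple dispatch. The step names are taken from the slide, but the branch contents are abbreviated, the steps are stubs that only record their name, and `run_pipeline`, `make_step`, and the fixed language detector are hypothetical.

```python
def make_step(name):
    """Create a stub step that records its name instead of doing real work."""
    def step(doc):
        doc["trace"].append(name)
    return step


# Language-specific branches (abbreviated; real engines would go here).
BRANCHES = {
    "de": [make_step("Sentence detection"), make_step("Tokenizer"),
           make_step("Part-of-Speech Tagger"), make_step("Chunker")],
    "fr": [make_step("Sentence detection"), make_step("Tokenizer"),
           make_step("Part-of-Speech Tagger")],
    "it": [make_step("Unito Sentence Detection"),
           make_step("Unito POS tagger + tokenizer")],
    "bg": [make_step("IICT-BAS combined parser")],
}
BRANCHES["en"] = BRANCHES["de"]  # DE and EN share one branch on the slide

# Steps that run for every language once the common annotations exist.
SHARED_STEPS = [
    make_step("Regular Expression Annotator"),
    make_step("Concept Mapper"),
    make_step("Textrank Descriptor Extraction"),
    make_step("Text Summary"),
]


def run_pipeline(text, detect_language):
    doc = {"text": text, "trace": []}
    doc["lang"] = detect_language(text)
    for step in BRANCHES[doc["lang"]] + SHARED_STEPS:
        step(doc)
    return doc


# The detector is passed in; here a fixed answer stands in for a real one.
result = run_pipeline("Il ricorrente ha presentato ricorso.", lambda text: "it")
print(result["lang"], result["trace"])
```

Keeping the shared steps in one list is the point of the design described on the slides: once every branch produces the same annotation types, the language-independent components are written only once.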
NLP Toolkit: Input & Output
• Akoma Ntoso XML documents delivered to pipeline
– Transform XML to plain text
– Not trivial: block elements must be separated from each other, but inline elements must not
• Insert inline annotations into Akoma Ntoso
– Annotations refer to plain text positions
– Map plain text positions to corresponding text node of XML document
• Example: <akomaNtoso><p>Entscheidung</p><p><em>Sach</em>verhalt</p></akomaNtoso>
– Plain concatenation (no linebreaks in document!): EntscheidungSachverhalt
– Each element a linebreak: Entscheidung Sach verhalt
– Block elements separated, inline elements kept together: Entscheidung Sachverhalt
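Both directions (XML to plain text, and plain-text positions back to XML text nodes) can be sketched with Python's standard `xml.etree.ElementTree`. This is an illustration under stated assumptions, not the EUCases code: only `p` is treated as a block element, namespaces are ignored as in the slide's example, and the names `to_plain_text`, `locate`, and `BLOCK_TAGS` are hypothetical.

```python
import xml.etree.ElementTree as ET

BLOCK_TAGS = {"p"}  # assumption: a real mapping would list all block elements


def to_plain_text(root):
    """Return (text, spans) for an element tree.

    spans holds (start, end, element, slot) tuples, where slot is "text" or
    "tail", so every plain-text range can be traced back to its XML text node.
    """
    parts, spans = [], []
    position = 0

    def emit(content, element, slot):
        nonlocal position
        if content:
            parts.append(content)
            spans.append((position, position + len(content), element, slot))
            position += len(content)

    def walk(element):
        nonlocal position
        emit(element.text, element, "text")
        for child in element:
            walk(child)
            emit(child.tail, child, "tail")  # text following an inline child
        if element.tag in BLOCK_TAGS:
            parts.append("\n")  # separator after block elements only
            position += 1

    walk(root)
    return "".join(parts), spans


def locate(spans, offset):
    """Map a plain-text offset to (element, slot, offset inside that node)."""
    for start, end, element, slot in spans:
        if start <= offset < end:
            return element, slot, offset - start
    raise ValueError("offset falls on a separator, not on document text")


xml = "<akomaNtoso><p>Entscheidung</p><p><em>Sach</em>verhalt</p></akomaNtoso>"
plain, spans = to_plain_text(ET.fromstring(xml))
element, slot, inner = locate(spans, plain.index("verhalt"))
print(repr(plain), element.tag, slot, inner)
```

In ElementTree, text that follows an inline element is stored in that element's `tail`, which is why "verhalt" maps to the tail of `em`. An annotation on "Sachverhalt" therefore spans two text nodes, and inserting it as inline markup has to account for both.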