Date post: 14-Aug-2020
Information extraction from business documents with machine learning
Transcript
Page 1: Information extraction from business documents with ... · CV information extraction Machine Learning Algorithms •Personal information •Skills •Education •Work experience

Information extraction from business documents with machine learning

Page 2

Contents

Data and Text Mining: page 3

CV parsing project: page 7

Programming approach: page 11

Page 3

Data and Text Mining: some definitions

• Text Mining is a special form of Data Mining applied to “unstructured” texts (press agency releases, web pages, e-mails, etc.) and, more generally, to any document corpus.

• Data Mining is the computational process of discovering patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics and database systems.

Page 4

Text Mining: workflow

• A Text Mining process is generally structured in four phases:

1. Data acquisition
2. Preprocessing
3. Modeling
4. Results validation
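The slides do not say how the last phase, results validation, is carried out. A common choice is to compare the model's output with hand-labeled data, for instance through precision and recall. The sketch below assumes exactly that; the function name, the labels and the toy data are hypothetical and only serve as illustration.

```python
def precision_recall(predicted, expected, positive):
    """Compute precision and recall for one class from two parallel label lists."""
    pairs = list(zip(predicted, expected))
    true_pos = sum(1 for p, e in pairs if p == positive and e == positive)
    false_pos = sum(1 for p, e in pairs if p == positive and e != positive)
    false_neg = sum(1 for p, e in pairs if p != positive and e == positive)
    # Guard against empty denominators when a class is never predicted or never expected.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy data: labels assigned by a model versus labels assigned by a human annotator.
predicted = ["skill", "skill", "other", "skill", "other"]
expected = ["skill", "other", "other", "skill", "skill"]
precision, recall = precision_recall(predicted, expected, positive="skill")
```

Precision answers "how many of the extracted items were right", recall answers "how many of the right items were extracted"; which one matters more depends on the use case.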

Page 5

Text Mining: preprocessing

• In the preprocessing phase, linguistic analysis is performed, together with every other step needed to arrive at a vector representation of the document. In particular:

1. POS tagging
2. Lemmatization/Stemming
3. Definition of stop-words
4. Dimensionality reduction
5. Meta-information integration
6. ...
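To make the path from raw text to a vector concrete, here is a minimal sketch covering some of the steps listed above: stop-word removal, stemming, a simple dimensionality reduction and the final term-count vector. It uses only the Python standard library. The stop-word list, the crude suffix stripper (a stand-in for a real stemmer or lemmatizer) and the document-frequency cut-off are all assumptions for illustration, not the tools used in the project; POS tagging and meta-information are left out.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real project would use a curated list per language.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "with"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop-words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def build_vocabulary(token_lists, min_doc_freq=1):
    """Keep terms found in at least min_doc_freq documents (a simple dimensionality reduction)."""
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return sorted(term for term, df in doc_freq.items() if df >= min_doc_freq)


def vectorize(tokens, vocabulary):
    """Turn a token list into a term-count vector aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


docs = [
    "Managing databases and programming in Python",
    "Programming skills: Python, databases, statistics",
]
token_lists = [preprocess(d) for d in docs]
vocabulary = build_vocabulary(token_lists, min_doc_freq=2)
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]
```

With `min_doc_freq=2`, terms that appear in only one of the two documents are discarded, so both documents end up as vectors over the same three shared stems.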

Page 6

Text Mining: modeling

• In the modeling phase, the vectorized documents are fed to a machine learning algorithm chosen for the target task.
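This slide does not name a specific algorithm. As an illustration of what "feeding vectors to a learner" can look like, the sketch below assumes a nearest-centroid classifier, one of the simplest supervised models; the vocabulary, the labels and the training vectors are hypothetical toy data.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(vectors, labels):
    """Compute one centroid per label from the training vectors."""
    by_label = {}
    for vector, label in zip(vectors, labels):
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(group) for label, group in by_label.items()}


def predict(model, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], vector))


# Toy term-count vectors over a hypothetical vocabulary ["python", "degree", "university"].
train_vectors = [[2, 0, 0], [1, 0, 1], [0, 1, 2], [0, 2, 1]]
train_labels = ["skills", "skills", "education", "education"]

model = train(train_vectors, train_labels)
prediction = predict(model, [1, 0, 0])
```

Any other learner could be swapped in at this point, as long as it consumes the same vector representation produced by preprocessing.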

Page 7

CV parsing project: overview

Description

Aim

• Information extraction

– Automatically analyze business documents as they flow through business communication channels

– Extract information from unstructured documents in Italian

Data

• Textual data

– Test: CVs

– To be: enterprise documents for compliance checks

Page 8

CV information extraction: machine learning algorithms

• We use a combination of unsupervised and supervised methods to extract information from unstructured documents in Italian.

• Machine learning tools: a combination of unsupervised and supervised classifiers decides whether a piece of text represents a certain piece of information or not.

Diagram: CV → machine learning tools → information classes

Information classes

• Personal information

• Skills

• Education

• Work experience

Page 9

Classification method: machine learning algorithms in detail

• We apply a three-step classification, in which the methods of each step create the features for the classifiers of the following step.

1st step: custom task features, Named Entity Recognition (Stanford NER tool), word embeddings (Word2vec)

• Every piece of text is tagged with different methods

• Every word is enriched with the information extracted by each method

2nd step: supervised neural network classifier

• For every word, the probability that it represents a certain piece of information (e.g. name, surname, skill, ...) is calculated

3rd step: information class association (e.g. threshold classifier)

• For each information class, we choose the words with high probability
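The three-step scheme can be sketched in a few lines of Python. Everything below is a hypothetical stand-in: the set lookups replace the Stanford NER tool and the Word2vec features, and the hand-set logistic weights replace the trained neural network. Only the overall shape (features, then per-class probabilities, then a threshold) follows the slide.

```python
import math

CLASSES = ("name", "skill")

# Hypothetical lookups standing in for an NER tool and for embedding-based features.
PERSON_NAMES = {"mario", "rossi"}
KNOWN_SKILLS = {"python", "java"}


def step1_features(word):
    """1st step: enrich a word with the output of several independent taggers."""
    return {
        "is_capitalized": float(word[:1].isupper()),           # custom task feature
        "ner_person": float(word.lower() in PERSON_NAMES),     # stand-in for NER output
        "in_skill_list": float(word.lower() in KNOWN_SKILLS),  # stand-in for embedding similarity
    }


# Hand-set weights standing in for a trained supervised neural network.
WEIGHTS = {
    "name": {"is_capitalized": 1.0, "ner_person": 4.0, "in_skill_list": -3.0, "bias": -3.0},
    "skill": {"is_capitalized": 0.0, "ner_person": -3.0, "in_skill_list": 5.0, "bias": -2.0},
}


def step2_probabilities(features):
    """2nd step: score each information class with a logistic function of the features."""
    probs = {}
    for cls in CLASSES:
        w = WEIGHTS[cls]
        z = w["bias"] + sum(w[name] * value for name, value in features.items())
        probs[cls] = 1.0 / (1.0 + math.exp(-z))
    return probs


def step3_associate(words, threshold=0.5):
    """3rd step: keep, for each class, the words whose probability exceeds the threshold."""
    selected = {cls: [] for cls in CLASSES}
    for word in words:
        probs = step2_probabilities(step1_features(word))
        for cls, p in probs.items():
            if p > threshold:
                selected[cls].append(word)
    return selected


result = step3_associate(["Mario", "Rossi", "knows", "Python"])
```

Keeping the three steps as separate functions mirrors the slide: each step only consumes what the previous one produced, so a tagger or the classifier can be replaced without touching the rest.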

Page 10

Final Product

Business Document Analyzer

• Frontend

• Machine Learning Kernel

Page 11

Programming approach: research and industry approaches

Research

• Ease of implementation

• Testing many different models

• Availability of scientific libraries

• Flexibility

Production

• Stability: running reliably for a long time

• Scalability to different amounts of data

• Robustness to a range of different conditions

• Integration with the company infrastructure

Page 12

From R&D to production code: the challenge of building a product

• Writing production code:

1. Engineer the product for expandability and maintainability
2. Choose the right tools and programming languages
3. Optimize (memory footprint and execution speed)
4. Adapt to specific hardware (e.g. GPUs, clusters)
5. Adapt to specific libraries (e.g. distributed computing libraries)
6. Integrate the entire product pipeline (data management, web interface, integration with third-party services)

Page 13

Conclusion

The mission

• Companies' interests

• Investments for applied research projects

• Publications & Prototypes

• Software Engineering

• Finished products

Page 14

Niccolò Fava, Software Developer

[email protected]

