
IDENTIFICATION OF EMERGING TECHNOLOGIES USING ALGORITHMS OF BIG DATA SEMANTIC ANALYSIS

Pavel Bakhtin

Deputy Head of Information and Analytical Systems Unit

ISSEK HSE

National Research University Higher School of Economics

Institute for Statistical Studies and Economics of Knowledge (ISSEK)

XX April International Academic Conference on Economic and Social Development

Symposium “Foresight and Science, Technology & Innovation Policy”

April 10-12, 2019

Moscow

Ilya Kuzminov

Head of Information and Analytical Systems Unit
ISSEK HSE

OUTLINE

▪ Role of big data in technology foresight

▪ Problems of technology identification

▪ iFORA architecture

▪ Text mining pipeline

▪ Distributional semantic model (word2vec): term embeddings

▪ Ontology-controlled classification of documents using term embeddings

▪ Case study

▪ Further research




ROLE OF BIG DATA IN TECHNOLOGY FORESIGHT




Foresight process (flowchart)

Stages:
▪ Step 1: Forming expert and data bases for Foresight
▪ Stage 2: Production, validation and dissemination of Foresight results

Process steps:
▪ Forming pool of experts
▪ Data collection and preprocessing
▪ Forecast model development
▪ Expert panels
▪ Delphi survey
▪ Expert validation of Delphi
▪ System analysis, quantitative estimates
▪ PR of the Forecast results
▪ The use of the Forecast

Methods and outputs:
▪ Stakeholders matrix
▪ Bibliometric analysis
▪ Setting expert panels
▪ Requests to state (government) information systems
▪ Crawling, scanning, accumulation of data
▪ Set of preliminary analytic and forecast data
▪ Methodology
▪ Moderated discussion
▪ Validation of moderated discussion results
▪ Brainstorming
▪ Model validation
▪ Delphi questionnaire development
▪ Delphi questionnaire validation
▪ Delphi survey
▪ Structured interview
▪ In-depth interview
▪ Expert panels
▪ Trends analysis
▪ Threats and opportunities analysis
▪ Markets analysis
▪ Policy gaps analysis
▪ Analysis of products and technologies
▪ Global processes modelling
▪ Scenarios development
▪ Broad expert discussion
▪ Conferences and scientific seminars
▪ Analytic report
▪ Information portal
▪ Data upload into external informational systems
▪ Recommendations for technology and innovation policies

Legend: Steps requiring Big Data tools




PROBLEMS OF TECHNOLOGY IDENTIFICATION

▪ Expertise: Expert knowledge alone is not enough
▪ Scope of technology: What do we call a technology?
▪ Sources of data: Patents, articles, media, other sources?
▪ Bibliometrics: Limitations of category and keywords
▪ Patent analysis: Limitations of IPC/CPC codes
▪ Text mining term analysis: Authors use different terminology

IFORA ARCHITECTURE


Data sources (>350 mln documents, ≈ 30 000 documents uploaded daily):
▪ Research articles
▪ Research grants / reports
▪ International conferences
▪ Job vacancies
▪ Patents
▪ Analytical reports
▪ Educational programmes
▪ Legal documents
▪ Professional media & social networks

Processing:
▪ Documents Metadata Processing
▪ Natural Language Processing (Machine learning and computational linguistics): Text segmentation and tokenization; Lemmatization, stemming, morphological analysis; Named-entity recognition; Syntactic parsing; Co-occurrences analysis
▪ Distributional semantic model (word2vec): Embeddings of terms, documents, organizations, etc.

Storage:
▪ Elasticsearch (NoSQL) database: Texts and metadata of documents and results of NLP
▪ Dgraph: Graph (NoSQL) database

External STI ontologies

Identification and analysis of topics, trends, technologies, markets, centers of competence, etc.

TEXT MINING PIPELINE


Steps:
▪ Text segmentation and tokenization
▪ Lemmatization, stemming, morphological analysis
▪ Syntactic parsing
▪ Named-entity recognition
▪ Co-occurrences analysis

Intermediate results shown in the diagram:
▪ sentence 1, sentence 2, …, sentence m
▪ word 1, word 2, …, word k
▪ word k: lemma, stem, part of speech, …
▪ terms
▪ term (organization), entity (organization)
▪ Co-occurrences analysis: Entity 1, Entity 2, …, Entity n
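The pipeline steps above can be sketched in a few lines of Python. This is only a toy illustration, not the iFORA implementation: the regular-expression sentence splitter and tokenizer, the hypothetical `KNOWN_ENTITIES` dictionary standing in for named-entity recognition, and the choice to count co-occurrences at sentence level are all assumptions. Lemmatization and syntactic parsing are omitted.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity dictionary standing in for a real NER model (assumption).
KNOWN_ENTITIES = {"renewable energy", "solar panel", "climate change"}


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Naive tokenization: lowercase alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def find_entities(tokens):
    """Dictionary lookup over unigrams and bigrams (toy replacement for NER)."""
    candidates = set(tokens) | {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return sorted(candidates & KNOWN_ENTITIES)


def cooccurrences(text):
    """Count how often two entities appear in the same sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        entities = find_entities(tokenize(sentence))
        counts.update(combinations(entities, 2))
    return counts


text = ("Renewable energy slows climate change. "
        "A solar panel is a source of renewable energy.")
pairs = cooccurrences(text)
```

Here `pairs` maps alphabetically ordered entity pairs to the number of sentences in which both occur. A production pipeline would replace each helper with a trained model.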

DISTRIBUTIONAL SEMANTIC MODEL (WORD2VEC): TERM EMBEDDINGS


Neural network (diagram):
▪ Input: Term "renewable energy"
▪ Input layer: 8 000 000 terms (8 000 000 neurons)
▪ Hidden layer: 200 neurons
▪ Output: Probability of co-occurrence with the term “climate change”
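Assuming the standard skip-gram formulation of word2vec with a full softmax output (the slides do not state which variant or training objective is used), the output probability can be written as below. The symbols $v_t$, $u_c$ and $V$ are notation introduced here for illustration: $v_t$ is the 200-dimensional hidden-layer vector of the input term $t$, $u_c$ is the output weight vector of the term $c$, and $V$ is the vocabulary size.

```latex
% Assumption: skip-gram with full softmax; notation is illustrative, not from the slides.
P(c \mid t) = \frac{\exp\left(u_c^{\top} v_t\right)}{\sum_{w=1}^{V} \exp\left(u_w^{\top} v_t\right)},
\qquad v_t \in \mathbb{R}^{200}, \quad V = 8\,000\,000
```

With $t$ = "renewable energy" and $c$ = “climate change”, this is the probability shown at the output of the network. Word2vec implementations usually approximate the sum in the denominator with negative sampling or hierarchical softmax rather than computing it over the whole vocabulary.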

200-dimensional vector space
▪ In such a space, the vectors of similar terms are close
▪ Example terms: "green energy", "wind generator", "solar panel"
▪ Averaging term vectors (Term 1, Term 2, …, Term m) forms document vectors (Document 1, Document 2, …, Document k)
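A minimal sketch of forming a document vector by averaging term vectors, and of measuring closeness in the vector space. The 3-dimensional vectors are made-up toy values (the slides use 200 dimensions), cosine similarity is an assumed closeness measure, and skipping terms without an embedding is an assumed design choice.

```python
import math

# Toy 3-dimensional term vectors with assumed values, for illustration only.
TERM_VECTORS = {
    "renewable energy": [0.9, 0.1, 0.0],
    "green energy": [0.8, 0.2, 0.1],
    "solar panel": [0.7, 0.3, 0.0],
    "video game": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def document_vector(terms, vectors=TERM_VECTORS):
    """Average the vectors of the document's terms that have an embedding."""
    known = [vectors[t] for t in terms if t in vectors]
    if not known:
        raise ValueError("document contains no terms with a known embedding")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


doc = document_vector(["green energy", "solar panel", "unknown term"])
sim_energy = cosine(doc, TERM_VECTORS["renewable energy"])
sim_game = cosine(doc, TERM_VECTORS["video game"])
```

With these toy values the document about green energy and solar panels ends up much closer to "renewable energy" than to "video game".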

ONTOLOGY-CONTROLLED CLASSIFICATION OF DOCUMENTS USING TERM EMBEDDINGS




Document: Mining and summarizing customer reviews

Merchants selling products on the Web often ask their customers to review the products that they have purchased and the associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in hundreds or even thousands. This makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. It also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. For the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. In this research, we aim to mine and to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. We do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. Our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. This paper proposes several novel techniques to perform these tasks. Our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques.

STI ontologies (ontology entities):
▪ Automatic summarization
▪ Data mining
▪ E-commerce
▪ Sentiment analysis
▪ Text mining

Analysis of semantic similarity between document vector and entity vectors
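A minimal sketch of this similarity analysis: each ontology entity has a vector in the same space as the document, and the document is labeled with the entities whose vectors are most similar to its own. The vectors are made-up toy values, and both cosine similarity and the fixed `threshold` are assumptions. The slides do not say how similarity is measured or how many entities are assigned per document.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(doc_vector, entity_vectors, threshold=0.5):
    """Return (entity, similarity) pairs at or above the threshold, best first.

    The threshold is a hypothetical parameter chosen for this sketch.
    """
    scored = [(name, cosine(doc_vector, vec)) for name, vec in entity_vectors.items()]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional entity vectors with assumed values, for illustration only.
ENTITY_VECTORS = {
    "Sentiment analysis": [0.9, 0.4, 0.1],
    "Text mining": [0.8, 0.5, 0.2],
    "Pipeline transport": [0.0, 0.1, 0.9],
}

doc_vector = [0.85, 0.45, 0.1]  # e.g. an averaged term vector of the document
labels = classify(doc_vector, ENTITY_VECTORS)
```

With these toy values the document is assigned "Sentiment analysis" and "Text mining", while "Pipeline transport" falls below the threshold. Returning the top-k entities instead of applying a threshold would be an equally plausible design.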

CASE STUDY: MACHINE LEARNING 2012-2017




SEMANTIC MAP OF STI AREAS




SEMANTIC MAP OF EMERGING TECHNOLOGIES

Technology area                   Growth
Cybernetics                       8.83
Long short term memory            7.78
Convolutional neural network      6.12
Autoencoder                       5.31
Atmospheric model                 3.15
Full spectral imaging             3.00
Deep learning                     2.79
Optical imaging                   2.65
Symmetric matrix                  2.61
Pipeline transport                2.59
Network switch                    2.57
Microscopy                        2.50
Interference                      2.50
Non-cooperative game              2.44
Big data                          2.43
Graph energy                      2.40
Video game design                 2.40
Measurement                       2.30
Convolutional code                2.27
Deep belief network               2.27
Restricted Boltzmann machine      2.23
Spectrogram                       2.22
Data modeling                     2.21
Von Neumann stability analysis    2.21




FURTHER RESEARCH

▪ Role-based classification of entities: technologies, grand challenges, products, etc.

▪ Russian language ontology-controlled classification using term embeddings

▪ Multi-language word embeddings for benchmarking

THANK YOU FOR YOUR ATTENTION



