IDENTIFICATION OF EMERGING TECHNOLOGIES USING ALGORITHMS OF
BIG DATA SEMANTIC ANALYSIS
Pavel Bakhtin
Deputy Head of Information and Analytical Systems Unit, ISSEK HSE
Ilya Kuzminov
Head of Information and Analytical Systems Unit, ISSEK HSE
National Research University Higher School of Economics
Institute for Statistical Studies and Economics of Knowledge (ISSEK)
XX April International Academic Conference on Economic and Social Development
Symposium “Foresight and Science, Technology & Innovation Policy”
April 10-12, 2019
Moscow
OUTLINE
▪ Role of big data in technology foresight
▪ Problems of technology identification
▪ iFORA architecture
▪ Text mining pipeline
▪ Distributional semantic model (word2vec): term embeddings
▪ Ontology-controlled classification of documents using term embeddings
▪ Case study
▪ Further research
ROLE OF BIG DATA IN TECHNOLOGY FORESIGHT
Foresight process diagram (legend: steps requiring Big Data tools)
Stages:
▪ Stage 1: Forming expert and data bases for Foresight
▪ Stage 2: Production, validation and dissemination of Foresight results
Process blocks:
▪ Forming pool of experts
▪ Data collection and preprocessing
▪ Forecast model development
▪ Expert panels
▪ Delphi survey
▪ Expert validation of Delphi
▪ System analysis, quantitative estimates
▪ PR of the Forecast results
▪ The use of the Forecast
Activities:
▪ Stakeholders matrix
▪ Bibliometric analysis
▪ Setting expert panels
▪ Requests to state (government) information systems
▪ Crawling, scanning, accumulation of data
▪ Set of preliminary analytic and forecast data
▪ Methodology
▪ Moderated discussion
▪ Validation of moderated discussion results
▪ Brainstorming
▪ Model validation
▪ Delphi questionnaire development
▪ Delphi questionnaire validation
▪ Delphi survey
▪ Structured interview
▪ In-depth interview
▪ Expert panels
▪ Trends analysis
▪ Threats and opportunities analysis
▪ Markets analysis
▪ Policy gaps analysis
▪ Analysis of products and technologies
▪ Global processes modelling
▪ Scenarios development
▪ Broad expert discussion
▪ Conferences and scientific seminars
▪ Analytic report
▪ Information portal
▪ Data upload into external informational systems
▪ Recommendations for technology and innovation policies
PROBLEMS OF TECHNOLOGY IDENTIFICATION
Expertise:
Expert knowledge alone is not enough
Scope of technology:
What do we call a technology?
Sources of data:
Patents, articles, media, other sources?
Bibliometrics:
Limitations of categories and keywords
Patent analysis:
Limitations of IPC/CPC codes
Text mining term analysis:
Authors use different terminology
iFORA ARCHITECTURE
Document sources (>350 mln documents, ≈ 30 000 documents uploaded daily):
▪ Research articles
▪ Research grants / reports
▪ International conferences
▪ Job vacancies
▪ Patents
▪ Analytical reports
▪ Educational programmes
▪ Legal documents
▪ Professional media & social networks
Documents, metadata, processing
Natural Language Processing (machine learning and computational linguistics):
▪ Text segmentation and tokenization
▪ Lemmatization, stemming, morphological analysis
▪ Named-entity recognition
▪ Syntactic parsing
▪ Co-occurrences analysis
▪ Distributional semantic model (word2vec): embeddings of terms, documents, organizations, etc.
Storage:
▪ Elasticsearch (NoSQL) database: texts and metadata of documents and results of NLP
▪ Dgraph graph (NoSQL) database
External STI ontologies
Identification and analysis of topics, trends, technologies, markets, centers of competence, etc.
TEXT MINING PIPELINE
▪ Text segmentation and tokenization: sentence 1, sentence 2, …, sentence m; word 1, word 2, …, word k
▪ Lemmatization, stemming, morphological analysis: word k → lemma, stem, part of speech, …
▪ Syntactic parsing: terms
▪ Named-entity recognition: term (organization) → entity (organization)
▪ Co-occurrences analysis: Entity 1, Entity 2, …, Entity n
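A minimal sketch of the pipeline stages above, using only the Python standard library. The sample text, the entity list, and the helpers `split_sentences`, `tokenize`, `crude_stem` and `cooccurrences` are hypothetical illustrations, not the iFORA implementation; a production system would likely rely on a dedicated NLP toolkit for lemmatization, parsing and named-entity recognition.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase word tokens; keeps letters, digits and inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def cooccurrences(sentences, entities):
    """Count pairs of known entities that appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        # Substring matching is a simplification of named-entity recognition.
        present = sorted(e for e in entities if e in lowered)
        counts.update(combinations(present, 2))
    return counts


# Hypothetical input, chosen only to exercise each stage.
text = ("Solar panels support renewable energy. "
        "Renewable energy mitigates climate change. "
        "Wind generators also provide renewable energy.")
entities = {"renewable energy", "climate change", "solar panel", "wind generator"}

sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
stems = [[crude_stem(t) for t in sent] for sent in tokens]
pairs = cooccurrences(sentences, entities)

for pair, count in pairs.most_common():
    print(pair, count)
```

Sentence-level co-occurrence is one possible window; a paragraph or a whole document could serve equally well depending on how dense the entity mentions are.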
DISTRIBUTIONAL SEMANTIC MODEL (WORD2VEC):
TERM EMBEDDINGS
word2vec network:
▪ Input: term "renewable energy" (vocabulary of 8 000 000 terms)
▪ Hidden layer: 200 neurons
▪ Output layer: 8 000 000 neurons, e.g. probability of co-occurrence with the term “climate change”
200-dimensional vector space:
▪ In such a space, the vectors of similar terms are close, e.g. "green energy", "wind generator", "solar panel"
▪ Averaging term vectors forms document vectors (Term 1, Term 2, …, Term m → Document 1, Document 2, …, Document k)
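A minimal sketch of the two ideas on this slide: similar terms lie close together, and a document vector is the average of its term vectors. The 3-dimensional vectors below are hypothetical toy values (the slide describes a 200-dimensional space), and the helper names are illustrative assumptions, not part of iFORA or of any word2vec library.

```python
import math

# Hypothetical 3-dimensional embeddings; a trained model would supply
# 200-dimensional vectors for each term.
term_vectors = {
    "renewable energy": [0.9, 0.1, 0.0],
    "green energy": [0.8, 0.2, 0.1],
    "solar panel": [0.7, 0.3, 0.0],
    "customer review": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def document_vector(terms, vectors):
    """Average the vectors of the terms found in the vocabulary."""
    known = [vectors[t] for t in terms if t in vectors]
    if not known:
        raise ValueError("document contains no known terms")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def most_similar(term, vectors, top_n=2):
    """Rank the other terms by cosine similarity to the given term."""
    query = vectors[term]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != term]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


neighbours = most_similar("renewable energy", term_vectors)
doc_vec = document_vector(["green energy", "solar panel", "unknown term"],
                          term_vectors)
print(neighbours)
print(doc_vec)
```

Terms missing from the vocabulary are skipped here rather than treated as zero vectors, so they do not pull the average toward the origin.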
ONTOLOGY-CONTROLLED CLASSIFICATION OF DOCUMENTS
USING TERM EMBEDDINGS
Document: Mining and summarizing customer reviews
Merchants selling products on the Web often ask their customers to review the products that they have purchased and the associated services. As e-commerce is becoming more and more
popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in hundreds or even thousands. This makes it difficult for a
potential customer to read them to make an informed decision on whether to purchase the product. It also makes it difficult for the manufacturer of the product to keep track and to manage
customer opinions. For the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of
products. In this research, we aim to mine and to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we only
mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. We do not summarize the reviews by selecting a
subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. Our task is performed in three steps: (1) mining product
features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the
results. This paper proposes several novel techniques to perform these tasks. Our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the
techniques.
STI ontologies: matched ontology entities
▪ Automatic summarization
▪ Data mining
▪ E-commerce
▪ Sentiment analysis
▪ Text mining
Analysis of semantic similarity between document vector and entity vectors
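A minimal sketch of how such a similarity analysis could assign ontology entities to a document: score every entity vector against the document vector and keep those above a cut-off. The vectors, the `classify` helper and the 0.6 threshold are hypothetical assumptions for illustration; the slides do not state which similarity measure or cut-off is used.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(doc_vec, entity_vecs, threshold=0.5):
    """Return (entity, score) pairs at or above the threshold, best first."""
    scored = [(name, cosine(doc_vec, vec)) for name, vec in entity_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional entity vectors; real ones would come from the
# same embedding space as the document vector.
entity_vectors = {
    "Sentiment analysis": [0.9, 0.3, 0.1],
    "Text mining": [0.7, 0.6, 0.1],
    "E-commerce": [0.2, 0.9, 0.2],
    "Atmospheric model": [0.0, 0.1, 0.9],
}
doc_vec = [0.8, 0.5, 0.1]  # hypothetical averaged term vector of the document

labels = classify(doc_vec, entity_vectors, threshold=0.6)
for name, score in labels:
    print(f"{name}: {score:.3f}")
```

A fixed threshold lets a document receive several labels or none; taking the top k entities instead would always return the same number of labels per document.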
CASE STUDY: MACHINE LEARNING 2012-2017
SEMANTIC MAP OF STI AREAS
SEMANTIC MAP OF EMERGING TECHNOLOGIES
Technology area                  Growth
Cybernetics                      8.83
Long short-term memory           7.78
Convolutional neural network     6.12
Autoencoder                      5.31
Atmospheric model                3.15
Full spectral imaging            3.00
Deep learning                    2.79
Optical imaging                  2.65
Symmetric matrix                 2.61
Pipeline transport               2.59
Network switch                   2.57
Microscopy                       2.50
Interference                     2.50
Non-cooperative game             2.44
Big data                         2.43
Graph energy                     2.40
Video game design                2.40
Measurement                      2.30
Convolutional code               2.27
Deep belief network              2.27
Restricted Boltzmann machine     2.23
Spectrogram                      2.22
Data modeling                    2.21
Von Neumann stability analysis   2.21
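The slides do not define how Growth is computed. One possible reading, shown here purely as an assumption, is the ratio of a technology area's share of documents in the last year to its share in the first year of the period. The counts, totals and topic names below are made-up placeholders, not data from the case study.

```python
def share(count, total):
    """Fraction of all documents in a year that mention the topic."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def growth(share_start, share_end):
    """Assumed growth index: end-of-period share divided by start share."""
    if share_start <= 0:
        raise ValueError("start share must be positive")
    return share_end / share_start


# Hypothetical document counts per topic and per year.
counts = {
    "topic A": {2012: 40, 2017: 300},
    "topic B": {2012: 100, 2017: 250},
}
totals = {2012: 10_000, 2017: 25_000}

ranking = sorted(
    (
        (topic, round(growth(share(c[2012], totals[2012]),
                             share(c[2017], totals[2017])), 2))
        for topic, c in counts.items()
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
for topic, value in ranking:
    print(topic, value)
```

Using shares rather than raw counts keeps the index from rewarding topics that grow only because the whole corpus grew.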
FURTHER RESEARCH
▪ Role-based classification of entities: technologies, grand challenges, products, etc.
▪ Russian-language ontology-controlled classification using term embeddings
▪ Multi-language word embeddings for benchmarking
THANK YOU FOR YOUR ATTENTION