Ilian Uzunov (Georgi Georgiev)

Scaling Semantic Technology to Increase User Engagement - FT.com

September 16th, 2015

Ontotext, Scaling Semantic Technology #1 Sept, 2015

•  Introducing Ontotext

•  Related Reads – a FT.com use case

•  What we managed to achieve

•  Hands on FT.com live

•  Positive signs across the news and media domain

•  Hands on NOW – News on the Web demo service

Outline


Why? enable better search, analytics and content delivery

What? data and content management technology: graph database engine + text-mining solutions

How? semantic analysis of text, linking text to data; NoSQL database with inference

Best for: dealing with heterogeneous dynamic data

Clients: BBC, FT, Bloomberg, DK, AstraZeneca, Wiley, etc.

Facts: 70 staff; HQ in Sofia; sales in London & New York

USP: the best semantic graph database engine; text-mining platform integrated with graph database

Company Brief


Sample RDF Graph: Data and Schema

[Figure: RDF graph combining data and schema. Data nodes: myData:Maria, myData:Ivan. Classes: ptop:Agent, ptop:Person, ptop:Woman. Properties: ptop:childOf, ptop:parentOf, owl:relativeOf. Schema terms on the edges: rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf, owl:SymmetricProperty. Some edges are marked "inferred".]
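A minimal sketch of how inferred edges in a graph like the one on this slide can be derived, written in plain Python rather than against a real triple store. The individual triples, the rdfs:subClassOf links and the rule set are assumptions reconstructed from the figure labels; this is not GraphDB's rule engine, and owl:relativeOf is kept as spelled on the slide.

```python
# Plain-Python forward chaining over (subject, predicate, object) triples.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"        # assumed: not among the extracted figure labels
SUBPROPERTY_OF = "rdfs:subPropertyOf"
RANGE = "rdfs:range"
INVERSE_OF = "owl:inverseOf"
SYMMETRIC = "owl:SymmetricProperty"

# Assumed reading of the figure; the exact edges are not recoverable from the slide text.
EXPLICIT = {
    ("myData:Maria", RDF_TYPE, "ptop:Woman"),
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),
    ("ptop:Woman", SUBCLASS_OF, "ptop:Person"),
    ("ptop:Person", SUBCLASS_OF, "ptop:Agent"),
    ("ptop:childOf", INVERSE_OF, "ptop:parentOf"),
    ("ptop:parentOf", SUBPROPERTY_OF, "owl:relativeOf"),
    ("ptop:parentOf", RANGE, "ptop:Person"),
    ("owl:relativeOf", RDF_TYPE, SYMMETRIC),
}


def infer(triples):
    """Apply the rules repeatedly until no new triple appears; return the closure."""
    closure = set(triples)
    while True:
        by_subject = {}
        for s, p, o in closure:
            by_subject.setdefault(s, []).append((p, o))
        new = set()
        for s, p, o in closure:
            if p == INVERSE_OF:                  # owl:inverseOf holds in both directions
                new.add((o, INVERSE_OF, s))
            if p == RDF_TYPE:                    # class membership is inherited upwards
                for q, y in by_subject.get(o, []):
                    if q == SUBCLASS_OF:
                        new.add((s, RDF_TYPE, y))
            for q, y in by_subject.get(p, []):   # schema statements about predicate p
                if q == INVERSE_OF:
                    new.add((o, y, s))
                elif q == SUBPROPERTY_OF:
                    new.add((s, y, o))
                elif q == RANGE:
                    new.add((o, RDF_TYPE, y))
                elif q == RDF_TYPE and y == SYMMETRIC:
                    new.add((o, p, s))
        if new <= closure:
            return closure
        closure |= new


inferred = infer(EXPLICIT) - EXPLICIT
for triple in sorted(inferred):
    print(triple)
```

Under these assumptions, the single explicit statement that Ivan is a child of Maria is enough to derive that Maria is a parent and a relative of Ivan, and that Ivan is a ptop:Person.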

Interlinking Text and Data


Semantic Annotation

[Figure: document pmid:17714090 annotated with concept identifiers umls:C0035204, umls:C0006261 and umls:C000496 (Asthma). Other labels shown: COPD, Bronchial Diseases, Respiration Disorders, Chronic Obstructive Airway Diseases, Ian A Yang, Clinical and experimental pharmacology …]
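A minimal sketch of what semantic annotation means in practice: mentions in the text are located and linked to concept identifiers. The dictionary lookup below is a simplification chosen for illustration, not Ontotext's text-mining pipeline. Only the Asthma identifier comes from the slide; the "ex:" identifiers and the sample sentence are hypothetical.

```python
import re

# Hypothetical gazetteer: lower-cased surface label -> concept identifier.
# Only the Asthma pairing appears on the slide; the "ex:" identifiers are placeholders.
GAZETTEER = {
    "asthma": "umls:C000496",
    "copd": "ex:COPD",
    "bronchial diseases": "ex:BronchialDiseases",
}


def annotate(text, gazetteer=GAZETTEER):
    """Return (start, end, surface form, concept id) for each label found in text."""
    annotations = []
    for label, concept_id in gazetteer.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((match.start(), match.end(), match.group(0), concept_id))
    return sorted(annotations)


sample = "Asthma and COPD are common bronchial diseases."
annotations = annotate(sample)
for start, end, surface, concept_id in annotations:
    print(start, end, surface, concept_id)
```

Each annotation keeps character offsets, so the link between a text span and the data it refers to can be stored alongside the document.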

Technology Portfolio


Ontotext and Financial Times

Profile

•  Top 3 business media

•  Focused both on B2C publishing and B2B services

Goals

•  Create a horizontal platform for both data and content based on semantics and serve all functionality through it

Challenges

•  Critical part of the entire workflow

•  Multiple development projects in parallel with up to 2 months' time between inception and go-live

•  Horizontal platform with focus on organizations, people, GPEs and relations between them

•  Automatic extraction of all these concepts and relationships

•  Separate stream of work for a user-behavior-based recommendation of relevant content and data across the entire media

Serve relevant articles to increase user engagement and improve usability

FT Primary Objective


Subject: User

Object: Article, Media Asset, Data, …

Action: Read, Preview, Comment, …

Subject, Object, Action
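The subject, object, action triple can be modelled as a small event record. The sketch below is illustrative only: the class names, field names and identifiers are hypothetical, and only the three actions named on the slide are included.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    READ = "read"
    PREVIEW = "preview"
    COMMENT = "comment"


@dataclass(frozen=True)
class Event:
    subject: str    # who acted: a user identifier
    obj: str        # what was acted on: an article, media asset or data item identifier
    action: Action


def read_history(events):
    """Group the objects of READ events by user, in arrival order."""
    history = {}
    for event in events:
        if event.action is Action.READ:
            history.setdefault(event.subject, []).append(event.obj)
    return history


events = [
    Event("user-1", "article-a", Action.READ),
    Event("user-1", "article-b", Action.PREVIEW),
    Event("user-2", "article-a", Action.READ),
    Event("user-1", "article-c", Action.READ),
]
history = read_history(events)
print(history)
```

Filtering on a single action type, as `read_history` does, corresponds to restricting the model to "User reads Article".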


action

Contextual Recommendation


Contextual Similarity

Behavioural Recommendation


Behavioural Similarity

User Profile

Contextual and Behavioural in Combination


Behavioural and Contextual Similarity

Reads

User Profile

Average News Article Metadata


Article

NY

promoted (popular)

updated

created

image

summary

title

ID

URL

reads

views

votes

comments

FT Article Metadata


Summary

Title

body

editorial

img:alt

people

regions

organisations

IPTC

tags

Metadata Used


Summary

Title

body

editorial

img:alt

people

regions

organisations

IPTC

tags

concepts

keyphrases

User Actions


Limited to User reads Article

reads

User Actions: Another Perspective


perform

comments

votes

posts

preview

read

contains leads to read

leads to preview

Article

Search Action

Result

Date

FTS Q. TagCat

Tag set

results

cattaxonomy

Search Log

•  Relies on the previous choices of an individual user (a user's profile)

•  Results on the basis of the similarity of items, defined in terms of their content

•  The recommended content is rather homogeneous

“Content”-based Recommendation
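A minimal sketch of content-based recommendation as described in the bullets above: a user's profile is built from the terms of previously read articles, and candidates are ranked by cosine similarity to it. The term lists and article identifiers are made up for illustration, and cosine similarity is an assumed choice of measure, not necessarily the one used in the FT deployment.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(history, candidates, k=2):
    """Rank candidate articles by similarity to the terms of previously read articles."""
    profile = Counter()
    for terms in history:
        profile.update(terms)
    scored = [(cosine(profile, Counter(terms)), article_id)
              for article_id, terms in candidates.items()]
    return [article_id for s, article_id in sorted(scored, reverse=True)[:k] if s > 0]


# Hypothetical data: terms of articles the user has read, and unread candidates.
history = [["ecb", "rates"], ["ecb", "euro"]]
candidates = {
    "a1": ["ecb", "rates", "bonds"],
    "a2": ["football", "transfer"],
    "a3": ["euro", "ecb"],
}
print(recommend(history, candidates))
```

Because every recommended item must share terms with the profile, the output stays close to what the user has already read, which is the homogeneity noted in the last bullet.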


Two-fold scoring approach

•  Similarity to recently viewed articles (context)

•  Relevance to a long-term user profile

–  Weights reflecting the relative importance of the individual terms (static component)

–  Transition likelihoods among any pair of terms (dynamic component)

Content-based Ranking Mechanisms
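One possible reading of the two-fold scoring approach, sketched under explicit assumptions: the static component is taken to be normalised term frequencies over the reading history, the dynamic component is estimated from term co-occurrence in consecutive reads, context similarity is Jaccard overlap, and the three parts are simply added. None of these choices is stated in the slides.

```python
from collections import Counter, defaultdict


def build_profile(history):
    """history: term lists of the articles a user read, oldest first."""
    counts = Counter(term for terms in history for term in terms)
    total = sum(counts.values())
    static = {term: n / total for term, n in counts.items()}

    # Count how often term b appears in the article read right after one containing a.
    pairs = defaultdict(Counter)
    for previous, following in zip(history, history[1:]):
        for a in set(previous):
            for b in set(following):
                pairs[a][b] += 1
    dynamic = {}
    for a, row in pairs.items():
        row_total = sum(row.values())
        dynamic[a] = {b: n / row_total for b, n in row.items()}
    return static, dynamic


def score(candidate, recent, static, dynamic):
    """Context similarity plus static and dynamic profile relevance (unweighted sum)."""
    candidate, recent = set(candidate), set(recent)
    union = candidate | recent
    context = len(candidate & recent) / len(union) if union else 0.0
    static_part = sum(static.get(term, 0.0) for term in candidate)
    dynamic_part = sum(dynamic.get(a, {}).get(b, 0.0) for a in recent for b in candidate)
    if recent:
        dynamic_part /= len(recent)
    return context + static_part + dynamic_part


# Hypothetical reading history; the last entry is the most recently viewed article.
history = [["ecb", "rates"], ["rates", "bonds"], ["ecb", "rates"], ["rates", "bonds"]]
static, dynamic = build_profile(history)
on_topic = score(["ecb", "rates"], history[-1], static, dynamic)
off_topic = score(["football"], history[-1], static, dynamic)
print(on_topic, off_topic)
```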


•  Rely on statistics that reflect the past choices of all users

•  Results based on user ratings, and the similarity of users or items

•  Content-agnostic

•  Aware of the quality of content

Collaborative Filtering
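A minimal item-to-item sketch of collaborative filtering based on co-visitation counts: two articles are considered similar when the same users read both, regardless of their content. The user and article identifiers are hypothetical, and raw counts are used where a production system would likely normalise them.

```python
from collections import defaultdict
from itertools import combinations


def co_visitation(reads_by_user):
    """Count, for every pair of articles, how many users read both."""
    matrix = defaultdict(lambda: defaultdict(int))
    for articles in reads_by_user.values():
        for a, b in combinations(sorted(set(articles)), 2):
            matrix[a][b] += 1
            matrix[b][a] += 1
    return matrix


def recommend(user, reads_by_user, matrix, k=3):
    """Suggest unread articles most often co-visited with the user's reads."""
    seen = set(reads_by_user.get(user, []))
    scores = defaultdict(int)
    for a in seen:
        for b, count in matrix[a].items():
            if b not in seen:
                scores[b] += count
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda b: (-scores[b], b))[:k]


# Hypothetical click data: user id -> articles read.
reads = {
    "u1": ["a", "b"],
    "u2": ["a", "b", "c"],
    "u3": ["a", "c", "d"],
    "u4": ["a"],
}
matrix = co_visitation(reads)
print(recommend("u1", reads, matrix))
```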


Collaborative Ranking Mechanisms


User to Content Similarity Score

User to User Sim. Score

Content to Content Sim. Score

•  Combines both approaches to improve the quality of prediction

•  Implemented via statistical models

•  Takes a wide array of features into consideration

Hybrid Approach
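A minimal sketch of a hybrid scorer, assuming a logistic model over a handful of features. The slides do not say which statistical model was used; the feature names, weights and bias below are invented for illustration and would normally be fitted on click data.

```python
import math

# Hypothetical hand-set parameters; a real model would learn these from data.
WEIGHTS = {"content_sim": 2.0, "collab_sim": 1.5, "popularity": 0.5, "recency": 1.0}
BIAS = -2.0


def hybrid_score(features, weights=WEIGHTS, bias=BIAS):
    """Logistic combination of content-based and collaborative features, in (0, 1)."""
    z = bias + sum(weights[name] * features.get(name, 0.0) for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates):
    """Order candidate article ids by descending hybrid score."""
    return sorted(candidates, key=lambda article_id: hybrid_score(candidates[article_id]),
                  reverse=True)


candidates = {
    "a1": {"content_sim": 0.9, "collab_sim": 0.1, "popularity": 0.2, "recency": 0.5},
    "a2": {"content_sim": 0.2, "collab_sim": 0.8, "popularity": 0.9, "recency": 0.9},
    "a3": {},
}
print(rank(candidates))
```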


Initial Architecture


Final Architecture


[Architecture diagram. Components: AWS Elastic LB; AWS INSTANCE (three); RR (three); Varnish Cache; Recommendation API; FT API; Fetch & Annotation; OWLIM Worker; SOLR 1, SOLR 2, SOLR 3; CS Node 1, CS Node 2, CS Node 3; Replication Group I.

Request flow labels: Read Article; 1. get related; 2. ask; 3. on cache miss; 4. query.

Ingestion flow labels: 1. pull content; 2. annotate; 3. index; annotate content.

Other labels: store user profiles; update popularity; click stream; update user.]

1. Pull content – annotate/enrich – index

2. Accumulate/update user profile

3. Recommend

Main Actions


Implementation Overview


Profile Update Request

(User ID, Item ID)

Query Generation

Items Index (Solr)

Profile Storage

(Cassandra)

Recommendation Request (User ID)

Profile Update

User:
- context
- static component
- dynamic component

Article:
- co-visitation matrix
- popularity

Boosted sub-queries for all involved ranking schemes: content-based, collaborative, popularity, recency
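A minimal sketch of how boosted sub-queries for the four ranking schemes could be assembled into one Solr query string. It relies only on standard Lucene/Solr query syntax (`^` boosts on terms and groups, date-math range queries); the field names `concepts`, `id`, `promoted` and `created`, the boost values and the seven-day window are assumptions, and values are assumed not to contain double quotes.

```python
def weighted_clause(field, weights):
    """OR together field:"value"^weight terms inside one group."""
    terms = ['{}:"{}"^{:.2f}'.format(field, value, weight)
             for value, weight in sorted(weights.items())]
    return "(" + " OR ".join(terms) + ")"


def build_query(profile_terms, covisited, boosts):
    """Combine one boosted sub-query per ranking scheme into a single query string."""
    subqueries = [
        (weighted_clause("concepts", profile_terms), boosts["content"]),   # content-based
        (weighted_clause("id", covisited), boosts["collaborative"]),       # collaborative
        ("(promoted:true)", boosts["popularity"]),                         # popularity
        ("(created:[NOW-7DAYS TO NOW])", boosts["recency"]),               # recency
    ]
    return " OR ".join("{}^{:.1f}".format(query, boost) for query, boost in subqueries)


query = build_query(
    profile_terms={"ecb": 2.0, "rates": 1.0},   # weighted terms from the user profile
    covisited={"art-42": 3.0},                  # co-visited article ids with counts
    boosts={"content": 2.0, "collaborative": 1.5, "popularity": 0.5, "recency": 1.0},
)
print(query)
```

Expressing every scheme as a sub-query lets the search index do the final ranking in a single request, which fits the later remark that each recommendation is effectively a personalized search.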

•  8m named entities and metadata about them

•  20m labels of People and Organisations

•  CES cluster which can be scaled horizontally to handle peak loads

•  Live dictionary updates coming from GraphDB through the EUF (Entity Update Feed) plugin

•  Max throughput - 10 docs/sec on a single c3.2xlarge AWS node; multiply by N to get the throughput of an N-node cluster

•  Reliability has been 100%, but the solution hasn't been stressed as much as we've designed it for

Wrap up - Concept Extraction Highlights
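The throughput figure above lends itself to simple capacity arithmetic. The sketch below takes the stated 10 docs/sec per node and the linear scaling claim at face value; the target rate in the example is hypothetical.

```python
import math

DOCS_PER_SEC_PER_NODE = 10  # stated maximum for a single c3.2xlarge node


def cluster_throughput(nodes):
    """Maximum docs/sec for a cluster, assuming linear scaling as the slide states."""
    return nodes * DOCS_PER_SEC_PER_NODE


def nodes_needed(target_docs_per_sec):
    """Smallest cluster size that reaches the target rate under the same assumption."""
    return math.ceil(target_docs_per_sec / DOCS_PER_SEC_PER_NODE)


print(cluster_throughput(4))   # docs/sec for a 4-node cluster
print(nodes_needed(25))        # nodes required for a hypothetical 25 docs/sec peak
```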


•  100% reliability in production for a full year (Ontotext also manages the deployment)

•  API handling 1.5m requests a day on average, up to 3m requests a day (1/3 recommendations, 1/3 logging user actions, 1/3 checking whether a user has enough history to ask for behavioural recommendations)

•  Roughly 200m recommendations served and 200m user actions tracked to date since go-live

•  450,873 documents indexed

•  No caching, since everything is effectively a personalized search request

Wrap up - Recommendation Highlights


•  GraphDB had to comply with a set of tests designed by FT and OT: Network lag, Disk Space, Disk Load, Less Memory, CPU Load, etc.

•  Comprehensive support for OWL and SPARQL

•  Efficient inference through the entire life-cycle of the data

•  High-availability cluster architecture – proven and mature for more than 5 years now

–  GraphDB's first HA implementation has been running at the BBC since 2010

–  Unmatched HA tests and transaction load benchmarks

•  FTS and NoSQL Connectors for seamless integration

Wrap up – GraphDB Highlights


•  Washington Post tests new ‘Knowledge Map’ feature: “Our ultimate goal is to mine big data to surface highly personalized and contextual data for both journalistic and native content.”

•  New York Times RnD Lab announced an experimental project “Editor”: 1) recognize a term that can be categorized, 2) link that entity to existing databases or microservices, 3) make this enriched information accessible to journalists

•  BBC Structured Journalist Manifesto. Structured journalism: 1) on the reporter side, automation helps improve a journalist’s reporting and make it less cumbersome, 2) on the audience side, semtech helps scale things that can improve the reader’s experience

Positive Signs from the News Industry


Selection of Ontotext Customers


Thanks!


We will be delighted to have a word with you after the session or later today or tomorrow!

•  Dr. Georgi Georgiev – Head of Ontotext Text Analysis Unit

•  Ilian Uzunov – Sales Director CEMEAA

•  Nikolay Krustev – GraphDB Sales Engineer

Date post:	21-Jan-2018
Category:	Data & Analytics
Upload:	semantic-web-company