SMALL TEXTS FOR BIG DATA
Contributors: L. Petit, S. de Amo, S. Bras
Claudia Roncancio & Cyril Labbé, Université Grenoble Alpes, Laboratoire LIG, France
AMW 2014, Alberto Mendelzon International Workshop
on Foundations of Data Management
From Cartagena de Indias...
to the French Alps…
Grenoble, capital of the Alps
Ubiquitous Information Systems
• Information Systems (IS) anywhere and anytime
  - Content provided by and/or managed by mobile or embedded devices
  - Deployed in “offices” and in “real-world” environments
  - Users and devices may move
• Large-scale distributed information systems
  - Increased maturity of networks, hardware, data management…
• Users expect appropriate information management for “every thing”
Data sources in UIS
• Heterogeneous sources of persistent data
  - DBMS, DFS, …
• Heterogeneous sources of data streams
  - Devices, sensors, applications… almost everything
  - A data stream is a (potentially unbounded) sequence of data items
  - “Transactional” data streams: stock exchange, web access, telecommunication, social data
  - Measurement data streams: monitoring the evolution of entity states
• “Data never sleeps”… Big data!
Taking advantage of such data…
• Many information-based services
  - Loosely coupled cooperation between services
  - User-centered point of view?
• Not so easy!
• Ever more service personalization
This talk
• Information extraction customization
  - To fit the user’s current profile and interests
  - To adapt content & form
• Personalization and context awareness
  - User preferences and query answer customization
  - Answers on available user devices
• Ad-hoc abstracts of data to facilitate stream data monitoring
  - Contextual preferences
• Short texts which summarize (in natural language) the result of continuous complex data monitoring
  - Shared in social networks
  - Delivered to personal devices in various contexts, e.g. listening to summaries while driving
  - Facilitate monitoring, even for disabled users
Data monitoring
Personalized summaries
Continuous queries & contextual preferences
Summaries in natural language
Running example: model & schema
Running example: continuous queries
[Q1] Every day, a summary concerning the stock Total over the last two days.
[Q2] Every hour, a summary, over the last hour, for the category IT.
[Q3] Every hour, a summary, over the last hour, of the 100 transactions I prefer concerning category ’IT’.
Astral stream algebra
• Continuous and one-shot queries on persistent and real-time data
  - Streams and (temporal) relations
• Streamer Is(R): stream of the tuples inserted in R
• Windows (positional, temporal, cross)
  - S[L]: last arrived tuple of a stream S
  - S[N slide d]: sliding window of size N
• Joins and semi-sensitive join
(Petit 2012)
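The positional windows above can be illustrated with a minimal Python sketch. It assumes one particular reading of S[N slide d]: the window holds the last N tuples and is re-evaluated after every d arrivals. The function names are hypothetical and this is not the Astral implementation.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sliding_window(stream: Iterable[T], size: int, slide: int) -> Iterator[List[T]]:
    """Yield the last `size` items of `stream` after every `slide` arrivals.

    Assumed reading of S[N slide d] with N = size and d = slide.
    """
    if size <= 0 or slide <= 0:
        raise ValueError("size and slide must be positive")
    buffer: Deque[T] = deque(maxlen=size)  # keeps only the most recent items
    for count, item in enumerate(stream, start=1):
        buffer.append(item)
        if count % slide == 0:
            yield list(buffer)


def last_tuple(stream: Iterable[T]) -> Iterator[List[T]]:
    """S[L]: the last arrived tuple, seen here as a window of size 1 sliding by 1."""
    return sliding_window(stream, size=1, slide=1)
```

With these assumptions, `list(sliding_window(range(1, 7), size=3, slide=2))` produces one window after the 2nd, 4th and 6th arrival.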
Astral stream algebra - example
[Q1] Every day, a summary concerning the stock Total over the last two days.
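The algebraic expression of Q1 did not survive on this slide. As a rough illustration only, the sketch below evaluates a query of the same shape in plain Python: for each day, it aggregates the quotes of one stock over that day and the previous one. The `Quote` type, the integer day index and the choice of min/max/avg aggregates are assumptions made for the example.

```python
from statistics import mean
from typing import Dict, List, NamedTuple


class Quote(NamedTuple):
    day: int      # assumption: timestamps reduced to an integer day index
    stock: str
    price: float


def daily_two_day_summary(quotes: List[Quote], stock: str) -> Dict[int, Dict[str, float]]:
    """For each day with data, summarize `stock` over that day and the day before."""
    selected = [q for q in quotes if q.stock == stock]
    summaries: Dict[int, Dict[str, float]] = {}
    for day in sorted({q.day for q in selected}):
        # Temporal window of two days, re-evaluated once per day.
        window = [q.price for q in selected if day - 1 <= q.day <= day]
        summaries[day] = {"min": min(window), "max": max(window), "avg": mean(window)}
    return summaries
```

Each entry of the result is a small structured summary (one value per aggregate), the kind of input a text generation step could later verbalize.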
Aggregations for summaries
Customization & preferences
• How to express user preferences?
  - What do I prefer?
  - User preferences may depend on the context…
  - Context: anything that can influence the choice
Running example: continuous queries
[Q1] Every day, a summary concerning the stock Total over the last two days.
[Q2] Every hour, a summary, over the last hour, for the category IT among the 100 transactions which most fulfill my preferences.
[Q3] Every hour, a summary, over the last hour, of the 100 transactions I prefer concerning category ’IT’.
Running example: user preferences
• Personalized ranking
[P1] For stock options in category Commodities, Luc prefers a volatility-rate < 0.25. But for category IT, Luc prefers a volatility-rate > 0.35.
[P2] For stock options with volatility > 0.35, Luc prefers those from Brazil to French ones.
[P3] For stock options with volatility < 0.35, Luc prefers transactions involving at least 1000 shares.
Contextual preferences
CP-rules (De Amo 2011, Wilson 2004):
When condition U is true, I prefer Q1(X) over Q2(X), with attributes W equal (ceteris paribus).
• U = P1(Y1), …, Pn(Yn), where each Pi is a unary predicate and Yi is not in W. E.g. Yi > 5
• Q1 and Q2 are unary predicates with Q1(X) ∩ Q2(X) = ∅. E.g. (X < 3) > (X > 4)
Contextual preferences to rank data
[P1] For stock options in category Commodities, Luc prefers a volatility-rate < 0.25. But for category IT, Luc prefers a volatility-rate > 0.35.
[P2] For stock options with volatility-rate > 0.35, Luc prefers those from Brazil to French ones.
Formal expression: a cp-theory over the schema
T(SOName, Cat, Country, ETime, Rate, Method)
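To make the shape of such rules concrete, here is a minimal Python sketch that encodes P1 and P2 as three rules over the schema T. Several points are assumptions for illustration: the less-preferred side of P1 is taken to be the complement of the preferred range, `equal_on` models the attributes held equal and is left empty here, and the class and constant names are hypothetical. The three rows are taken from the talk's data sample.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class CPRule:
    condition: Callable[[Row], bool]  # U: context that both tuples must satisfy
    better: Callable[[Row], bool]     # Q1(X): preferred values
    worse: Callable[[Row], bool]      # Q2(X): less preferred values, disjoint from Q1
    equal_on: Tuple[str, ...] = ()    # attributes that must agree (ceteris paribus)

    def prefers(self, a: Row, b: Row) -> bool:
        """True if this rule alone ranks tuple `a` above tuple `b`."""
        return (
            self.condition(a) and self.condition(b)
            and self.better(a) and self.worse(b)
            and all(a[k] == b[k] for k in self.equal_on)
        )


# P1, first half: for Commodities, a rate below 0.25 is preferred (complement assumed worse).
PHI1 = CPRule(lambda r: r["Cat"] == "Commodities",
              lambda r: r["Rate"] < 0.25, lambda r: r["Rate"] >= 0.25)
# P1, second half: for IT, a rate above 0.35 is preferred (complement assumed worse).
PHI2 = CPRule(lambda r: r["Cat"] == "IT",
              lambda r: r["Rate"] > 0.35, lambda r: r["Rate"] <= 0.35)
# P2: for rates above 0.35, Brazil is preferred to France.
PHI3 = CPRule(lambda r: r["Rate"] > 0.35,
              lambda r: r["Country"] == "Brazil", lambda r: r["Country"] == "France")

D3 = {"SOName": "USSteel", "Cat": "Commodities", "Country": "USA", "ETime": "t1", "Rate": 0.20, "Method": "M2"}
D4 = {"SOName": "Petr4", "Cat": "Commodities", "Country": "Brazil", "ETime": "t2", "Rate": 0.40, "Method": "M2"}
D5 = {"SOName": "Bel5", "Cat": "Investments", "Country": "France", "ETime": "t3", "Rate": 0.55, "Method": "M2"}
```

Under these assumptions, `PHI1.prefers(D3, D4)` and `PHI3.prefers(D4, D5)` both hold, while the reverse comparisons do not.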
Preferences & the algebra
• One cp-rule φ defines a binary relation
• A cp-theory induces a partial order
• Preferences on any data (streams or relations)
• Preference operators
  - Best: non-dominated tuples
  - KBest: top-k
(Petit 2013)
Data sample
D   SOName   Category     Country  ETime  Rate  Method
d1  MS       IT           USA      t1     0.30  M1
d2  AP       IT           India    t2     0.55  M1
d3  USSteel  Commodities  USA      t1     0.20  M2
d4  Petr4    Commodities  Brazil   t2     0.40  M2
d5  Bel5     Investments  France   t3     0.55  M2

[Figure: graph over the tuples d1 to d5, with edges labeled φ1, φ2 and φ3]
Preferences & the algebra
• One cp-rule φ implies a partial order
• A cp-theory implies a partial order
• Preferences on any data (streams or relations)
• Preference operators
  - Best: non-dominated tuples
  - KBest: top-k
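The two preference operators can be sketched in Python as follows. The sketch assumes that the partial order is the transitive closure of the direct preference pairs, and that KBest peels off successive layers of non-dominated tuples until k are collected; the slides only say "top-k", so this layering is one possible reading. The function names and the example pairs are illustrative.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (preferred, less preferred)


def transitive_closure(edges: Iterable[Edge]) -> Set[Edge]:
    """Extend direct preference pairs to everything they imply by transitivity."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def best(items: List[str], edges: Iterable[Edge]) -> List[str]:
    """Items that no other item of `items` dominates."""
    order = transitive_closure(edges)
    present = set(items)
    return [x for x in items if not any((y, x) in order for y in present)]


def kbest(items: List[str], edges: Iterable[Edge], k: int) -> List[str]:
    """Collect non-dominated layers until k items are found (ties cut in input order)."""
    edge_list = list(edges)
    remaining = list(items)
    result: List[str] = []
    while remaining and len(result) < k:
        layer = best(remaining, edge_list)
        if not layer:  # cyclic preferences: nothing is non-dominated
            break
        result.extend(layer)
        remaining = [x for x in remaining if x not in layer]
    return result[:k]
```

For instance, with the assumed pairs `[("d2", "d1"), ("d3", "d4"), ("d4", "d5")]` over d1 to d5, `best` keeps d2 and d3, and `kbest` with k = 3 adds d1 from the next layer.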
Personalized summaries
Summaries in natural language
Stream2text: Text generation
Natural language generation
• SimpleNLG: a library used to generate grammatically correct sentences (Gatt & Reiter, U. Aberdeen)
• SimpleNLG-EnFr: English & French (Vaudry, U. Montreal)
• Realiser for a simple grammar
  - Orthography
  - Morphology: handling inflected forms, gender, tense, number or person
  - Ensuring grammatical correctness: enforcing noun-verb agreement, creating well-formed verb groups…
(Gatt 2009)
Text generation - example
Dictionary of Concepts
Dictionary of aggregation functions
Transcription operator
Text organization (microplanning)
• Performed by the transcriptionist
• One paragraph per entity
• One sentence per value of the structured summary
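The text organization described above (one paragraph per entity, one sentence per value) can be sketched with plain string templates. This is only an illustration: it does not use SimpleNLG, and the phrase dictionary, the function name and the sentence template are hypothetical.

```python
from typing import Dict

# Hypothetical dictionary of aggregation functions: aggregate name -> wording.
AGGREGATE_PHRASES: Dict[str, str] = {
    "min": "minimum",
    "max": "maximum",
    "avg": "average",
}


def transcribe(summary: Dict[str, Dict[str, float]], concept: str) -> str:
    """Turn a structured summary {entity: {aggregate: value}} into short text.

    One paragraph per entity, one sentence per aggregated value.
    """
    paragraphs = []
    for entity, values in summary.items():
        sentences = [
            f"The {AGGREGATE_PHRASES[agg]} {concept} of {entity} was {value:.2f}."
            for agg, value in values.items()
        ]
        paragraphs.append(" ".join(sentences))
    return "\n\n".join(paragraphs)
```

A realiser such as SimpleNLG would replace the fixed template here, taking care of agreement and inflection instead of relying on a hard-coded sentence pattern.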
Example of summary
Conclusion and future work
• Personalized summaries for complex data monitoring
  - Streams & persistent data
  - Contextual user preferences
  - Text summary in natural language (French & English)
  - …Text streams!
• Textual summaries facilitate
  - Sharing in social networks
  - Access through mobile devices
  - Using text-to-speech software
Conclusion and future work
• Experimentation (stock exchange, NBA)
  - Assessing the text sentence aggregation
  - Text updates
• Social streams: sentiment analysis
  - Analysis of small texts (phrases, tweets, SMS messages)
  - Identify a polarity (positive, negative or neutral)
  - More work to summarize the global sentiments
  - To give a (fair and non-biased) picture of the global sentiments
• Complex events
• Production of texts referring to present and past situations