Date post: 13-Aug-2015
Upload: swami06
QueRIE: Collaborative Database Exploration
Page 1

QueRIE: Collaborative Database Exploration

Page 2

Relational database users employ a query interface (typically, a web-based client) to issue a series of SQL queries that aim to analyse the data and mine it for interesting information.

First-time users may not have the necessary knowledge to know where to start their exploration.

Other times, users may simply overlook queries that retrieve important information.

In this work we describe a framework to assist non-expert users by providing personalized query recommendations.

Abstract

Page 3

Web-based query interfaces are used by scientific databases such as the Genome Browser (http://genome.ucsc.edu/) and SkyServer (http://cas.sdss.org/).

Personalized recommendations for keyword or free-form query interfaces

A multidimensional query recommendation system addresses the problem of generating recommendations for data warehouses and OLAP systems.

Recommendation based on past queries using the most frequently appearing tuple values.

Literature Survey

Page 4

Literature Survey Continued...

1. Hive - A petabyte scale data warehouse using Hadoop (A. Thusoo et al.):

Hadoop is a popular open-source map-reduce implementation used by companies such as Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs that are hard to maintain and reuse. Hive is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop.

Literature Survey Continued...

Page 5

2. QueRIE: A recommender system supporting interactive database exploration (S. Mittal, J. S. V. Varman):

This demonstration presents QueRIE, a recommender system that supports interactive database exploration. This system aims at assisting non-expert users of scientific databases by generating personalized query recommendations. Drawing inspiration from Web recommender systems, QueRIE tracks the querying behavior of each user and identifies potentially “interesting” parts of the database related to the corresponding data analysis task by locating those database parts that were accessed by similar users in the past. It then generates and recommends the queries that cover those parts to the user.

Literature Survey Continued...

Page 6

3. Amazon.com recommendations: Item-to-item collaborative filtering (G. Linden, B. Smith, and J. York):

At Amazon.com, we use recommendation algorithms to personalize the online store for each customer. The store radically changes based on customer interests, showing programming titles to a software engineer and baby toys to a new mother. There are three common approaches to solving the recommendation problem: traditional collaborative filtering, cluster models, and search-based methods. Here, we compare these methods with our algorithm, which we call item-to-item collaborative filtering. Unlike traditional collaborative filtering, our algorithm's online computation scales independently of the number of customers and number of items in the product catalog. Our algorithm produces recommendations in real-time, scales to massive data sets, and generates high quality recommendations.

Literature Survey Continued...

Page 7

4. Personalized queries under a generalized preference model (G. Koutrika and Y. Ioannidis):

In this paper, we present a preference model that combines expressivity and concision. In addition, we provide efficient algorithms for the selection of preferences related to a query, and an algorithm for the progressive generation of personalized results, which are ranked based on user interest. Several classes of ranking functions are provided for this purpose. We present results of experiments both synthetic and with real users (a) demonstrating the efficiency of our algorithms, (b) showing the benefits of query personalization, and (c) providing insight as to the appropriateness of the proposed ranking functions.

Literature Survey Continued...

Page 8

The basic idea behind this project is:

1. Use the user query log to compute a "session summary".
2. Based on the session summary, generate the target tuples.
3. Generate recommended queries retrieving the target tuples.
4. Re-rank the results based on clarity scores.

Proposed Solution

Page 9

Architecture

Re-Ranking based on KL-Divergence

Page 10

Fragment-Based Recommendation: session summaries; recommendation seed computation; generation of query recommendations

Query Processing: query relaxation; query parsing

Result Re-Ranking based on Clarity Scores: K-L Divergence method

Methodology / Implementation Details

Page 11

Session Summary: The session summary vector S_i for a user i consists of all the query fragments Ф of the user's past queries.

Let Q_i represent the set of queries posed by user i during a session, and let F represent the set of all distinct query fragments recorded in the query logs.

We assume that the vector S_Q represents a single query Q ∈ Q_i. For a given fragment Ф ∈ F, we define S_Q[Ф] as a binary variable that represents the presence or absence of Ф in the query Q.

Then S_i[Ф] represents the importance of Ф in the session S_i.

Fragment Based Recommendations
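The session summary above can be sketched in code. This is a minimal illustration, not the paper's implementation: the fragment extractor is a stand-in keyword splitter, and the importance weight S_i[Ф] is taken to be the fragment's relative frequency over the session's queries.

```python
def query_fragments(sql):
    """Toy fragment extractor: treat each lowercase token as a fragment.
    (A real system would parse tables, attributes, and predicates.)"""
    return set(sql.lower().replace(",", " ").split())

def session_summary(queries, all_fragments):
    """S_i[f]: importance of fragment f in the session (here: frequency)."""
    summary = {f: 0.0 for f in all_fragments}
    for q in queries:
        sq = query_fragments(q)        # S_Q: binary presence per fragment
        for f in sq:
            if f in summary:
                summary[f] += 1.0
    n = len(queries) or 1
    return {f: w / n for f, w in summary.items()}

queries = ["SELECT name FROM stars", "SELECT ra, dec FROM stars"]
fragments = {"select", "from", "stars", "name", "ra", "dec"}
S = session_summary(queries, fragments)
# "stars" appears in every query of the session, "name" in only one
```

The binary per-query vector S_Q and the aggregated session vector S_i match the definitions above; only the fragment granularity is simplified.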

Page 12

Recommendation seed computation: To generate recommendations, the framework computes a "predicted" summary S_pred that captures the predicted degree of interest of the active user; S_pred serves as the "seed" for the generation of recommendations.

The predicted summary combines the active user's own session summary with the summaries of past users, weighted by a "mixing factor" α ∈ [0, 1] that determines the importance of the active user's queries.

Using the session summaries of the past users and a vector similarity metric, we construct the (|F| x |F|) fragment-fragment matrix that contains all similarities sim(ρ, Ф) for ρ, Ф ∈ F.

Fragment Based Recommendations
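One plausible reading of the seed computation can be sketched as follows. The exact weighting formula is an assumption (the slide does not reproduce the equation): here S_pred blends the active user's own weights with weights propagated through the fragment-fragment similarity matrix.

```python
# Assumed form: S_pred[f] = alpha*S_active[f] + (1-alpha)*sum_r S_active[r]*sim[r][f]
# This is a sketch of the mixing-factor idea, not the paper's exact formula.

def predicted_summary(s_active, sim, alpha=0.5):
    fragments = s_active.keys()
    s_pred = {}
    for f in fragments:
        # interest inherited from similar fragments (collaborative part)
        collab = sum(s_active[r] * sim[r][f] for r in fragments)
        s_pred[f] = alpha * s_active[f] + (1 - alpha) * collab
    return s_pred

s_active = {"stars": 1.0, "galaxies": 0.0}
sim = {"stars":    {"stars": 1.0, "galaxies": 0.8},
       "galaxies": {"stars": 0.8, "galaxies": 1.0}}
s_pred = predicted_summary(s_active, sim, alpha=0.5)
# "galaxies" gets nonzero predicted interest via its similarity to "stars"
```

The point of the sketch: fragments the active user never touched can still receive weight, which is what makes the summary a recommendation seed rather than a replay of the user's own queries.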

Page 13

Predicted Summary Computation

Page 14

Once the predicted summary S_pred has been computed, the top-n fragments F_n (i.e., the fragments that have received the highest weights) are selected.

Then every past query Q ∈ ∪_i Q_i receives a rank Q_R with respect to the top-n fragments:

Generation of Query Recommendation
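The selection-and-ranking step can be sketched like this. The rank formula below (summed predicted weight of a query's fragments that fall in the top-n set) is an assumption, since the slide's equation is not reproduced.

```python
def top_n_fragments(s_pred, n):
    """F_n: the n fragments with the highest predicted weights."""
    return sorted(s_pred, key=s_pred.get, reverse=True)[:n]

def rank_query(query_fragments, s_pred, top_n):
    """Assumed rank: total predicted weight the query covers within F_n."""
    return sum(s_pred[f] for f in query_fragments if f in top_n)

s_pred = {"stars": 1.0, "ra": 0.6, "dec": 0.6, "logs": 0.1}
top = set(top_n_fragments(s_pred, 3))
r = rank_query({"stars", "ra"}, s_pred, top)
# a query touching "stars" and "ra" scores 1.0 + 0.6
```

Past queries would then be sorted by this rank and the best ones recommended.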

Page 15

Because the query logs contain a plethora of slightly dissimilar queries, we relax the queries in order to increase the overlap between them, and thus the probability of finding similarities between different user sessions.

Query Relaxation
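A toy illustration of relaxation: stripping literal constants makes queries that differ only in constants collapse to the same relaxed form, so different sessions overlap. The regex rules here are an assumption; the slide does not specify the actual relaxation rules.

```python
import re

def relax(sql):
    """Replace literals with placeholders (hypothetical relaxation rules)."""
    sql = re.sub(r"'[^']*'", "?", sql)           # string literals -> ?
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # numeric literals -> ?
    return sql

q1 = relax("SELECT * FROM stars WHERE mag < 12.5")
q2 = relax("SELECT * FROM stars WHERE mag < 9")
# both queries relax to the same form, so the two sessions now match
```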

Page 16

Query Parsing

Page 17

Kullback–Leibler Divergence Theorem:

KL divergence is a special case of a broader class of divergences called f-divergences. It was originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions. It can be derived from a Bregman divergence.

For discrete probability distributions P and Q, the KL divergence of Q from P is defined to be

D_KL(P ‖ Q) = Σ_i P(i) ln( P(i) / Q(i) )

Our Contribution...

Page 18

In words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P. The KL divergence is only defined if P and Q both sum to 1 and if Q(i) = 0 implies P(i) = 0 for all i (absolute continuity). If the quantity 0 ln 0 appears in the formula, it is interpreted as zero, because lim_{x→0+} x ln x = 0.

For distributions P and Q of a continuous random variable, the KL divergence is defined to be the integral

D_KL(P ‖ Q) = ∫ p(x) ln( p(x) / q(x) ) dx

where p and q denote the densities of P and Q.

Our Contribution...
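The discrete definition above translates directly to code, with the 0 ln 0 = 0 convention and the absolute-continuity requirement handled explicitly:

```python
from math import log

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * ln(P(i)/Q(i)).
    Requires Q(i) = 0 => P(i) = 0 (absolute continuity)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue                    # 0 * ln 0 is taken as zero
        if qi == 0.0:
            raise ValueError("KL undefined: Q(i) = 0 but P(i) > 0")
        total += pi * log(pi / qi)
    return total

d = kl_divergence([0.5, 0.5], [0.9, 0.1])
# KL is nonnegative, and zero exactly when P and Q coincide
```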

Page 19

More generally, if P and Q are probability measures over a set X, and P is absolutely continuous with respect to Q, then the Kullback–Leibler divergence from P to Q is defined as

D_KL(P ‖ Q) = ∫_X ln( dP/dQ ) dP

where dP/dQ is the Radon–Nikodym derivative of P with respect to Q, provided the expression on the right-hand side exists. Equivalently, this can be written as

D_KL(P ‖ Q) = ∫_X ( dP/dQ ) ln( dP/dQ ) dQ

Our Contribution...

Page 20

which we recognize as the entropy of P relative to Q. Continuing in this case, if μ is any measure on X for which the densities p = dP/dμ and q = dQ/dμ exist, then the KL divergence from P to Q is given as

D_KL(P ‖ Q) = ∫_X p ln( p / q ) dμ

The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base e if information is measured in nats. Most formulas involving the KL divergence hold irrespective of log base.

Our Contribution...

Page 21

Our Contribution...

The algorithm forms clusters in a bottom-up manner, as follows:

1. Initially, put each article in its own cluster.

2. Among all current clusters, pick the two clusters with the smallest distance.

3. Replace these two clusters with a new cluster, formed by merging the two original ones.

4. Repeat the above two steps until there is only one remaining cluster in the pool.

Thus, the agglomerative clustering algorithm will result in a binary cluster tree with single-article clusters as its leaf nodes and a root node containing all the articles. In the clustering algorithm, we use a distance measure based on log likelihood. For articles A and B, the distance is defined as

Agglomerative Clustering Algorithm
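The four steps above can be sketched as a generic merge loop, with the distance function left as a parameter; the log-likelihood-based distance from the slides would be plugged in as `dist`. The toy distance in the example (difference of cluster means) is only for demonstration.

```python
def agglomerate(items, dist):
    """Merge the closest pair repeatedly until one cluster remains.
    Returns the root cluster and the merge history."""
    clusters = [(x,) for x in items]          # 1. each article in its own cluster
    history = []
    while len(clusters) > 1:
        # 2. find the pair of clusters with the smallest distance
        best = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        i, j = best
        merged = clusters[i] + clusters[j]    # 3. merge the two clusters
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append(merged)                # 4. repeat until one cluster remains
    return clusters[0], history

# toy distance: absolute difference of cluster means
root, hist = agglomerate([1, 2, 10],
                         lambda a, b: abs(sum(a) / len(a) - sum(b) / len(b)))
```

The merge history encodes the binary cluster tree: each entry is an internal node, and the final entry is the root containing all items.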

Page 22

The log likelihood LL(X) of an article or cluster X is given by a unigram model:

LL(X) = Σ_w c_X(w) ln p_X(w)

Here, c_X(w) and p_X(w) are the count and probability, respectively, of word w in cluster X, and N_X is the total number of words occurring in cluster X. Notice that this definition is equivalent to the weighted information loss after merging two articles.

To avoid expensive log likelihood recomputation after each cluster merging step, we define the distance between two clusters with multiple articles as the maximum pairwise distance of the articles from the two clusters:

Our Contribution...
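The unigram log likelihood and the merge distance it induces can be sketched as follows. Reading the distance as the information lost by merging, LL(A) + LL(B) - LL(A ∪ B), is our interpretation of "weighted information loss"; the slide's exact equation is not reproduced.

```python
from math import log
from collections import Counter

def loglik(words):
    """LL(X) = sum_w c_X(w) * ln p_X(w), under X's own ML unigram model."""
    counts = Counter(words)
    n = len(words)
    return sum(c * log(c / n) for c in counts.values())

def merge_loss(a, b):
    """Assumed distance: likelihood lost by modeling A and B jointly."""
    return loglik(a) + loglik(b) - loglik(a + b)

a = ["star", "star", "galaxy"]
b = ["star", "galaxy", "galaxy"]
loss = merge_loss(a, b)
# merging never increases likelihood, so the loss is nonnegative
```

Articles with similar word distributions lose little likelihood when merged, so they are merged first.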

Page 23

where C1 and C2 are two clusters, and A, B are articles from C1 and C2, respectively. Once a cluster tree is created, we must decide where to slice the tree to obtain disjoint partitions for building cluster-specific LMs. This is equivalent to choosing the total number of clusters. There is a tradeoff involved in this choice: clusters close to the leaves can maintain more specifics of the word distributions, whereas clusters close to the root of the tree yield LMs with more reliable estimates, because of the larger amount of data.

We roughly optimized the number of clusters by evaluating the perplexity of the Hub4 development test set. We created sets of 1, 5, 10, 15, and 20 article clusters by slicing the cluster tree at different points. A backoff trigram model was built for each cluster and interpolated with a trigram model derived from all articles for smoothing, to compensate for the different amounts of training data per cluster. Then, the set of LMs that maximizes the log likelihood of the Hub4 development data was selected. Given a cluster model set LM = {LM_i}, the test set log likelihood was obtained as an approximation to the mixture-of-clusters model:

Our Contribution...

Page 24

and P(LM_i) and P(LM_i | A) are the prior and posterior cluster probabilities, respectively. In training, A is the reference transcript for one story from the Hub4 development data. During testing, A is the 1-best hypothesis for the story, as determined using the standard LM.

Our Contribution...

Page 25

Re-ranking based on clarity score

Reranking algorithms can mainly be categorized into two approaches: pseudo-relevance feedback and graph-based reranking.

The pseudo-relevance-feedback approach treats the top results as relevant samples and then collects some samples that are assumed to be irrelevant.

The graph-based reranking approach usually follows two assumptions. First, the disagreement between the initial ranking list and the refined ranking list should be small. Second, the approach constructs a graph whose vertices are images or videos and whose edges reflect their pairwise similarities.

Our Contribution...
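Tying the pieces together, one way to read "re-ranking based on clarity scores" with the K-L method is: score a result set by the KL divergence between its word distribution and the whole collection's, and rerank by that score. This formulation is an assumption based on the method named above, not the slides' exact definition.

```python
from math import log
from collections import Counter

def distribution(words):
    counts = Counter(words)
    n = len(words)
    return {w: c / n for w, c in counts.items()}

def clarity(result_words, collection_words):
    """Assumed clarity score: D_KL(result distribution || collection distribution)."""
    p = distribution(result_words)
    q = distribution(collection_words)
    eps = 1e-6  # smoothing so every result word has nonzero collection probability
    return sum(pi * log(pi / q.get(w, eps)) for w, pi in p.items())

collection = ["star"] * 50 + ["galaxy"] * 50
focused = ["star"] * 10                  # concentrated result set
diffuse = ["star"] * 5 + ["galaxy"] * 5  # mirrors the collection
# a focused result set scores higher (is "clearer") than one that mirrors
# the collection, so it would be ranked ahead after re-ranking
```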

Page 26

Thank You….

