
HIDDEN MARKOV MODELS IN SEARCH APPLICATIONS

CONTEMPORARY CONCERNS STUDY
TERM VI, PGP 2012-2014

HARSH VARDHAN, [email protected]
MANISH GUPTA, [email protected]
PROF. U DINESH KUMAR, [email protected] (Research Supervisor)

26TH FEBRUARY, 2014

EXECUTIVE SUMMARY

TABLE OF CONTENTS

INTRODUCTION

1.1 SEARCH ENGINES - THE CONTEXT: The evolution of the internet and related technological products and services has been concomitant with the growth in web-based data. Consumer behaviour on the internet is now primarily driven by searching through this data to locate relevant information. A search engine is the interface through which consumers retrieve that information: it mines through massive amounts of data in order to present the most relevant results to the user in the least amount of time possible.

1.2 IMPLEMENTATION OF SEARCH ENGINES: A user typically provides a string, referred to as the search query, that describes the content he is looking for, and the search engine returns a list of results that match the query. The results shown to the user are pre-processed and ranked so that the most relevant result appears early in the list. The ranking methodologies derive from probability models, account for factors such as the popularity of a search item and the user's search history, and in some instances also incorporate user feedback.

1.3 DAWN OF SEARCH-BASED APPLICATIONS: The growth in web-based data led to the emergence of search-based software applications, where the core functionality provided to the user derives from a search-engine backbone. Examples of such applications are Wikipedia, IMDb, Blogspot, Reddit, Bloomberg, etc. Even though not all web applications are primarily driven by search, virtually every one of them provides a search functionality. The search-based application that this paper will focus on is the electronic-commerce website.

1.4 CRUCIALITY OF SEARCH TO ELECTRONIC-COMMERCE APPLICATIONS: With the proliferation of e-commerce websites, web businesses are finding it increasingly difficult to differentiate amongst themselves. Profitability in retail is now driven by bringing as many users as possible to the website and then converting them into active customers. The raison d'être of e-commerce applications is to allow users to locate the product they want and to buy it. The act and experience of locating the right product is crucial to the buying experience and is responsible for a) boosting sales, b) converting visitors into customers, c) increasing the hit-rate and d) establishing brand loyalty. Thus, a relevant and efficient search engine is critical to the success of any electronic-commerce application.

1.5 NATURAL LANGUAGE QUERIES - A CAN OF WORMS: Traditionally, search engines only allowed users to express queries in a strict syntax. Lately, search engines not only allow queries to be descriptive but also allow them to be expressed in natural language. With free-form search queries, consumers find it convenient to express what they are looking for. However, this functionality poses a host of new problems for search engine implementation, viz. spelling mistakes, phonetic as well as semantic mistakes, phrasal errors, etc. Search engines now not only have to look for results that match a user's query but also have to infer what the user is most likely searching for, given that he can type incorrect search queries.

1.6 QUERY MODIFICATION - FOCUS OF THIS PAPER: Academic research in building better search engines has focused on two areas: a) relevance of search results and b) speed of result delivery. The concern of this paper is the former, i.e. how search engines can show the most relevant results given that the user can provide natural language queries, complete with possible errors. The problem of relevance is expansive, and this paper will focus on the sub-problems of suggesting corrections or modified queries as the user types, and of showing the most relevant results after the user has typed and entered the query.
The research problem is described in detail in the next section.

PROBLEM STATEMENT & SIGNIFICANCE

The research problem for this paper can be stated as building an algorithm for search engines to provide dynamic, relevant and ranked query suggestions. The search functionality task-flow on any website, which this paper intends to model, can be broken down into two separate phases: the first spans the duration when the user is typing in the search query, and the second involves showing the results to the user.

2.1 PHASE I - DYNAMIC AUTOSUGGEST: This phase spans the time the user types into the search box. The primary task in this phase is to process the characters entered by the user and suggest search terms that match the product or item the user is most likely looking for. This task can be broken down into the following two separate functionalities.

2.1.1 Spelling Correction: The spell-check problem has evinced a considerable amount of interest (from as far back as 1957) and accounts for a significant amount of research. Highly evolved algorithms exist that address the three main components of a spelling correction problem: a) missed characters, b) misspelled words, and c) combining broken words.

2.1.2 Auto-Complete Suggestion: The auto-complete problem assumes that the string typed by the user is correct and then predicts and suggests the most probable terms that follow the typed string. One of the most popular techniques used for this problem is the Levenshtein algorithm, which uses edit distance as a measure of how dissimilar two strings are in order to arrive at a completion suggestion. Auto-complete suggestions can be used to complete partially typed words as well as to propose complete phrases that the user might be looking to type. At this point, it is critical to note the difference between a spelling-correction suggestion and an auto-complete suggestion: while the former assumes that the user could have entered an incorrect query and proposes corrections, the latter assumes that the user entered the query correctly and then proposes the most probable complete words and phrases. This dichotomy is used later in the paper in order to build integrated search results.

The research problem, thus, can be broken down into the aforementioned phases. The algorithm that this paper aims to devise should be able to aggregate the results of spell-check and auto-complete to make suggestions in Phase I.

2.2 PHASE II - AFTER-TYPING SEARCH RESULTS: Once the user is done typing the search query, whether following the suggestions made in Phase I or not, he or she presses the return key to be led to the search page. The problem now can be characterized as: what is the most probable search term that the user intended to type, given the actually typed string? This problem is critical as it determines the relevance of the results eventually shown to the user. The key question in this phase is whether the search engine should follow an approach different from the one used in Phase I in order to determine the modified or unmodified search query for which it should show results.

LITERATURE REVIEW

Rastislav Sramek (2007)[endnoteRef:1] presents an elaborate review of Hidden Markov Models and the Viterbi algorithm, focusing on the space complexity of determining the most probable state sequence given a sequence of observations. The vanilla Viterbi algorithm takes O(n) space in determining the most probable state path taken by the model. The author modifies the algorithm by introducing checkpointing and convolutional coding and demonstrates that the memory requirement can be driven down to between O(1) and O(n) depending on the input data pattern. In the following paragraphs we summarize the HMM and Viterbi concepts by presenting them in an algorithmic manner.
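The edit-distance measure underpinning both suggestion types can be computed with the standard dynamic program; a minimal sketch (the function name is ours, not from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("vollyball", "volleyball"))  # 1 (one missing 'e')
```

A spell-checker would propose lexicon words within a small edit distance of the typed string, while an auto-complete component instead extends the string under the assumption that it is already correct.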
[1: Rastislav Sramek; The Online Viterbi Algorithm; Comenius University, Bratislava, 2007; Link; Accessed on 4th Feb, 2014]

3.1 HMM OVERVIEW: The use-case of an HMM is quite similar to that of posterior probability distributions, in the sense that we try to predict the real reason behind an observed outcome. An HMM, eponymously, includes system states that are hidden or unobserved. Any Hidden Markov Model can be defined by the following 5 attributes:

HMM = {S, O, T, E, π}

Where,
S = Set of hidden states which the system can take at any point of time
O = Set of observations or system outputs
T = Transition probabilities, i.e. Tij = P(Current State = j | Previous State = i)
E = Emission probabilities, i.e. Eij = P(Observation = j | State = i)
π = Initial probability vector, i.e. the initial state distribution

In order to determine the probability that the model generates a particular sequence of observations, we can employ two different techniques, discussed below. These techniques are then used for developing the Viterbi algorithm.

3.2 FORWARD PROBABILITY & ALGORITHM: The forward probability Fij is the probability that the model generates the observed sequence X1, X2, ..., Xi and ends in state j:

Fij = P(X1, X2, ..., Xi generated, Last State = j)

Then,

F(i+1),j = [ Σ(k=1..m) Fik * Tkj ] * Ej,X(i+1)

Using the forward algorithm, the probability that a particular sequence X of length n was generated reduces to the summation, across all states, of the probabilities of generating that sequence and ending in a particular state:

P(X) = Σ(i=1..m) Fni

3.3 BACKWARD PROBABILITY & ALGORITHM: The backward probability Bti is the probability that the system generates the future sequence X(t+1), X(t+2), ..., Xn, given that the system is in state i at time t:

Bti = P(X(t+1), X(t+2), ..., Xn | State = i at time t)

Then,

Bti = Σ(k=1..m) Tik * Ek,X(t+1) * B(t+1),k

Analogous to the forward algorithm, the backward algorithm reduces the problem of determining the probability of generation of a sequence to the following summation:

P(X) = Σ(i=1..m) πi * Ei,X1 * B1i

3.4 VITERBI ALGORITHM: Moving on from the probability of a sequence of observations, the Viterbi algorithm is used to determine the most probable state path given a sequence of observations. The algorithm can be broken down into two phases.

3.4.1 Phase I: For each possible state, calculate the maximum-probability state path that ends in that state.

3.4.2 Phase II: Use the state paths determined in the first phase to trace back the sequence of states that constitutes the most probable state path.

3.5 ONLINE VITERBI ALGORITHM: The space complexity of plain Viterbi is O(m*n), where m is the size of the state space and n is the length of the observation sequence. The author compresses this mapping, or matrix, into a tree in order to minimize the space complexity of the algorithm. For the purpose of this paper, it is prudent to treat this modification as out of scope; however, high-performance and dynamic implementations can benefit significantly from it.

3.6 SPELLING CORRECTION: Spelling correction has been of academic interest for a very long time. In the initial period, the focus was primarily on identifying and correcting non-word errors. A simple strategy to identify such words was to use a lexicon and calculate distances using any of the prevalent distance measures; Levenshtein's (1966)[endnoteRef:2] distance is one widely used measure. Lexicon sizes used to be very small. [2: Levenshtein, V I. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, 10(8), 707-710, 1966.]
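The two Viterbi phases described above can be sketched directly from the definitions of π, T and E. The two-state model at the bottom is hypothetical, invented only to exercise the function:

```python
def viterbi(obs, states, pi, T, E):
    """Most probable hidden-state path for an observation sequence.
    pi[s]: initial probability of state s, T[r][s]: P(next = s | current = r),
    E[s][o]: P(observation = o | state = s)."""
    # Phase I: best probability of any state path ending in each state
    V = [{s: pi[s] * E[s][obs[0]] for s in states}]
    back = []  # back-pointers, one dict per observation after the first
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * T[r][s])
            col[s] = prev[best] * T[best][s] * E[s][o]
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    # Phase II: trace the back-pointers to recover the most probable path
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), V[-1][last]

# Hypothetical two-state model, purely for illustration
states = ["Rainy", "Sunny"]
pi = {"Rainy": 0.6, "Sunny": 0.4}
T = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
E = {"Rainy": {"umbrella": 0.9, "walk": 0.1},
     "Sunny": {"umbrella": 0.2, "walk": 0.8}}
path, p = viterbi(["umbrella", "umbrella", "walk"], states, pi, T, E)
print(path)  # ['Rainy', 'Rainy', 'Sunny']
```

The same function can equally be run over character-level states and observations, which is how it is applied to spell-check later in this paper.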

Statistical generative models introduced in the 1990s revolutionized the field of spelling correction. In a statistical generative model, two components are critical: the n-gram language model and the error model. An improved statistical error model was proposed by Brill and Moore[endnoteRef:3] (2000) that increased both the accuracy and efficiency of the generative model. The need for better error and language models was thus identified, which created interest among many academicians to research the subject further. One major problem in building the error model suggested by Brill et al. was that it requires a large number of manually maintained word-correction pairs and thus proved very costly. Whitelaw et al. (2009)[endnoteRef:4] overcame this problem by leveraging the web: they were able to automatically mine word-correction pairs from web data. [3: E. Brill and R. Moore. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 2000.] [4: C. Whitelaw, B. Hutchinson, G. Chung, and G. Ellis. Using the web for language independent spellchecking and autocorrection. In EMNLP, pages 890-899. ACL, 2009.]

The advent of the web and of search engine services drew much more attention to the field of spelling correction. Search engine queries, in particular, have been of tremendous interest to academicians and corporate researchers alike. Focus shifted to newer challenges that either did not exist or were not of much importance until then. For example, a word may be legitimate but inappropriate in the context of the query. Apart from the insertion or omission of characters, words may be incorrectly split or concatenated. To address these issues, multiple studies by Cucerzan & Brill (2004)[endnoteRef:5], Ahmad & Kondrak (2005)[endnoteRef:6], and Chen, Li & Zhou (2007)[endnoteRef:7] suggest that resources such as large web corpora and query logs can prove very helpful. Sun et al. (2010)[endnoteRef:8] provided a method to train a phrase-based error model using clickthrough data. Gao et al. (2010)[endnoteRef:9] introduced a ranker-based model that provides the flexibility to utilize additional features of the correction words to arrive at the best correction string. However, there had been limited progress in fixing errors of the concatenation and splitting types. [5: S. Cucerzan and E. Brill. Spelling correction as an iterative process that exploits the collective knowledge of web users. In EMNLP, 2004.] [6: F. Ahmad and G. Kondrak. Learning a spelling error model from search query logs. In HLT/EMNLP. The Association for Computational Linguistics, 2005.] [7: Q. Chen, M. Li, and M. Zhou. Improving query spelling correction using web search results. In EMNLP-CoNLL, pages 181-189. ACL, 2007.] [8: X. Sun, J. Gao, D. Micol, and C. Quirk. Learning phrase-based spelling error models from clickthrough data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL 10, pages 266-274, Stroudsburg, PA, USA, 2010.] [9: J. Gao, X. Li, D. Micol, C. Quirk, and X. Sun. A large scale ranker-based system for search query spelling correction. In C.-R. Huang and D. Jurafsky, editors, COLING, pages 358-366. 2010.]

In addition to spelling correction, a broader topic of research emerged that focused on altering and refining queries. Query refinement tries to improve the search query string in order to produce better results based on the search intent. Xu & Croft (1996)[endnoteRef:10] explain that in query expansion, additional terms are added to the query to improve query formulation. Tan & Peng (2008)[endnoteRef:11] as well as Li, Hsu, Zhai & Wang (2011)[endnoteRef:12] show that query segmentation is a process where the query is segmented into two or more meaningful sub-queries. Li et al. (2012)[endnoteRef:13] provided a generalized Hidden Markov Model with a better algorithm for correcting errors of the concatenation and splitting types. [10: Jinxi Xu and W. Bruce Croft. Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR 96. ACM, New York, NY.] [11: B. Tan and F. Peng. Unsupervised query segmentation using generative language models and wikipedia. In Proceedings of the 17th international conference on World Wide Web, WWW 08, pages 347-356. 2008.] [12: Y. Li, B.-J. P. Hsu, C. Zhai, and K. Wang. Unsupervised query segmentation using clickthrough for information retrieval. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, SIGIR 11, pages 285-294. 2011.] [13: Yanen Li, Huizhong Duan and ChengXiang Zhai. A Generalized Hidden Markov Model with Discriminative Training for Query Spelling Correction.]

3.7 NOISY CHANNEL MODEL: The noisy channel model is a probabilistic model widely used as the basis for most spell-checker algorithms. It is a framework used to identify the correct word given a noisy observation containing errors. Given an input query Q, our goal is to find the best correction C*.

Applying Bayes' theorem (the denominator P(Q) is constant across candidate corrections):

C* = argmax over C of P(C|Q) = argmax over C of P(Q|C) * P(C)
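A toy illustration of the noisy-channel selection C* = argmax over C of P(Q|C) * P(C), with hypothetical language-model and error-model tables (all numbers invented):

```python
# Hypothetical language model P(C) and error model P(Q|C)
lm = {"espresso": 0.6, "expression": 0.4}
em = {("esspresso", "espresso"): 0.2,
      ("esspresso", "expression"): 0.001}

def best_correction(query, candidates):
    # C* = argmax_C P(Q|C) * P(C); P(Q) is constant and can be dropped
    return max(candidates, key=lambda c: em.get((query, c), 0.0) * lm[c])

print(best_correction("esspresso", ["espresso", "expression"]))  # espresso
```

In practice the two tables would be estimated from large corpora and query logs, as the literature reviewed above describes.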

The noisy channel model consists of two parts: a language model (LM) and an error model (EM).

3.7.1 Language Model: The language model tells us how likely a given word or proposed correction is to be a logical word in the lexicon. It assigns a probability to a sequence of words and is usually linked with a lexicon that contains a very large collection of contextually logical words. In the above equation, P(C) represents the language model.

3.7.2 Error Model: The error model, or channel model, tells us the probability of observing a query string given a word. Thus, the error model models the channel that converts the word into the misspelling. In the above equation, P(Q|C) represents the error model.

3.8 GENERALIZED HIDDEN MARKOV MODEL: Li et al. (2012) provided a Generalized Hidden Markov Model (gHMM) for query spelling correction. The authors address two major issues in spelling correction. First, existing algorithms are not efficient in recognizing important spelling errors like concatenation and splitting. Second, due to the complex form of the scoring functions, the solutions are not always optimal.

Table 1 below summarizes the common kinds of spelling mistakes observed in a search engine and categorizes them across three buckets.

Error Type                 Example        Correction
In-Word    Insertion       esspresso      espresso
           Deletion        vollyball      volleyball
           Substitution    comtemplate    contemplate
           Mis-use         capital hill   capitol hill
Cross-Word Concatenation   intermilan     inter milan
           Splitting       power point    powerpoint

Table 1: Common Mistakes in Search Queries

The gHMM is a single-stage solution that takes a query as input and provides ranked corrections in the form of a list. The parameters are trained with a discriminative method on labelled spelling examples. The gHMM has two significant advantages over a plain HMM. First, in a plain HMM there can theoretically be a very large number of states, since a phrase can be chosen arbitrarily; the gHMM attaches a state type to each state, where the state type indicates the type of correction operation, and thus effectively reduces the search space. Second, it uses feature functions to parameterize the probability of state sequences, which helps in mapping the transition and emission probabilities to a small set of parameters. For a given query, a sequence of states is defined along with a context for each state; in addition, the gHMM defines feature functions that measure the interdependency of adjacent states. The log probability of a state sequence and its corresponding types is proportional to a weighted sum of these feature functions over adjacent state pairs.

The best state sequence is the one that maximizes this log probability over all candidate state sequences:

S* = argmax over S of P(S | Q)

The model is trained using a discriminative approach on query spelling correction data. The training algorithm, along with the algorithm for decoding the top-K corrections, is reproduced in Appendix 12.1. The results obtained with gHMM are very promising. In terms of effectiveness, a comparison with the noisy channel model demonstrates that gHMM is able to construct a more complete set of candidate solutions. In terms of efficiency, the results of gHMM and the noisy channel model are of the same order.

3.9 LARGE SCALE RANKER-BASED SYSTEM: Gao et al. (2010) extended the noisy channel model with three breakthrough features. First, a ranker-based speller is introduced that allows multiple features to be incorporated. Second, a distributed infrastructure is used to train and apply web-scale n-gram LMs; since the language style of search queries is generally different from that of a text document, web corpora text streams are expected to provide better results. Finally, a new phrase-based error model is proposed that places a probability distribution over transformations between multi-word phrases. The probabilities are estimated using query-correction pairs, which are in turn derived from search logs.

3.9.1 Ranker-Based Speller: The ranker-based speller provides the flexibility to incorporate multiple features that help in spelling correction. For example, it can check whether a candidate correction appears as a Wikipedia title, in which case its probability may be higher in comparison to others. To improve efficiency and provide flexibility, the speller system operates in two stages: a) candidate generation and b) re-ranking. In candidate generation, a list of the most probable candidates is retained for each of the tokenized sub-queries, and the 20 best candidates among these are picked using a decoder.

A standard two-pass algorithm is used: first, the Viterbi algorithm is used to find the best correction; second, an A* algorithm is used to find the next most probable corrections. For re-ranking, a set of features is used to rank the most probable corrections obtained in the candidate generation step.

3.9.2 Web-Scale Language Models: The probability of a word string w1, ..., wL is defined by an n-gram LM as:

P(w1, ..., wL) = Π(i=1..L) P(wi | w(i-n+1), ..., w(i-1))
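For a bigram (n = 2) model, this factorization can be sketched as follows; the conditional probabilities below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical bigram conditional probabilities P(w | prev) and
# unigram probabilities P(w), as might be estimated from a query log
bigram = defaultdict(float, {("samsung", "mobile"): 0.4,
                             ("samsung", "charger"): 0.1})
unigram = defaultdict(float, {"samsung": 0.01})

def bigram_prob(words):
    # P(w1..wL) = P(w1) * prod over i of P(wi | w(i-1))
    p = unigram[words[0]]
    for prev, w in zip(words, words[1:]):
        p *= bigram[(prev, w)]
    return p

print(bigram_prob(["samsung", "mobile"]))  # 0.01 * 0.4
```

A production system would estimate these tables from web-scale counts with smoothing; the defaultdict simply assigns probability zero to unseen n-grams here.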

Since the language style of a search query is very different from that of a web document (Zhang et al., 2006; Brants et al., 2007), a web n-gram LM collection is used.

3.9.3 Phrase-Based Error Models: The phrase-based error model processes sequences of words rather than individual words, which provides contextual information to the error model. A multi-term phrase is more effective because it captures the inter-dependencies of the various words. The results show that the ranker-based speller outperforms the noisy channel model significantly in terms of accuracy, precision and recall; that the accuracy of the speller can be further improved by employing more sophisticated (e.g. non-linear) rankers; and that the web LM and the phrase-based error model further improve the performance significantly, with the improvement being additive in nature.

APPROACH

As described in the previous sections, we treat the problem in its broken-down state and try to integrate the solutions of each phase and sub-phase. For the Phase I problem of spelling correction, we look towards the HMM-based Viterbi algorithm to arrive at spelling corrections. The auto-complete suggestions are tackled using a backward HMM model that predicts the most probable search terms that should follow the typed string. Once we have these two result sets, we try to integrate the two sets of suggestions in order to arrive at a master set that produces the most relevant suggestions for the user. Once the user enters a search string, whether he uses the suggestions or not, Phase II kicks in, and we try to solve the problem of presenting the most relevant set of search results, again using the HMM model. The following use-case not only instantiates the problem statement but also outlines our approach.

Search Term: San

[Figure content: the Viterbi spell-check produces corrections such as Sam, Pan, Sen, Ten, Tan and Sab; the backward-HMM auto-complete produces suggestions such as Sansui, Sansui Mobile, Sansui Cellphone, Sanyo, Sanjay Dutt and Sangeeta Bijlani; a result-integration algorithm combines these into integrated suggestions such as Samsung, Samsung Mobile, Samsung Charger, Sansui Cellphone, Sansui TV and Sanyo Player.]

Figure 1: Phase I Task Flow

The Phase II approach is under discussion. From a bird's-eye view, the problem is not different from the auto-suggest problem of Phase I, but the fact that the user entered the string despite the suggestions needs to have a bearing on the set of results that are eventually displayed to him or her. One approach is to take the user-typed string at face value and display all products matching that description. However, this approach ignores the fact that the user might have typed something different from what he is actually looking for. This again builds the case for detecting the most probable state sequence, i.e. going back to the HMM models used in Phase I. The decision to use the most probable state sequence should be based on the number and popularity of the relevant products that come up as results for the actually typed search query.

SPELL-CHECK & AUTOCOMPLETE AGGREGATION

* Explain multiple ways to do it: a) simple results merge, b) pure ranking based, and c) conditional probability based.
* Explain the conditional probability approach and why we choose it as the solution. Talk about how conditional probability performs aggregation.

MODELING THE CONDITIONAL PROBABILITY RANKING SYSTEM

* Bayes theorem model, complete with all the equations.
* Feature list for ranking, etc.

ALGORITHM & PSEUDOCODE

Process: Spell Check
Step 1: Tokenize the query string into a sequence of terms.
Step 2: For each term, get the list of probable candidates from the lexicon based on the edit distance of each candidate. The edit distance must be less than a threshold value, say t.
Step 3: Use the Viterbi algorithm and the A* algorithm to identify the 20 best candidates.
Input: San
Output: Sab, Sam, Son, etc.

Process: BHMM on Probable Corrections
Step 4: For each probable correction, find the most probable complete words and/or phrases that the user might be looking to type. This leads to a list of words and phrases that we want to suggest to the user. Ex: Sanyo, Sanyo Walkman, Samsung, Samsung Mobiles.
Step 5: Conditional Probability Ranking - Use the conditional probability approach (outlined earlier) to rank all the words and phrases that come out of Step 4.
Step 6: Suggest - Choose the top 20 words/phrases and suggest them to the user.
Step 7: Feedback - Update the conditional probability tables when the user does or does not take a suggestion.

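The steps above can be sketched end to end as a toy pipeline. The dictionaries below stand in for the spell-check and backward-HMM components; the probabilities are the ones that appear in the dry run later in the paper, and the function name and structure are ours:

```python
# Toy stand-ins for Steps 1-4; probabilities taken from the dry run section
CORRECTIONS = {"san": [("san", 0.1384), ("sam", 0.1246)]}      # spell-check
COMPLETIONS = {"san": [("sanyo", 0.3), ("sansui", 0.7)],       # backward HMM
               "sam": [("samsung", 0.8), ("samsonite", 0.2)]}

def phase_one_suggest(typed, k=20):
    """Steps 5-6: rank completions by conditional probability
    (correction probability x completion probability) and return
    the top-k suggestions. The feedback step is omitted."""
    ranked = [(phrase, p_corr * p_phrase)
              for corr, p_corr in CORRECTIONS.get(typed, [])
              for phrase, p_phrase in COMPLETIONS.get(corr, [])]
    ranked.sort(key=lambda x: -x[1])
    return ranked[:k]

print(phase_one_suggest("san")[0][0])  # samsung
```

With these figures, "samsung" outranks "sansui" even though "sam" is a less probable correction than "san", because its completion probability is higher; this is exactly the aggregation behaviour the conditional probability approach is meant to produce.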

MOCK DATA

In this section, we define simple mock data that we shall use in the next section to illustrate our search algorithm.

8.1 Language Dictionary: Our language dictionary contains the following words:

SAMSONITE
SAMSUNG
SANSUI
SANYO

Table XXX: Language Dictionary

8.2 HMM Parameters: Next, we define the Hidden Markov Model parameters: initial probabilities, transition probabilities and emission probabilities.

           A     M     N     O     S
P(xn,1)    0.3   0.1   0.1   0.1   0.4

a) Initial Probabilities

P(xt|xt-1)   A     M     N     O     S
A            0.1   0.3   0.2   0.1   0.3
M            0.4   0.2   0.1   0.1   0.2
N            0.4   0.2   0.1   0.1   0.2
O            0.4   0.2   0.1   0.1   0.2
S            0.5   0.1   0     0.4   0

b) Transition Probabilities (rows: previous state, columns: current state)

P(qt|xt)     A     M     N     O     S
A            0.7   0.1   0.1   0.1   0
M            0.1   0.7   0.1   0.1   0
N            0.1   0.3   0.5   0.1   0
O            0     0.1   0.1   0.7   0.1
S            0.1   0.2   0     0     0.7

c) Emission Probabilities (rows: hidden state, columns: observed character)

Table XXX: HMM Parameters

8.3 Backward HMM Probabilities: We have also simulated the backward HMM probabilities to keep the illustration simple.

P(SANYO|SAN)       0.3
P(SANSUI|SAN)      0.7
P(SAMSUNG|SAM)     0.8
P(SAMSONITE|SAM)   0.2

Table XXX: Backward HMM Probabilities

8.4 Tag List: The tag list is as provided in the table below:

Electronics
Brand
Popular

Table XXX: Tag List

8.5 Query-Tag Matrix and Suggestion-Tag Matrix: Finally, we create dummy matrices for the query-tag and suggestion-tag associations.

       Electronics   Brand   Popular
SAN    7             7       4

a) Query-Tag Matrix

              SANYO   SANSUI   SAMSUNG   SAMSONITE
Electronics   9       9        9         1
Brand         6       5        8         7
Popular       4       7        9         6

b) Suggestion-Tag Matrix

Table XXX: Association Matrices
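The mock parameters above can be transcribed directly into code (the representation is ours); a quick sanity check confirms that every row is a valid probability distribution:

```python
states = ["A", "M", "N", "O", "S"]
pi = {"A": 0.3, "M": 0.1, "N": 0.1, "O": 0.1, "S": 0.4}

# Transition probabilities T[prev][curr] and emission probabilities E[state][obs],
# copied from the mock-data tables above
T = {"A": dict(zip(states, [0.1, 0.3, 0.2, 0.1, 0.3])),
     "M": dict(zip(states, [0.4, 0.2, 0.1, 0.1, 0.2])),
     "N": dict(zip(states, [0.4, 0.2, 0.1, 0.1, 0.2])),
     "O": dict(zip(states, [0.4, 0.2, 0.1, 0.1, 0.2])),
     "S": dict(zip(states, [0.5, 0.1, 0.0, 0.4, 0.0]))}
E = {"A": dict(zip(states, [0.7, 0.1, 0.1, 0.1, 0.0])),
     "M": dict(zip(states, [0.1, 0.7, 0.1, 0.1, 0.0])),
     "N": dict(zip(states, [0.1, 0.3, 0.5, 0.1, 0.0])),
     "O": dict(zip(states, [0.0, 0.1, 0.1, 0.7, 0.1])),
     "S": dict(zip(states, [0.1, 0.2, 0.0, 0.0, 0.7]))}

# Every row of pi, T and E should sum to 1
assert abs(sum(pi.values()) - 1.0) < 1e-9
for row in list(T.values()) + list(E.values()):
    assert abs(sum(row.values()) - 1.0) < 1e-9
```

These are exactly the structures a Viterbi-style decoder would consume in the dry run that follows.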

DRY RUN

9.1 Input Query: To illustrate the algorithm, let us take the sample query string "SAN". Hence, we have observed variables q1, q2 and q3 as S, A and N respectively.

9.2 Forward-Backward Probabilities: The f(t,t+1) values have been calculated and are provided below.

f1,2                            x2
          A          M          N          O          S
x1  A     0.017744   0.005794   0.003863   0.001931   0
    M     0.047318   0.002575   0.001288   0.001288   0
    N     0          0          0          0          0
    O     0          0          0          0          0
    S     0.828069   0.018026   0          0.072104   0

f2,3                            x3
          A          M          N          O          S
x2  A     0.041805   0.376244   0.418049   0.041805   0.041805
    M     0.013271   0.019907   0.016589   0.003318   0.013271
    N     0.006636   0.009954   0.008295   0.001659   0.006636
    O     0.010617   0.015926   0.013271   0.002654   0.010617
    S     0          0          0          0          0

Table XXX: Forward-Backward Probabilities

9.3 Most Likely State Sequences: We run the algorithm to obtain the two most likely state sequences; the same method can be applied recursively to obtain further state sequences as well.

First Most Likely State Sequence: From f1,2(x1,1, x1,2), the maximum probability observed is 0.828, for x1,1 = S and x1,2 = A. From f2,3(x1,2, x1,3), the maximum probability for x1,2 = A is 0.418, which gives x1,3 = N. Hence, the first most likely state sequence is:

x1 = {S, A, N}, i.e. "SAN"

The probability for x1 is 0.1384.

Second Most Likely State Sequence: For the second most likely sequence, we need to create the subsets M1, M2 and M3. Using the f(t,t+1) values and the algorithm, the maximum probability within each subset is:

M1    0.047318
M2    0.828069
M3    0.376244

Hence, the maximum probability is observed as 0.8281 for M2, for the sequence x1 = S and x2 = A. x3 can then be calculated using f2,3(x2,2, x2,3); x2,3 is found to be M. Thus, the second most likely state sequence is:

x2 = {S, A, M}, i.e. "SAM"

The probability for x2 is 0.1246.

9.4 Conditional Probabilities: Next, the conditional probabilities are calculated by multiplying the backward HMM probabilities with the spell-check probabilities calculated above.

Spell-Check Probability    BHMM Probability          Conditional Probability
P(SAN|SAN)   0.1384        P(SANYO|SAN)     0.3      P(SANYO|SAN)       0.04152
                           P(SANSUI|SAN)    0.7      P(SANSUI|SAN)      0.09688
P(SAM|SAN)   0.1246        P(SAMSUNG|SAM)   0.8      P(SAMSUNG|SAN)     0.09968
                           P(SAMSONITE|SAM) 0.2      P(SAMSONITE|SAN)   0.02492

Table XXX: Conditional Probability Calculations
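These conditional probabilities, together with the degree-of-association moderation applied to them in the next two subsections, can be checked numerically; all figures are taken from the mock-data tables above:

```python
spell_check = {"SAN": 0.1384, "SAM": 0.1246}
bhmm = {"SANYO": ("SAN", 0.3), "SANSUI": ("SAN", 0.7),
        "SAMSUNG": ("SAM", 0.8), "SAMSONITE": ("SAM", 0.2)}

# Conditional probability = spell-check probability x backward-HMM probability
conditional = {s: spell_check[corr] * p for s, (corr, p) in bhmm.items()}

# Degree of association: dot product of the query-tag vector for "SAN"
# with each suggestion-tag column vector
query_tag = [7, 7, 4]  # SAN against (Electronics, Brand, Popular)
suggestion_tag = {"SANYO": [9, 6, 4], "SANSUI": [9, 5, 7],
                  "SAMSUNG": [9, 8, 9], "SAMSONITE": [1, 7, 6]}
assoc = {s: sum(q * t for q, t in zip(query_tag, v))
         for s, v in suggestion_tag.items()}
total = sum(assoc.values())  # 482

# Moderated probability = conditional probability x relative degree of association
moderated = {s: conditional[s] * assoc[s] / total for s in assoc}

print(round(conditional["SAMSUNG"], 5))   # 0.09968
print(assoc)                              # SAMSUNG has the highest association, 155
print(max(moderated, key=moderated.get))  # SAMSUNG
```

Running this reproduces the tables that follow: the association scores 121, 126, 155 and 80, and SAMSUNG as the top moderated suggestion.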

9.5 Degree of Association: The degree of association is calculated by vector multiplication of the query-tag matrix and the suggestion-tag matrix. The results are provided in the table below.

                                  SANYO   SANSUI   SAMSUNG   SAMSONITE
SAN                               121     126      155       80
Relative Degree of Association    0.251   0.261    0.322     0.166

Table XXX: Degree of Association

9.6 Moderated Probabilities: Using the degree-of-association values, the conditional probability values are updated and provided in the table below in descending order of probability.

Suggestion             Probability

P(SAMSUNG|SAN) 0.032054772

P(SANSUI|SAN) 0.025325477

P(SANYO|SAN) 0.010423071

P(SAMSONITE|SAN) 0.0041361

Table XXX: Moderated Probabilities

CONCLUDING REMARKS

SCOPE FOR FURTHER RESEARCH

APPENDIX

12.1 DISCRIMINATIVE TRAINING & TOP-K DECODING IN GHMM

REFERENCES

