MarkLogic_BigData

Date post: 09-Dec-2015

Copyright © 2013 Accenture All Rights Reserved. Accenture, its logo, and Accenture High Performance Delivered are trademarks of Accenture.

MarkLogic – A NoSQL Database

Presented by:- Chandan Abhishek

About me…

Copyright © 2015 Accenture All Rights Reserved. Confidential — For Company Internal Use Only.

Chandan Abhishek

About 6 years of comprehensive IT experience, including some months onsite in the UK.

With Accenture for 1.2 years, as part of the Digital Data & Analytics capability.

Currently working for the client Warner Bros., which comes under CMT.

Technical expertise in Big Data, MarkLogic, PL/SQL and the .NET Framework.

Professional experience with major clients such as Springer, Macmillan and Warner Bros.

Recently won the CMT Apex Award.

MarkLogic Server

• XML Server

• Special-purpose DBMS for XML

– Semi-structured

– Hierarchical

• Designed for 100s of TB of XML

How Did We Get Here?

• Founder: Christopher Lindblad

– MIT

– Architect of Ultraseek Server

• Intranet search engine product

• Met people that wanted to use a search engine like a database

– Rich query language

– Guaranteed correctness

– Transactions

Consider an Application

• Documents + metadata

• Documents: rich, variable structure

• Want: complex full-text search

• Want: combined text, metadata, structure-aware search

• Want: granular ad hoc access

• Want: real-time query

• How do you build it?

Two-headed Monster

I’m an RDBMS

Answers are right or wrong

I like to combine small pieces

I allow granular access

Linguistic complexity hurts my brain

I guarantee ACID properties

Updates are visible right away

I’m a search engine

Some answers are better than others

Most pieces of information are large

I can give you the whole document

Structure hurts my brain

I’m optimized for sparse data

Updates are visible… oh, whenever

A Different Approach

• Soul of Search Engine: Data Model And Queries

• Database: On-disk Organization And Transactions

Data Model

[Diagram: a Document made up of Title, Author, Abstract, several Sections (one marked “cont’d”), Footer, and Metadata]

Data Model

• A database for XML . . .

. . . uses the XML Data Model

• XML is a tree

[Figure: tree rooted at Document, with nodes Title, Author (First, Last), several Section nodes, and Metadata]

Example Document

<article>
  <title>MarkLogic Server: The Best Place for XML</title>
  <author><first-name>John</first-name><last-name>Kreisa</last-name></author>
  <abstract>
    Where should one put their XML? <company>Mark Logic</company> has the best
    answer to this question: MarkLogic Server. . . .
  </abstract>
  <body>
    <section>
      <section> This high performance engine can . . . </section>
    </section>
    <section> Using an inverted index technique . . . </section>
  </body>
  <copyright>Copyright© 2008 Mark Logic Corporation. All rights Reserved.</copyright>
</article>

What Queries Is It Good At?

1) Full-Text Search

Find all documents that contain the phrase “high performance”.

2) XML Structure

Find all articles that have an abstract.

3) XML Semantics

Find all documents that mention the company “Mark Logic”.

4) All of the above . . . at the same time

Find all articles that contain the phrase “high performance” and mention the company Mark Logic in the abstract.

1) Full-text Search

Find all documents that contain the phrase “high performance”

1) Full-text Search

Doc   very  high  performance  index
122   0     1     0            0
123   1     0     1            1
124   0     0     0            0
125   0     1     0            0
126   0     1     1            0
127   1     0     0            0
129   1     1     0            0
130   0     1     1            1

Find all documents that contain the phrase “high performance”

1) Full-text Search

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
“very high”             . . .
“performance index”     . . .

Document References: 126, 130, 167, 212, 219, 377 . . .

Find all documents that contain the phrase “high performance”
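The term lists above can be sketched in a few lines of Python. This is only an illustration of the idea, not MarkLogic's implementation: the function name `build_index`, the sample texts, and the choice to index every adjacent word pair as a phrase term are assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word and each adjacent word pair to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for word in words:
            index[word].add(doc_id)
        # Two-word phrases get their own term lists, so a phrase is a single lookup.
        for first, second in zip(words, words[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index

# Hypothetical documents: the ids echo the slide, the text is invented.
docs = {
    126: "a high performance engine",
    129: "very high throughput",
    130: "high performance index",
}
index = build_index(docs)
phrase_hits = sorted(index["high performance"])
print(phrase_hits)
```

Resolving the phrase is one dictionary lookup; no document has to be opened or scanned at query time.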

2) XML Structure

Find all articles that have an abstract

2) XML Structure

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .

Document References: 126, 130, 167, 212, 219, 377 . . .

Find all articles that have an abstract

3) XML Semantics

Find all documents that mention the company “Mark Logic”

3) XML Semantics

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .
<company>Mark Logic</   . . .

Document References: 126, 130, 167, 212, 219, 377 . . .

Find all documents that mention the company “Mark Logic”

4) All Of The Above

Find all articles that contain the phrase “high performance” and

mention the company “Mark Logic” in the abstract

4) All Of The Above

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .
<abstract>/<company>    . . .
<company>Mark Logic</   . . .

Document References: 126, 130, 167, 212, 219, 377 . . .

Find all articles that contain the phrase “high performance” and mention the company “Mark Logic” in the abstract
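A combined query of this kind can be pictured as an intersection of term lists. The sketch below is a simplified illustration and not MarkLogic's actual index format: the function `index_document`, the key notation (for example `<company>Mark Logic` for an element-value term), and the three sample documents are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def index_document(index, doc_id, xml_text):
    """Add word, phrase, element, parent/child and element-value terms for one document."""
    root = ET.fromstring(xml_text)
    words = " ".join(root.itertext()).lower().split()
    for word in words:
        index[word].add(doc_id)
    for first, second in zip(words, words[1:]):
        index[f"{first} {second}"].add(doc_id)
    index[f"<{root.tag}>"].add(doc_id)
    for parent in root.iter():
        for child in parent:
            index[f"<{child.tag}>"].add(doc_id)
            index[f"<{parent.tag}>/<{child.tag}>"].add(doc_id)
            # Element-value term, recorded only for leaf elements that carry text.
            if child.text and len(child) == 0:
                index[f"<{child.tag}>{child.text.strip()}"].add(doc_id)

# Hypothetical documents: ids echo the slide, content is invented.
docs = {
    126: "<article><abstract><company>Mark Logic</company> has an answer</abstract>"
         "<body><section>This high performance engine</section></body></article>",
    130: "<article><body><section>A high performance index by "
         "<company>Mark Logic</company></section></body></article>",
    167: "<report><abstract><company>Mark Logic</company></abstract>"
         "<body>high performance</body></report>",
}
index = defaultdict(set)
for doc_id, xml_text in docs.items():
    index_document(index, doc_id, xml_text)

terms = ["high performance", "<article>", "<abstract>/<company>", "<company>Mark Logic"]
matches = set.intersection(*(index[term] for term in terms))
print(sorted(matches))
```

Document 130 drops out because its company element is not inside an abstract, and 167 drops out because it is not an article. Text, structure and value constraints are all answered by the same kind of lookup.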

Scalar Indexes

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .
<abstract>/<company>    . . .
<company>Mark Logic</   . . .

Document References: 126, 130, 167, …

Identify a set of documents based on criteria and then characterize the set with scalar indexes (float, dateTime, string etc.)
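One way to picture a scalar index is a list of (value, document id) pairs kept sorted by value. The sketch below assumes a hypothetical index on a `year` value; the names `year_index`, `docs_in_range` and `value_counts` are invented for illustration and say nothing about how MarkLogic stores its range indexes.

```python
from bisect import bisect_left, bisect_right

# Hypothetical scalar index: (value, document id) pairs sorted by value.
year_index = sorted([(2001, 126), (2003, 130), (2003, 167), (2007, 212), (2008, 219)])

def docs_in_range(range_index, low, high):
    """Return ids of documents whose value lies in [low, high], found by binary search."""
    start = bisect_left(range_index, (low, float("-inf")))
    end = bisect_right(range_index, (high, float("inf")))
    return [doc_id for _, doc_id in range_index[start:end]]

def value_counts(range_index, doc_ids):
    """Characterize a result set: how many of its documents carry each value."""
    counts = {}
    for value, doc_id in range_index:
        if doc_id in doc_ids:
            counts[value] = counts.get(value, 0) + 1
    return counts

in_range = docs_in_range(year_index, 2003, 2007)
facets = value_counts(year_index, {126, 130, 167})
print(in_range, facets)
```

The first function selects documents by value; the second takes a set already identified by other criteria and summarizes it, which is the "characterize the set" step on the slide.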

Geospatial, too

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .
<abstract>/<company>    . . .
<company>Mark Logic</   . . .

Document References: 126, 130, 167, …

Just a special kind of scalar index, except values are points and scan operators know about Earth geometry

Universal Index Is Our Hammer

We turn queries into nails

Examples Of Nails

• Directories

– Exclusive, hierarchical, analogous to file system, map to URI

• Collections

– Set-based, N:N relationship

• Security

– Invisible to your app

Many Shapes And Sizes

News Article, Book, Research Report, Slide Presentation, Product Sheet, Operations Manual

Load As Is

XML is self-describing

<article>
  <title>MarkLogic Server: . . .</title>
  <author>
    <first-name>John</first-name>
    <last-name>Kreisa</last-name>
  </author>
  <abstract>
    . . . . <company>Mark Logic</company>
  </abstract>
  <body>
    <section>
      <section> . . .</section>
    </section>
    <section> . . . index . . . </section>
  </body>
  <copyright>Copyright© . . . </copyright>
</article>

Load As Is

[Figure: the same article drawn as a tree. <article> has children <title>, <author>, <abstract>, <body> and <copyright>; <author> holds <first-name> and <last-name>, <abstract> holds <company>, and <body> holds nested <section> elements. Leaf values include "MarkLogic Server: . . .", "John", "Kreisa" and "MarkLogic".]

XML is self-describing: No Schema Needed!

Degrees Of Flexibility

[Figure: chart with two axes, Structure and Queries, each running from Predefined to Ad hoc. Systems plotted: IMS, IDMS, Relational Databases, Search Engines, MarkLogic Server, XML Server.]

The Query Language

XML

Universal Index

XQuery

Full-Text Search

XML Structure

XML Semantics

Application Logic

Manipulate XML

Render Results

Load As Is

The Programming Language

XML

Universal Index

XQuery

Full-Text Search

XML Structure

XML Semantics

Application Logic

Manipulate XML

Render Results

Load As Is

A Different Approach

• Soul of a Search Engine: Data Model And Queries

• Database: On-disk Organization And Transactions

What’s In A Database?

• No tables

• No rows

• Forests . . .

. . . of trees

[Diagram: a Database containing Forest1, Forest2 and Forest3]

The Cluster

[Diagram: hosts e1, e2 . . . ek; hosts d1, d2, d3 . . . dl; forests Forest1, Forest2, Forest3, Forest4 . . . Forestm]

What About Updates?

• Typical XML document:

– 10KB – 1MB

– Referenced by 1,000s to 10,000s of term lists

• Search engines are bad at updates

– Many indexes to update

– Option: Index and Information out of sync

– Option: Slow

• We want

– High throughput

– Transactions (ACID)

• So how do we avoid updates?

Solution: Temporal Database

• No update! No delete!

• Only insert and read-at-a-time

• Every document has two timestamps

– “created”, “expired”
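The insert-only model can be sketched as follows. This is a toy illustration under stated assumptions: the class `TemporalStore`, its method names and the integer clock are invented for the example and are not MarkLogic APIs.

```python
INFINITY = float("inf")

class TemporalStore:
    """Toy store in which every document version carries created/expired timestamps."""

    def __init__(self):
        self.versions = []  # each entry: [uri, content, created, expired]
        self.clock = 0

    def _expire(self, uri, now):
        # Stamp the live version (if any); its content is never modified.
        for version in self.versions:
            if version[0] == uri and version[3] == INFINITY:
                version[3] = now

    def insert(self, uri, content):
        """Create or 'update': expire the live version, then append a new one."""
        self.clock += 1
        self._expire(uri, self.clock)
        self.versions.append([uri, content, self.clock, INFINITY])
        return self.clock

    def delete(self, uri):
        """'Delete' only sets the expired timestamp."""
        self.clock += 1
        self._expire(uri, self.clock)
        return self.clock

    def read(self, uri, at):
        """Return the content visible at timestamp `at`, or None."""
        for u, content, created, expired in self.versions:
            if u == uri and created <= at < expired:
                return content
        return None

store = TemporalStore()
t1 = store.insert("a.xml", "<a>v1</a>")
t2 = store.insert("b.xml", "<b/>")
t3 = store.insert("a.xml", "<a>v2</a>")
t4 = store.delete("b.xml")
print(store.read("a.xml", t1), store.read("a.xml", t3), store.read("b.xml", t4))
```

A reader that fixes its timestamp at the start keeps seeing the same versions no matter what is inserted afterwards, which is why such reads need no locks.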

Temporal Database

[Timeline figure: timestamps 520 and 528 are marked. Operations shown: Create a.xml, Create b.xml, Update a.xml, Update a.xml, Delete b.xml . . . Two Query markers sit on the timeline.]

A Single Forest

[Diagram, repeated on each of the four step slides below: a Host holding Forestk, which contains Stand1, Stand2 . . . Standn and two Buffers]

1. Create A New Tree

2. Expire Trees

3. Save A Buffer To Disk

4. Optimization: Merge Stands

The Four Forest Operations

1. Create a new document

• Into a buffer

2. Mark a document as expired

• Memory-mapped document timestamps per stand

3. Write buffer out to disk

• Our buffers are 100s of megabytes

• For performance, double buffer

4. Merge

• Background process

• Optimization: reduces number of stands in forest
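The four operations can be sketched with plain Python lists standing in for the buffer and the stands. Everything here is a toy model under stated assumptions: the class `ToyForest`, the tiny `buffer_limit`, and the rule that a merge drops every expired record are simplifications chosen for the example, not a description of the real on-disk format.

```python
class ToyForest:
    """Toy model of the four operations: insert, expire, flush, merge."""

    def __init__(self, buffer_limit=2):
        self.buffer = []   # in-memory records not yet written out
        self.stands = []   # each stand is a list of records
        self.buffer_limit = buffer_limit

    def insert(self, uri, content):
        """1. Create a new document in the buffer, expiring any older version."""
        self.expire(uri)
        self.buffer.append({"uri": uri, "content": content, "expired": False})
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def expire(self, uri):
        """2. Mark existing versions as expired; nothing is rewritten in place."""
        for stand in self.stands + [self.buffer]:
            for record in stand:
                if record["uri"] == uri:
                    record["expired"] = True

    def flush(self):
        """3. Write the buffer out as a new stand and start an empty buffer."""
        if self.buffer:
            self.stands.append(self.buffer)
            self.buffer = []

    def merge(self):
        """4. Combine all stands into one, dropping expired records."""
        live = [r for stand in self.stands for r in stand if not r["expired"]]
        self.stands = [live] if live else []

forest = ToyForest(buffer_limit=2)
forest.insert("a.xml", "v1")
forest.insert("b.xml", "v1")   # buffer is full, so it is flushed to a stand
forest.insert("a.xml", "v2")   # expires the older a.xml
forest.insert("c.xml", "v1")   # second flush
stands_before = len(forest.stands)
forest.merge()
live_uris = sorted(r["uri"] for r in forest.stands[0])
print(stands_before, len(forest.stands), live_uris)
```

Writes only ever append, and the cost of cleaning up expired versions is deferred to the background merge.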

Consistency And Throughput

• 2-phase commit

– Transactions span forests

• Recovery

– Forest Journals

• Lock-free queries

– Use the search engine at a point-in-time

– Increased throughput

– Time travel?

A Different Approach

• Soul of a Search Engine: Data Model And Queries

• Database: On-disk Organization And Transactions

Native XQuery Support

• MarkLogic supports XQuery as its native interface

– Query language designed for querying XML data and content

– An open, W3C standard

• Example content query: quality assurance

• Find all medical procedures in which no anesthesia occurs before the first incision

for $proc in /book/section[title = "Procedure"]
where not (some $a in $proc//anesthesia
           satisfies $a << ($proc//incision)[1])
return $proc
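For readers less familiar with the `<<` (precedes in document order) operator, the same check can be written out step by step. The Python sketch below is only an explanatory analogue: the function name and the sample `<book>` are assumptions, and the `id` attributes exist purely so the result can be printed.

```python
import xml.etree.ElementTree as ET

def procedures_without_prior_anesthesia(book_xml):
    """Return Procedure sections in which no anesthesia precedes the first incision."""
    root = ET.fromstring(book_xml)  # root is assumed to be <book>
    result = []
    for section in root.findall("section"):
        if not any(t.text == "Procedure" for t in section.findall("title")):
            continue
        # Position of every element in document order within this section.
        order = {id(el): pos for pos, el in enumerate(section.iter())}
        incisions = list(section.iter("incision"))
        first_incision = order[id(incisions[0])] if incisions else None
        preceded = first_incision is not None and any(
            order[id(a)] < first_incision for a in section.iter("anesthesia")
        )
        if not preceded:
            result.append(section)
    return result

# Hypothetical sample data.
book = """<book>
  <section id="p1"><title>Procedure</title><incision/><anesthesia/></section>
  <section id="p2"><title>Procedure</title><anesthesia/><incision/></section>
  <section id="h1"><title>History</title><incision/></section>
</book>"""
flagged = [s.get("id") for s in procedures_without_prior_anesthesia(book)]
print(flagged)
```

The XQuery version expresses the whole loop in four lines because node order is a built-in comparison in the language.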

Manipulate Content

• Navigate within content

– Walk through the tree structure of the document, e.g.:

– Create breadcrumb trail to top of document

– Move to adjacent paragraphs, illustrations, tables, or captions

• Modify content programmatically

– Translate content to different languages

– Alphabetize index terms and produce new index sheet

– Summarize by returning lead paragraphs or topic sentences

• Combine content from multiple sources

– Nested queries across content sources

– Create common index across content from multiple sources

Render Content

• Flexibly output content for multi-channel delivery

– XHTML for web browsers

– XSL-FO for PDF generation, custom publishing

– WML for mobile devices

– Office XML for Microsoft Office documents

• High-performance, server-based transformations

– Performed close to the content

– Faster than XSLT

Search Processing Model

(1) Specify a search using composable constructors

cts:and-query(("wrist", "injury"))

cts:or-query((cts:and-query(("cat", "scratch")),
              cts:and-query(("dog", "bite"))))

cts:and-not-query("United States", "Texas")

cts:element-query(xs:QName("Year"),
                  cts:or-query(("1980", "1981")))

(2) Define a searchable set of nodes

//MedlineCitation[
  Journal/JournalIssue/PubDate/Year = "1980"]

(3) Apply the search query to the nodes

cts:search(//MedlineCitation,
           cts:and-query(("wrist", "injury")))

(4) Return the results in relevance order

Free Text Search

cts:word-query( $text as xs:string,
                [$options as xs:string*],
                [$weight as xs:double] )
  as cts:word-query

• Options include:

"case-sensitive"            Specifies a case-sensitive query
"case-insensitive"          Specifies a case-insensitive query
"punctuation-sensitive"     Specifies a punctuation-sensitive query
"punctuation-insensitive"   Specifies a punctuation-insensitive query
"stemmed"                   Specifies a stemmed query
"unstemmed"                 Specifies an unstemmed query
"wildcarded"                Specifies a wildcarded query
"unwildcarded"              Specifies an unwildcarded query
"lang=en"                   Specifies (e.g.) that the query is in English

Boolean Queries

• cts:and-query()

conjunction of an arbitrary list of sub-queries

• cts:or-query()

disjunction of an arbitrary list of sub-queries

• cts:and-not-query()

relative complement of two sub-queries

• cts:not-query()

complement of a single sub-query
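Over term lists, these four constructors correspond to ordinary set operations. The sketch below is a conceptual illustration only: the Python functions borrow the constructor names but are not the `cts:` API, and the term lists and document ids are made up.

```python
# Hypothetical term lists: word -> ids of documents containing it.
term_lists = {
    "cat": {1, 2, 3},
    "scratch": {2, 3},
    "dog": {3, 4},
    "bite": {4, 5},
}
all_docs = {1, 2, 3, 4, 5, 6}

def word(term):
    """Look up the term list for a single word."""
    return term_lists.get(term, set())

def and_query(*queries):
    """Conjunction: documents matching every sub-query."""
    return set.intersection(*queries) if queries else set(all_docs)

def or_query(*queries):
    """Disjunction: documents matching any sub-query."""
    return set().union(*queries)

def and_not_query(positive, negative):
    """Relative complement: matches of the first sub-query minus the second."""
    return positive - negative

def not_query(query):
    """Complement with respect to the whole database."""
    return all_docs - query

result = or_query(
    and_query(word("cat"), word("scratch")),
    and_query(word("dog"), word("bite")),
)
print(sorted(result))
```

Because each sub-query evaluates to a set of document ids, the constructors compose to any depth without special cases.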

Linguistic Controls

• “case-sensitive”, “case-insensitive” options

– Configuration option to add case-sensitive index terms

cts:word-query("Genetic Engineering", "case-insensitive")

• “punctuation-sensitive”, “punctuation-insensitive”

– Configuration option to add punctuation-sensitive index terms

cts:word-query("Genetic-Engineering", "punctuation-insensitive")

• Stemming - “stemmed”, “unstemmed” query options

– Stemming does not cross different parts of speech

• Thesaurus – XML Schema, query expansion

Spelling

• Double-metaphone

• Spelling suggestions

spell:suggest("/mySpell/spell.xml", "alfabet")

• Spell checking

spell:is-correct("/mySpell/spell.xml", "alfabet")

• Dictionary load and management

spell:load("c:\dictionaries\spell.xml", "/mySpell/spell.xml")

spell:add-word("/mySpell/spell.xml", "uxorious")

spell:remove-word("/mySpell/spell.xml", "atomise")

Wildcards

• A*, *B, A*B, A?, ?B, A?B, A*B*C, A*B?C*, etc.

• Regular expression optimization

• For example,

cts:search(input(), cts:word-query("he*"))

will result in a wildcard search

• Character indexing provides optimization for fn:contains(), fn:matches(), fn:starts-with(), fn:ends-with()
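A wildcard search can be thought of as expanding the pattern against the known terms and then combining their term lists. The sketch below assumes that model and uses Python's `fnmatchcase` as a stand-in matcher; the term lists are invented, and a real server would use dedicated wildcard indexes rather than scanning every term.

```python
from fnmatch import fnmatchcase

# Hypothetical term lists: word -> ids of documents containing it.
term_lists = {
    "he": {1},
    "hello": {1, 2},
    "help": {3},
    "the": {4},
    "ship": {5},
}

def wildcard_query(pattern):
    """Union the term lists of every term matching the pattern (* = any run, ? = one character)."""
    hits = set()
    for term, doc_ids in term_lists.items():
        if fnmatchcase(term, pattern):
            hits |= doc_ids
    return hits

print(sorted(wildcard_query("he*")), sorted(wildcard_query("?he")))
```

Note that `fnmatchcase` also gives `[` and `]` a special meaning, which the slide's wildcard syntax does not mention.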

Proximity Queries

cts:near-query($queries, $distance, $ordered, $weight)

• The results match if two queries match and the distance between the two matches is equal to or less than the specified distance. A distance of 0 matches only when there is overlapping text. The default value is 100.

• For example,

cts:search(//p,
  cts:near-query(
    (cts:word-query("James"),
     cts:word-query("Maxwell")), 2))

Proximity Queries

For example,

cts:search(//p,
  cts:near-query((
    cts:near-query(("James", "Maxwell"), 2),
    cts:near-query(("Albert", "Einstein"), 2),
    cts:near-query(("Lorentz", "Contraction"), 2)
  ), 50, "unordered"))
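The proximity test itself can be sketched with word positions. This is a simplified illustration: the function `near` is hypothetical, and it assumes distance is the difference between word positions, which may not match exactly how `cts:near-query` counts.

```python
def positions(words, term):
    """Return the positions at which `term` occurs in the word list."""
    return [i for i, word in enumerate(words) if word == term]

def near(words, first, second, distance, ordered=False):
    """True if `first` and `second` occur within `distance` positions of each other."""
    for i in positions(words, first):
        for j in positions(words, second):
            gap = j - i
            if ordered and gap < 0:
                continue  # ordered matching requires `first` to come before `second`
            if abs(gap) <= distance:
                return True
    return False

words = "james clerk maxwell studied electromagnetism".split()
print(near(words, "james", "maxwell", 2))
print(near(words, "maxwell", "james", 2, ordered=True))
```

The nested example above applies the same test twice: first to word pairs, then to the matched pairs themselves within a wider window of 50.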

Beyond Free Text Search

• XML Query / search integration

– XML granular search

– XPath constraints

– Rich interaction between text and structural constraints

– Free access to all fields and combinations of constraints

• XML searchable database

– Integrate data, metadata, search and update

Range Queries

• Numeric range queries are optimized with range indexes

//article[date <= xs:dateTime("2002-10-10T17:00:00Z")]

• Lexicographic range queries, likewise

//article[("A" <= name) and (name < "B")]

• Sort optimization uses range indexes to eliminate post-sort

for $x in //article
order by $x/last/name, $x/first/name
return <li>{ $x/date }</li>

Structured Highlighting

• Embed hyperlink to commercial drug equivalent for each instance of a generic drug:

define function lookup-drug-name($name as xs:string)
{
  <xhtml:selection>
  {
    doc("drug-list.xml")/name[.=$name]/variants
  }
  </xhtml:selection>
}

for $a in cts:search(//articles, "ibuprofen")
return
  cts:highlight($a, cts:word-query("ibuprofen"),
                lookup-drug-name("ibuprofen"))

Summary

• XML as data model

– Ad hoc schema

• A search engine core

– Universal Index

• Temporal transaction model

– High throughput while keeping . . .

• Performance and scalability of a search engine

Questions?

Thank You.

Please contact for further info [email protected]