LEADING EDGE FORUM CSC PAPERS
Copyright © 2013 Computer Sciences Corporation. All rights reserved.

NOSQL FOR NEXT GEN CONTENT MANAGEMENT SYSTEM

K V S Ranga Prasad
J Deepika
M Ravi Chandrasekhar
R Ravi Sudharson
Naren Raghavendra Suri

CSC Papers 2013


1. INTRODUCTION

For users, a Content Management System [CMS] comprises the following processes:

Collecting relevant content and storing it to database

Content management and processing

Publishing the content in different forms as per end user needs

In the real world, most domains, systems and applications have a CMS in place. Electronic media, health care [e.g. code books], complex dictionaries, online reservation systems and logistics, to name a few, are big businesses in the market.

Big Data is a buzzword today, and NoSQL databases are already captivating the market. To stay ahead of the race, we should seize the business opportunity before clients identify the need for change. High-performance open-source technologies such as Big Data and NoSQL clearly deserve attention.

To assist clients and organizations that use a legacy Content Management System, we carried out a case study / POC on one such system. As part of this, we studied the current system end to end and identified the areas that need to be addressed in order to transform the current, marginally efficient system into a Real-Time Content Management System.

To address this, we started looking for a new model that can enhance the current system without affecting the structure or functionality expected of a Content Management System, i.e. Content Generation, Work-Flow, Authorization, Validation, Publishing etc. The thought process behind this new model paved the way for evaluating NoSQL capabilities in place of our current RDBMS [Oracle].

The proposed new model / solution is expected to enhance the system and bring in the benefits of using NoSQL in place of a traditional RDBMS. Benefits include:

Real-time Content Management System

Operational Efficiency

High Performance

Cost Benefits [open-source]

The goal of this paper is to describe the solution / new model and showcase the results.

1.1 BRIEF NOTE ON NOSQL DATABASES

A NoSQL [in short: Not Only SQL] database provides a mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases in order to achieve horizontal scalability with ease and provide higher availability. In general, NoSQL databases are classified as:

Key / Value – Voldemort, SimpleDB, Memcached, Amazon's Dynamo

Document – MongoDB, CouchDB

Column – HBase, Cassandra

Graph – Neo4j, InfiniteGraph

Others – Geospatial, File System, Object


1.2 NOSQL CAPABILITIES

While creating the new model for the legacy CMS, we held brainstorming sessions to identify the technologies best suited to a model that brings more benefits to the customer. The outcome was to switch the CMS data storage from the RDBMS [Oracle] to a NoSQL database because of the following NoSQL capabilities:

High Performance

High Availability [In-built Caching Mechanism]

Handle Huge Volume of Data

DB Scaling-Out [adding nodes to existing set-up] is Elastic in nature

2. EVALUATION OF CURRENT SYSTEM

As described in the previous section, we performed an end-to-end study of the current CMS, a legacy system. We evaluated the system phase by phase to identify the areas where the current process consumes the most time or delays the publishing of content to end clients / downstream applications.

2.1 CURRENT SCENARIO

Our client's legacy Content Management System [CMS] is designed to publish content to downstream applications in the desired formats [including .txt, .dat, .xml, .html, .sql etc.]. During content publishing, the CMS fetches the content from Oracle [where it is stored in XML format as a CLOB] and processes it in stages. The design of the current system is not feasible for real-time content publishing [which was expected of the system] because the content generation process itself takes a very long time. As a result, content is published at intervals: daily, weekly, monthly or quarterly. Here is our system wiring diagram:

[Figure: current system diagram showing the Data Layer [RDBMS - Oracle], Validation Layer, Content Generation, Content Processing, Content Publishing, File Server and Applications]


2.2 CURRENT SYSTEM BASE-LINE METRICS

We collected statistics from the current system as a basis for comparison. These statistics help us understand how far our proposed solution / new model improves the system and moves it towards our goal of a Real-Time Content Management System.

2.2.1 TIME CONSUMED BY THE SYSTEM TO PROCESS THE CONTENT

This metric depicts the time the CMS takes to process the content, make it ready for publishing and finally deliver it to the customer. In our study we segregated the processed content by complexity and volume. The figures below [marked in Green] show that most of the content requires more than 3 hours of processing time.

2.2.2 TIME CONSUMED FOR CMS PROCESS IN A QUARTER

This metric depicts the effort in hours consumed by the different phases of the current CMS in generating, processing and publishing the content.

[Figure for 2.2.1: share of content by processing time: < 1 hr 10%, 1 hr to 3 hrs 23%, > 3 hrs 67%]

[Figure for 2.2.2: Effort In Hours per CMS phase in a quarter: 441, 1012, 552]


2.2.3 FREQUENCY OF CONTENT DELIVERED TO ONLINE CUSTOMERS

This metric depicts how frequently content is published to the end customer. Our study of the current legacy CMS shows that 74% of the content is not published and delivered to end customers on a daily basis.

2.3 CHALLENGES IN CURRENT SYSTEM

The current system faces the following challenges:

The system stores its content in an RDBMS [Oracle] and fetches it from there for processing. The RDBMS response time for content queries [returning more than 100K records] adds considerable delay.

The system model / architecture / process adds a large delay because of the many layers between content generation and publishing and delivery to the end customer.

The system is not elastic in nature because it delivers content periodically.

The system uses XML as the medium for content generation and processing. Processing huge XML files using XSLT requires additional processing time.

During content generation, searching for the appropriate Lexicon content [concepts / terms] in the Healthcare dictionary is a tedious activity.

[Figure for 2.2.3: frequency of content delivery: Annually 1%, Quarterly 36%, Monthly 23%, Bi-Weekly 1%, Weekly 12%, Daily 26%]


3. OUR APPROACH

We looked for a solution that can pave the way to our goal: a Real-Time Content Management System. The outcome of our thought process is a new model that enhances the current system.

We strongly recommend an in-house solution that can be carried out phase by phase rather than in a Big-Bang way.

Based on our analysis, research and estimation techniques, we plotted different paths to reach our final goal: a Real-Time Content Management System using a NoSQL DB.

This path-plotted graph, along with the Break-Even Analysis, is the heart and soul of our new model / solution to transform the existing system.

Let us take a close look at the graph. The X-axis defines the Operational Cost + Business Benefits, while the Y-axis defines the Performance at different stages. The paths are plotted according to the medium in which data is stored and processed using the NoSQL DB and published as per end-user requirements, i.e. in the form of html, xml, text, pdf, word etc.

PATH-PLOTTED GRAPH TO CHOOSE THE BEST PROJECT EXECUTION


BREAK-EVEN ANALYSIS FOR THE SUGGESTED MODEL

The path-plotted graph shows how the project can be handled depending on the path chosen by the client. Each path is estimated in person-years.

Whichever of the three paths the client chooses, the benefits can be reaped ≈ 3 years after the project starts.

Cumulative cost        1st Year     2nd Year     3rd Year     4th Year     5th Year
Existing system        $839,560     $1,679,120   $2,518,680   $3,358,240   $4,197,800
New system             $1,372,800   $2,082,300   $2,494,800   $2,907,300   $3,319,800
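
The ≈ 3 year break-even point can be checked directly from the cumulative figures of the break-even chart. The following is a minimal sketch in plain Java; the class and method names are hypothetical, and the assignment of the two value series to the existing and new system is an assumption based on the order of the chart legend.

```java
import java.util.OptionalInt;

/** Break-even sketch using the cumulative yearly costs read off the break-even chart. */
class BreakEven {
    // Cumulative cost in USD at the end of years 1..5.
    // Assumption: the first series belongs to the existing system, the second to the new one.
    static final long[] EXISTING = {839_560, 1_679_120, 2_518_680, 3_358_240, 4_197_800};
    static final long[] PROPOSED = {1_372_800, 2_082_300, 2_494_800, 2_907_300, 3_319_800};

    /** Returns the first 1-based year in which the new system's cumulative cost is not higher. */
    static OptionalInt breakEvenYear(long[] existing, long[] proposed) {
        int years = Math.min(existing.length, proposed.length);
        for (int i = 0; i < years; i++) {
            if (proposed[i] <= existing[i]) {
                return OptionalInt.of(i + 1);
            }
        }
        return OptionalInt.empty(); // no break-even within the plotted horizon
    }

    public static void main(String[] args) {
        OptionalInt year = breakEvenYear(EXISTING, PROPOSED);
        System.out.println(year.isPresent()
                ? "Break-even in year " + year.getAsInt()
                : "No break-even within the plotted horizon");
        System.out.println("Cumulative saving after year 5: $" + (EXISTING[4] - PROPOSED[4]));
    }
}
```

With these figures the new system becomes cheaper on a cumulative basis in the 3rd year, which agrees with the ≈ 3 year estimate.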


3.1 ENHANCE THE DATA LAYER / FILE SERVER LAYER WITH NOSQL DB

We took our initial steps by enhancing the Data Layer of our system. As discussed in the Introduction, we were drawn to NoSQL by its capabilities and business benefits.

Of all the NoSQL databases, we opted for the document-oriented database MongoDB [named from "huMONGOus", meaning "extremely large"] because of its usability and ease of installation, dynamic schema, open-source licensing [which is cost effective], Replica Sets for Master / Slave replication and automatic fail-over, and Sharding for elastic DB scale-out.

In our current system, the Content Generation, Content Processing and Content Publishing processes are not effectively coupled. Content Generation may add, update or delete a piece of content, and the RDBMS is updated with that change, but Content Processing and Content Publishing are periodic processes that are coupled to each other.

Content Processing is periodic: different content runs on a daily, weekly, monthly or quarterly basis. Because of this, the latest content changes are not processed and are not available for Content Publishing. To achieve a Real-Time Content Management System, we need to couple all the required processes so that any change to the CMS database triggers the downstream processes that publish the content to end-user applications.

MongoDB capabilities can help us enhance our system by improving the Content Processing and Content Publishing layers, thereby reducing the overall time consumed. The model is shown in the accompanying diagram.


Implementing this model / solution eliminates the need for the RDBMS [Oracle], file servers, some background copy jobs, etc. MongoDB acts as the Central Repository for the Content Processing and Content Publishing layers. Once content is processed, it is stored in the Central Repository, and the publishing layer can pick up the approved content and start publishing, which saves a considerable amount of time.

The new model allows the system to publish the latest content changes to end customers as they occur instead of publishing periodically.
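
The coupling described here, where a change to the central repository immediately triggers publishing, can be sketched as a simple change-listener arrangement. The following is a minimal in-memory illustration in plain Java, not MongoDB driver code: ContentRepository, Publisher, onChange and save are hypothetical names introduced for this sketch, and a real deployment would persist to MongoDB and notify the publishing layer by whatever mechanism the system provides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for the central repository of processed content. */
class ContentRepository {
    private final Map<String, String> approvedContent = new LinkedHashMap<>();
    private final List<Consumer<String>> listeners = new ArrayList<>();

    /** Registers a downstream process that is called with the id of each changed item. */
    void onChange(Consumer<String> listener) {
        listeners.add(listener);
    }

    /** Stores processed content and immediately notifies every registered listener. */
    void save(String contentId, String body) {
        approvedContent.put(contentId, body);
        for (Consumer<String> listener : listeners) {
            listener.accept(contentId);
        }
    }

    String get(String contentId) {
        return approvedContent.get(contentId);
    }
}

/** Hypothetical publishing layer; it only records what it would send downstream. */
class Publisher {
    final List<String> published = new ArrayList<>();

    void publish(String contentId, String body) {
        published.add(contentId + ":" + body);
    }
}

class RealTimePublishingDemo {
    public static void main(String[] args) {
        ContentRepository repository = new ContentRepository();
        Publisher publisher = new Publisher();

        // Couple publishing to repository changes instead of a periodic schedule.
        repository.onChange(id -> publisher.publish(id, repository.get(id)));

        repository.save("lexicon-term-42", "<term>example</term>");
        System.out.println(publisher.published);
    }
}
```

The design point is that the publishing step is driven by the save itself, so no daily, weekly or quarterly job has to discover the change later.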

As part of this phase, we migrated part of the existing content [270,000 xml files and 1,500,000 lexicon terms]. After the migration, we took a process that runs for < 1 hour in our existing system. We first measured a base-line for the content generation processing time on the current system and then repeated the measurement with the new system in place.

HUGE CUT-DOWN IN CONTENT PROCESSING TIME

The metrics below depict the time consumed at each stage of Content Processing before and after replacing the RDBMS with NoSQL [MongoDB].

MongoDB improved system performance and cut the total processing time by a factor of ≈ 5.

Based on the results obtained, the overall system performance is improved by 60%.

[Figure: Content Processing time before and after replacing the RDBMS with MongoDB]

Stage                 Before [in mins]   After [in mins]
Extraction Time       21                 1.22
Mastering Time        5                  1.35
Total Process Time    28                 5

[Figure: Effort In Hours per CMS phase in a quarter. Before: 441, 1012, 552. After: 300, 409, 300]
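
The ≈ 5 times reduction in total processing time follows from dividing the time before by the time after for each stage. The following is a minimal sketch in plain Java; the Speedup class is a hypothetical name, and the pairing of each value with its stage is assumed from the order in which the chart lists them.

```java
/** Speedup sketch for the Content Processing stages, using the before / after minutes from the chart. */
class Speedup {
    /** Ratio of the old duration to the new one; 2.0 means the stage now takes half the time. */
    static double speedup(double beforeMinutes, double afterMinutes) {
        if (afterMinutes <= 0) {
            throw new IllegalArgumentException("afterMinutes must be positive");
        }
        return beforeMinutes / afterMinutes;
    }

    public static void main(String[] args) {
        // Assumed pairing: values are matched to stages in the order the chart lists them.
        String[] stages = {"Extraction Time", "Mastering Time", "Total Process Time"};
        double[] before = {21, 5, 28};
        double[] after = {1.22, 1.35, 5};

        for (int i = 0; i < stages.length; i++) {
            System.out.printf("%-20s %.1fx faster%n", stages[i], speedup(before[i], after[i]));
        }
    }
}
```

Under that pairing the total process time improves by 28 / 5 = 5.6 times, with extraction improving far more than mastering.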


VERY QUICK IN FETCHING CONTENT

We also studied the data fetch process. We created equivalent queries in the RDBMS [Oracle] and in MongoDB that simply fetch the content from the DB and count the number of records. Here is the result:

Oracle Query:

SELECT B.XML_DATA FROM TABLE B WHERE B.NODE_ID=? AND B.CURRENTPUB_FLAG=? AND B.ACTIVITY_STATUS=?

MongoDB Query:

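import com.mongodb.BasicDBObject; import com.mongodb.Bytes; import com.mongodb.DBCursor;

// Filter document; BasicDBObject is assumed here, and AU_TYPE etc. are field-name constants defined elsewhere
BasicDBObject queryObject = new BasicDBObject();

queryObject.put(AU_TYPE, "xxxxx");

queryObject.put(ACTIVITY_STATUS, "ACTIVE");

queryObject.put(CURRENTPUB_FLAG, "T");

// NOTIMEOUT keeps the server from closing the cursor while the large result set is read
DBCursor cursor = collection.find(queryObject).addOption(Bytes.QUERYOPTION_NOTIMEOUT).addOption(Bytes.QUERYOPTION_AWAITDATA);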

This query [in both SQL & NoSQL] is expected to fetch ≈ 110k records from the database.

Our study results showed that MongoDB is ≈ 16 times faster than Oracle in fetching the content from the database.

[Figure: Query Response Time: Before 19 minutes, After 1.2 minutes]
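
The query response times can also be expressed as fetch throughput, which makes the ≈ 16 times difference easier to relate to record counts. The following is a rough sketch in plain Java; it assumes exactly 110,000 records [the text only gives ≈ 110k] and the FetchThroughput class name is hypothetical.

```java
/** Throughput sketch for the fetch test, using the query response times from the chart. */
class FetchThroughput {
    /** Average number of records fetched per second over the whole query. */
    static double recordsPerSecond(long records, double minutes) {
        if (minutes <= 0) {
            throw new IllegalArgumentException("minutes must be positive");
        }
        return records / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long records = 110_000; // assumption: the text states about 110k records
        double oracle = recordsPerSecond(records, 19.0);
        double mongo = recordsPerSecond(records, 1.2);

        System.out.printf("Oracle:  %.1f records/s%n", oracle);
        System.out.printf("MongoDB: %.1f records/s%n", mongo);
        System.out.printf("Ratio:   %.1fx%n", mongo / oracle);
    }
}
```

The ratio depends only on the two response times [19 / 1.2], so it is unaffected by the exact record count.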


LOW OPERATIONAL-COSTS

To compare the operational costs of MongoDB and the RDBMS [Oracle], we gathered metrics from 10gen [the MongoDB company]. The cost comparison graph shows that we can save ≈ 70% of the up-front cost and ≈ 64% of the ongoing cost in comparison with the RDBMS [Oracle].

Source: http://info.10gen.com/rs/10gen/images/10gen.TCO%20-%20MongoDB%20vs.%20Oracle.pdf

3.2 COMPARISON BETWEEN CURRENT CMS AND NOSQL IN CMS

As part of this case study / POC, and based on the research and the results obtained, we prepared a chart for a factor-by-factor comparison of the legacy CMS and NoSQL in CMS. NoSQL in CMS outperformed the existing CMS and can achieve our goal: a Real-Time Content Management System.

[Figure: operational cost comparison, RDBMS vs. MongoDB]

Cost                     RDBMS [k in $]   MongoDB [k in $]
Up Front Cost            820              166
Ongoing Cost 1st Year    287              106
Ongoing Cost 3rd Year    860              317


4. BENEFITS

The benefits that can be reaped by implementing the solution / new model are:

Better performance, with no additional caching mechanism required

Reduced operational / maintenance costs, as the NoSQL DB is open-source

Helps the CMS provide real-time rather than periodic content

Elastic DB Scaling-Out

Helps in a smooth transition when MongoDB steps into the shoes of the RDBMS

Suits an Agile development process

5. CONCLUSION

While the focus of the IT world is on Big Data, analytics and intelligent systems, and Big Data has made a remarkable breakthrough in real-time data processing, we must evolve to see the opportunities within the applications of existing customers and provide consultancy to them. It is crucial to understand the customer's business and the impact of Big Data on it, since Big Data technologies are soon going to sweep the market that deals with bulk content.

We should note that a capable NoSQL database is required to scale with present market requirements. Integrating MongoDB with analytics tools would increase the value further, so extending this analysis to other domains under the Big Data umbrella should bring additional business benefits.

We did an extensive study on a real product, migrating the content of a legacy Content Management System, which is already voluminous and tends to grow at an accelerating rate. MongoDB, an open-source NoSQL database maintained by 10gen, is a front-runner among Big Data technologies and already has a major footprint. After probing Hadoop, Cassandra, CouchDB, VoltDB, Hive and MongoDB, MongoDB came out as the best fit for our customer's business requirements. The results are very impressive. The performance improvement is substantial, and projections show that the investment will be recovered in not more than 2 years. The long-term cost benefits are considerable.

In line with the CSC business strategy to market Big Data consultancy and expertise, we explored one branch of it: the NoSQL database MongoDB. Reduced infrastructure cost, increased profitability, improved performance and high scalability are direct benefits that address the challenges posed by the accelerating growth of data. The massively parallel processing capabilities of NoSQL put information at customers' fingertips. The need of the hour is to give customers what they want rather than asking them to find something in what is available to sell. Big Data is there to make this happen.

The risks that stand in the way of a customer changing technology are reliability, scalability, and the investment in up-front cost and time to set up the environment. These risks can be turned into opportunities for any service-based company. To change customer perception of reliability and scalability, numerous case studies and POCs tested on real-time data are now readily available, which let customers experience Big Data results. As for investment concerns, MongoDB has almost zero up-front costs. Moreover, developing a reusable or customizable framework that readily sets up NoSQL environments for customers brings more business and hence revenue. This is another great opportunity: developing the framework is a one-time investment, but it becomes a product to sell and a source of further business.

Big Data is a buzzword today, and NoSQL databases are already captivating the market. To stay ahead of the race, we should seize the business opportunity before clients identify the need for change. High-performance open-source technologies such as Big Data and NoSQL clearly deserve attention.

