8KMiles Software Services, Inc

Amazon CloudSearch Comparison Report

TABLE OF CONTENTS

Smackdown
Introduction
Search features 1 - 1 comparison
Feature 1: Getting Started
Feature 2: Operations and Management
    2.1 Backup
    2.2 System upgrades and patch management
    2.3 Re-indexing
Feature 3: Monitoring
Feature 4: Schema, Data types, Dynamic Fields and Data Import/Export
    4.1 Schema management
    4.2 Dynamic fields
    4.3 Data types
    4.4 Data import & export
Feature 5: Search and Indexing features
    5.1 Analyzers, Tokenizers and Token filters
    5.2 Faceting
    5.3 Auto Suggestion
    5.4 Highlighting
Feature 6: Multilingual support
Feature 7: Protocol & API Support
    7.1 Request and Response formats
    7.2 External Integrations
    7.3 Protocols Support
Feature 8: High Availability
    8.1 Replication
    8.2 Failover
Feature 9: Scaling
Feature 10: Customization
Feature 11: More
    11.1 Client libraries
Feature 12: Cost
Conclusion



8K Miles 2007-2015 1-855-8KMILES (855-856-4537)

Smackdown

| Category | Feature | Apache Solr | Elasticsearch | Amazon CloudSearch |
|---|---|---|---|---|
| Admin Operations | Backup | Replication/Custom handler/Custom scripts | Snapshot API/Custom scripts | Fully-managed |
| Admin Operations | Patch Management | Manual/Automated via custom scripts | Manual/Automated via custom scripts | Fully-managed |
| Admin Operations | Re-indexing | Manual | Manual | Fully-managed; manual option available from management console |
| Admin Operations | Monitoring | If hosted on EC2, Amazon CloudWatch; SaaS monitoring tools like NewRelic, Stackdriver, Datadog | If hosted on EC2, Amazon CloudWatch; SaaS monitoring tools like NewRelic, Stackdriver, Datadog | CloudSearch default metrics |
| Admin Operations | Maintenance | External managed service | External managed service | Fully-managed |
| API | Client Library | Java, PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, JavaScript | Java, Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby | Amazon SDK |
| API | HTTP RESTful API | Yes | Yes | Yes |
| API | Request Format | XML, JSON, CSV | XML, JSON | XML, JSON |
| API | Response Format | XML, JSON, CSV | XML, JSON | XML, JSON |
| API | Third party Integrations | Available for Commercial and Open source | Available for Commercial and Open source | Amazon Web Services Integrations available |
| Search Functions | Schema | Schema and Schema-less | Schema and Schema-less | Schema |
| Search Functions | Dynamic fields support | Yes | Yes | Yes |
| Search Functions | Synonyms | Yes | Yes | Yes |
| Search Functions | Multiple indexes | Yes | Yes | No |
| Search Functions | Faceting | Yes | Yes | Yes |
| Search Functions | Rich documents support | Yes | Yes | No |
| Search Functions | Auto Suggest | Yes | Yes | Yes |
| Search Functions | Highlighting | Yes | Yes | Yes |
| Search Functions | Query parser | Standard, DisMax, Extended DisMax, Other parsers | Standard, query_string, DisMax, match, multi_match | Simple, structured, Lucene, or DisMax |
| Search Functions | Geosearch | Yes | Yes | Yes |
| Search Functions | Analyzers, Tokenizers and Token filters | Default/Custom | Default/Custom | Default |
| Search Functions | Fuzzy Logic | Yes | Yes | Yes |
| Search Functions | Did you mean | Default/Custom | Default/Custom | No |
| Search Functions | Stopwords | Yes | Yes | Yes |
| Search Functions | Customization | Yes | Yes | No |
| Advanced | Cluster management | ZooKeeper | in-built | Fully-managed |
| Advanced | Scaling | Vertical scaling/Horizontal scaling | Vertical scaling/Horizontal scaling | Fully-managed horizontal scaling |
| Advanced | Replication | Yes | Yes | Yes |
| Advanced | Sharding | Yes | Yes | Yes |
| Advanced | Failover | Yes, if set up in Cluster Replica mode | Yes, if set up in Cluster Replica mode | Fully-managed |
| Advanced | Fault tolerant | Yes, if set up in Cluster mode | Yes, if set up in Cluster mode | Fully-managed |
| Import and Export | Data import | Default import handlers, custom import handlers | Rivers modules, Logstash input plugins, custom programs | Batch upload |
| Import and Export | Data export | Default export handlers, custom export handlers | Snapshot API | Custom program |
| Others | Web Interface | Solr Admin | Sense | AWS Management Console |





Introduction

In today's world of vast and available information, a good search experience is central to a good user experience. Hence, delivering effective search tools has become the key goal of all software products, market places, e-commerce websites, and content management systems. Developers looking to deliver a premium search experience to their users should be aware of some broad trends:

1) Open source and platform-based search engines are replacing proprietary search engines because of better licensing models and community support.

2) The cloud delivery model is succeeding over the on-premise delivery model because of scalability, high availability and operating expense.

In light of the above trends, the choice of leading candidates for search technology boils down to three: Apache Solr, Elasticsearch, and Amazon CloudSearch. At 8KMiles, our clients often ask us how these three choices compare relative to each other. This report aims to make it easy for developers to pick the right technology for their application by presenting a comprehensive framework for evaluation of the three options. We have also applied our framework to top feature sets that are critical to any search workload. We then broke them down further into granular features and compared each of the three search engines.

In this report, we summarize our conclusions and present them in a smackdown-style summary card. We encourage our readers to run a more in-depth evaluation for their specific use cases.





Search features 1 - 1 comparison

This section discusses the search features in detail and how they are present in Apache Solr, Elasticsearch, and Amazon CloudSearch.

The table below illustrates the most influential features involved in the assessment of our search engines. These features are identified and grouped based on the various operations of a search application.

| Feature group | Features covered |
|---|---|
| Getting started | Server setup, search engine installation and configuration |
| Operations | Backup, patches, re-indexing, monitoring |
| Indexing, Search and Query | Schema management, field types, dynamic fields, data import/export, analyzers, did you mean, facets, auto complete, spatial |
| High Availability | Replica, Failover, Self-healing clusters |
| Scaling | Read scaling, write scaling, partitioning options, replication options |
| Protocol & API Support | Request and response formats and support, protocols supported, external integrations |
| Customization | Data field types, functions |
| Others | Supported programming languages, administration interface |
| Cost | Infrastructure cost, support cost, on-going management cost, licensing cost, talent cost |





Feature 1: Getting Started

‘Getting Started’ is the first step an engineer takes to understand the basics of the major features of a product. In this section, we will see how the search engines discussed above facilitate ‘Getting Started’.

Apache Solr and Elasticsearch

Apache Solr and Elasticsearch require end users to spend quality time understanding and setting up the respective search engines. The “Getting Started” manuals of Apache Solr and Elasticsearch assume that the end user has at least minimal knowledge of search engines, their related functions and architecture.

The installation processes for Apache Solr and Elasticsearch include tasks such as:

• Server setup
• Search engine download
• Dependent software installations
• Setup of environmental requirements
• Understanding of basic server commands
• Administrative access

Apache Solr and Elasticsearch are shipped with test examples which allow users to do “warm up” search and indexing operations. While the default test schema in Apache Solr is sufficient for the user to get started, Elasticsearch’s schema-less design allows the user to send document requests without defining any schema.

Amazon CloudSearch

If you already have an Amazon Web Services (AWS) account set up, you can create a CloudSearch domain in a few clicks using the AWS Management Console. The CloudSearch management console guides administrators through a step-by-step process, requesting user input for:

• Instance type
• High availability options
• Replication options
• Schema definitions
• Access policies

It is important to note that Amazon CloudSearch does not require all of this information up front. The CloudSearch domain name and engine type are adequate to create a CloudSearch instance.

The other configurations such as schema, instance type, access policies, and high availability options can be modified at a later time based on the application requirements.

The users are abstracted from hardware provisioning, software installation, configuration, cluster setup and other administration activities.

Users receive two regional endpoints: a search endpoint and a document endpoint. Both endpoints can be accessed using the RESTful API or the AWS Software Development Kit (SDK) with Identity and Access Management (IAM) credentials.





Another important note is that CloudSearch default access policies for the document service and search service endpoints are configured to block all IP addresses. Developers should configure the authorized IP addresses to access the CloudSearch’s endpoints.
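As an illustration of the point above, the following Python sketch builds an IP-restricted access policy document. The statement layout follows the general IAM policy format; the action names `cloudsearch:search` and `cloudsearch:document`, the helper name, and the CIDR value are assumptions made for this example and should be checked against the current CloudSearch documentation before use.

```python
import json


def build_ip_access_policy(allowed_cidrs):
    """Return an access policy (JSON string) that allows the search and
    document services only from the given CIDR blocks.

    Assumption: the policy follows the IAM policy document layout and the
    action names below match the CloudSearch service actions.
    """
    if not allowed_cidrs:
        raise ValueError("at least one CIDR block is required")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": ["cloudsearch:search", "cloudsearch:document"],
                "Condition": {
                    "IpAddress": {"aws:SourceIp": list(allowed_cidrs)}
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Hypothetical office IP address (documentation range).
print(build_ip_access_policy(["192.0.2.10/32"]))
```

The resulting JSON can be pasted into the access policy editor of the management console or passed to whichever API call the team uses to update the domain's access policies.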

CloudSearch also provides a sample dataset, the IMDB movies, which can be used to test drive the CloudSearch service. The CloudSearch developer documentation walks through the steps to launch a test domain using the sample IMDB dataset.
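Once a test domain is running, a first query against the search endpoint might be composed as in the sketch below. The endpoint host name is hypothetical, and the `2013-01-01` API version path and the `q`, `size`, and `return` parameters are stated here as assumptions about the search API rather than taken from this report.

```python
from urllib.parse import urlencode

API_VERSION = "2013-01-01"  # assumed CloudSearch API version path segment


def build_search_url(search_endpoint, query, size=10, return_fields=None):
    """Compose a search request URL for a CloudSearch search endpoint."""
    params = {"q": query, "size": size}
    if return_fields:
        # Comma-separated list of fields to include in each hit.
        params["return"] = ",".join(return_fields)
    return f"https://{search_endpoint}/{API_VERSION}/search?{urlencode(params)}"


# Hypothetical endpoint of a domain loaded with the sample movie data.
endpoint = "search-movies-abc123.us-east-1.cloudsearch.amazonaws.com"
print(build_search_url(endpoint, "star wars", size=5, return_fields=["title", "year"]))
```

The URL can then be fetched with any HTTP client from an IP address that the domain's access policy allows.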

Conclusion

Apache Solr and Elasticsearch expect users to have basic practical knowledge of the search engine and also complete a few significant tasks to accomplish the first step ‘Getting Started’.

In Amazon CloudSearch, the ‘Getting Started’ activities are easier, and end users can have a CloudSearch instance up and running with a few clicks in a few minutes.





Feature 2: Operations and Management

In this section, we’ll discuss some important administrative operations such as:

• Index backup
• Patch management
• Re-indexing and recovery

2.1 Backup

Data backup is a routine operation, carried out at defined intervals. It is an essential task for recovering data quickly from failures such as a hardware crash, data corruption or related events.

Apache Solr

Apache Solr provides a feature called ‘ReplicationHandler’. The main objective of ReplicationHandler is to replicate index data to slave servers, but it can also be used to maintain a backup copy. A replication slave node can be configured against the Solr master and dedicated solely to serving as a backup server, with no other operations taking place on it.

Solr’s built-in support for replication allows ReplicationHandler to be called as an HTTP API. The backup API has optional parameters such as the location, the snapshot name, and the number of backups to keep. The backup API can only store snapshots on a local disk; any other storage option requires customization.

If you are required to store the backups in a different location, such as Amazon’s Simple Storage Service (S3), a local storage server, or a remote data center, ReplicationHandler has to be customized. The Solr core libraries are open source, which allows for any such customization.
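A backup request to ReplicationHandler can be sketched in Python as follows. The host, port, core name, and backup path are hypothetical; the `command=backup`, `location`, `name`, and `numberToKeep` request parameters are the ones we understand ReplicationHandler to accept, but they should be verified against the Solr version in use.

```python
from urllib.parse import urlencode


def build_backup_url(base_url, core, location=None, name=None, number_to_keep=None):
    """Compose a ReplicationHandler backup request URL for one Solr core.

    Only the optional parameters that are supplied are added to the query.
    """
    params = {"command": "backup"}
    if location:
        params["location"] = location  # directory on the Solr server's local disk
    if name:
        params["name"] = name  # snapshot name
    if number_to_keep is not None:
        params["numberToKeep"] = number_to_keep  # older backups beyond this are removed
    return f"{base_url.rstrip('/')}/{core}/replication?{urlencode(params)}"


# Hypothetical local Solr instance and core.
print(build_backup_url("http://localhost:8983/solr", "products",
                       location="/var/backups/solr", name="nightly", number_to_keep=7))
```

A scheduler such as cron can issue this request with any HTTP client; copying the resulting snapshot directory to S3 or another store would be a separate step in the same script.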

Elasticsearch

Elasticsearch provides an advanced option called ‘Snapshot API’ for backing up the entire cluster. The API will back up the current cluster state and related data and save it to a shared repository.

The initial backup is a complete copy of the data. Subsequent backups snapshot only the delta between the current data and the previous snapshots. Elasticsearch requires end users to create a repository, whose type can be chosen from:

• Shared file system
• Amazon S3
• Hadoop Distributed File System (HDFS)
• Azure Cloud

This integration gives a greater flexibility for developers to manage their backups.
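The two Snapshot API calls involved, registering a repository and then taking a snapshot into it, can be sketched as request tuples in Python. The repository name, snapshot name, and mount path are hypothetical; the `fs` repository type with `location` and `compress` settings and the `wait_for_completion` flag reflect our understanding of the Snapshot API and are labeled here as assumptions.

```python
import json


def register_fs_repository(repo_name, location):
    """Return (method, path, body) registering a shared-file-system repository."""
    body = {"type": "fs", "settings": {"location": location, "compress": True}}
    return "PUT", f"/_snapshot/{repo_name}", json.dumps(body)


def create_snapshot(repo_name, snapshot_name, indices=None):
    """Return (method, path, body) for a snapshot of selected or all indices."""
    body = {}
    if indices:
        body["indices"] = ",".join(indices)  # omit to snapshot the whole cluster
    path = f"/_snapshot/{repo_name}/{snapshot_name}?wait_for_completion=true"
    return "PUT", path, json.dumps(body)


# Hypothetical repository on a mount shared by all nodes.
print(register_fs_repository("nightly_backups", "/mount/backups/es"))
print(create_snapshot("nightly_backups", "snapshot_2015_07_01", ["products"]))
```

Each tuple can be sent to any node of the cluster with an HTTP client; the functions only build the requests, so they can be unit-tested without a running cluster.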

Backup Process

The backup options present in Apache Solr and Elasticsearch can be executed manually or can be automated. To automate the entire backup process, one has to write custom scripts that call the relevant API or handler. Most engineering companies follow this model of writing custom scripts for backup automation.





Backup also involves maintaining the latest snapshots and archives. Key management tasks include snapshot retrieval, archival, and expiration.
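The archival and expiration decisions described above are typically the core of such a custom script. The Python sketch below classifies snapshots by age; the 7-day and 30-day windows and the snapshot names are hypothetical values chosen for illustration, not recommendations from this report.

```python
from datetime import datetime, timedelta


def plan_retention(snapshots, now, keep_days=7, archive_days=30):
    """Classify snapshots into keep / archive / expire buckets by age.

    snapshots: dict mapping snapshot name -> creation datetime.
    Returns a dict of lists, each ordered newest first.
    """
    plan = {"keep": [], "archive": [], "expire": []}
    newest_first = sorted(snapshots.items(), key=lambda item: item[1], reverse=True)
    for name, created in newest_first:
        age = now - created
        if age <= timedelta(days=keep_days):
            plan["keep"].append(name)      # stays on fast local storage
        elif age <= timedelta(days=archive_days):
            plan["archive"].append(name)   # move to cheaper storage
        else:
            plan["expire"].append(name)    # safe to delete
    return plan


now = datetime(2015, 7, 31)
snapshots = {
    "snap-0730": datetime(2015, 7, 30),
    "snap-0710": datetime(2015, 7, 10),
    "snap-0601": datetime(2015, 6, 1),
}
print(plan_retention(snapshots, now))
```

The script that calls the backup API would then act on each bucket, for example by copying archived snapshots to a remote store and deleting expired ones.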

In an alternate approach, if the Solr or Elasticsearch cluster is set up in cluster replication mode, any one of the slave nodes can be designated as the backup server. Automating backups from that slave node still requires a script written by the developer.

Amazon CloudSearch

Amazon CloudSearch inherently takes care of the data that is stored and indexed, leaving a lighter load for engineering and operations teams. Amazon CloudSearch self-manages all data backup and its management. The backups are maintained internally, behind the scenes. In the event of a hardware failure or other problem, Amazon CloudSearch restores the backup automatically, and this process is not exposed to end users.

Conclusion

The default option in Apache Solr is only to back up to a ‘local disk’ store; it does not offer other storage options as Elasticsearch does. However, engineers can write their own handlers to manage the backup process.

Elasticsearch is packaged with plugins for multiple storage options, which gives engineers an added advantage.

Amazon CloudSearch relieves users of the intricacies of the backup and its management process. The IT operations or managed service team has a lesser role in the backup process, as all operations are managed behind the scenes by CloudSearch.

2.2 System upgrades and patch management

Patch management and system upgrades like OS patches and fixes are inevitable in operations and administration. For any system, there is always a version upgrade, maintenance on the OS, or a hardware or software change.

Rolling Restarts

Apache Solr and Elasticsearch both recommend using ‘Rolling Restarts’ for patch management, operating system upgrades and other fixes. Rolling Restarts involve stopping and starting each node in the cluster sequentially. This allows the cluster to continue serving search requests while each node is updated with the latest code, fixes, or patches. Rolling Restarts are adopted when high availability is mandatory and downtime is not allowable.

Sometimes, Rolling Restarts require some intelligent decision making based on cluster topology. If a cluster consists of shards and replicas, the order in which the nodes are restarted has to be chosen deliberately.





Apache Solr

Apache’s ZooKeeper service runs as a stand-alone application and is not upgraded automatically when Apache Solr is upgraded; it should be upgraded manually at the same time.

Elasticsearch

Elasticsearch recommends disabling ‘shard allocation’ during a node restart. This tells Elasticsearch not to re-balance the missing shards; otherwise the cluster starts re-allocating shards as soon as it detects the node loss.
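A rolling restart that toggles shard allocation around each node can be sketched as below. The cluster setting name `cluster.routing.allocation.enable` and its `all`/`none` values reflect our understanding of the cluster settings API; the three callables are hypothetical hooks that a team would implement with its own HTTP client and deployment tooling.

```python
import json


def allocation_settings(enabled):
    """Body for the cluster settings API that turns shard allocation on or off."""
    value = "all" if enabled else "none"
    return json.dumps({"transient": {"cluster.routing.allocation.enable": value}})


def rolling_restart(nodes, put_cluster_settings, restart_node, wait_for_green):
    """Restart nodes one at a time, pausing shard allocation around each restart.

    put_cluster_settings(body), restart_node(name) and wait_for_green() are
    caller-supplied hooks (hypothetical names, not part of any library).
    """
    for node in nodes:
        put_cluster_settings(allocation_settings(False))  # stop re-balancing
        restart_node(node)                                # patch and restart
        put_cluster_settings(allocation_settings(True))   # resume allocation
        wait_for_green()                                  # recover before the next node


# Dry run that only records the order of operations.
log = []
rolling_restart(
    ["node-1", "node-2"],
    put_cluster_settings=lambda body: log.append(body),
    restart_node=lambda name: log.append(f"restart {name}"),
    wait_for_green=lambda: log.append("wait for green"),
)
print("\n".join(log))
```

Injecting the hooks keeps the ordering logic testable without a cluster; the node order passed in is where topology-aware decisions about shards and replicas would be encoded.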

Amazon CloudSearch

Amazon CloudSearch internally manages all patches and upgrades related to its operating system. Because it is a managed search service, when new features are rolled out the upgrades are self-managed and immediately available to all customers without any action on their part.

Conclusion

Patch management in Apache Solr and Elasticsearch has to be carried out manually using the Rolling Restarts approach. Customers automate this process by developing custom scripts to do system upgrades and patch management.

Patch management in Amazon CloudSearch is transparent to the customers. The upgrades and patches applied to Amazon CloudSearch are regularly announced in the ‘What’s New’ section of the CloudSearch documentation.

2.3 Re-indexing

Any business application changes over its lifetime, as the business running it changes. The business change has a direct effect on the data structure of the system’s persistent information store. The search engine, which is seen as a secondary or alternate store, will eventually have to change its data structure when required. Any changes to the search engine data structure will require a re-indexing of the data.

Example: A product company started collecting ‘feedback’ from their customers for a given product. The text string from the new field ‘feedback’ needs to be added into the search schema, and may require re-indexing.

If the search data is not re-indexed after a structural change, the data that has already been indexed could become inaccurate and the search results may behave differently than expected.

Re-indexing becomes a necessary process over a period of time as the application grows. It is also identified as a common and mandatory admin operation executed periodically based on application requirements.





Apache Solr

Apache Solr recommends re-indexing if there is a change in your schema definitions. The options below are widely used by the Apache Solr user community.

• Create a fresh index with new settings. Copy all of the documents from the old index to the new one.
• Configure the Data import handler with ‘SolrEntityProcessor’. The SolrEntityProcessor imports data from Solr instances or cores for a given search query. The SolrEntityProcessor has a limitation where it can only copy fields that are stored in the source index.
• Configure the Data import handler with the source or origination data source. Push the data freshly to the new index.

Elasticsearch

Elasticsearch proposes several approaches for data re-indexing. The following approaches are usually combined:

• Use Elasticsearch’s Scan and Scroll and Bulk APIs to fetch and push data into the new index.
• Update or create an index alias with the old index name and delete the old index.
• Use open source Elasticsearch plugins that can extract all data from the cluster and re-index the data. Most of these plugins internally use the Scan and Scroll and Bulk APIs (as mentioned above), which reduces development time.
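The first two approaches can be sketched as two request bodies in Python: a Bulk API payload built from a page of scroll hits, and an alias update that switches readers to the new index. The index, alias, and type names are hypothetical, and the `_type` field in the action line reflects the Elasticsearch versions current when this report was written; fetching the hits and sending the bodies is left to the caller's HTTP client.

```python
import json


def bulk_index_body(hits, target_index):
    """Turn one page of scroll hits into a newline-delimited Bulk API body."""
    lines = []
    for hit in hits:
        action = {"index": {"_index": target_index, "_type": hit["_type"], "_id": hit["_id"]}}
        lines.append(json.dumps(action))          # action line
        lines.append(json.dumps(hit["_source"]))  # document line
    # The Bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


def alias_swap_body(alias, old_index, new_index):
    """Body for the aliases API that moves an alias to the new index in one call."""
    return json.dumps({"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]})


# One hypothetical hit as it would come back from a scroll request.
page = [{"_type": "product", "_id": "1", "_source": {"name": "lamp", "feedback": "great"}}]
print(bulk_index_body(page, "products_v2"), end="")
print(alias_swap_body("products", "products_v1", "products_v2"))
```

Because applications query the alias rather than a concrete index name, the swap is the only step visible to them; the old index can be deleted once the alias points at the new one.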

Amazon CloudSearch

Amazon CloudSearch recommends rebuilding the index when index fields are added or modified. Amazon CloudSearch expects an indexing request to be issued after a configuration change. Whenever there is a configuration change, the CloudSearch domain status changes to ‘NEEDS INDEXING’. During the index rebuilding, the domain's status changes to ‘PROCESSING’, and upon completion the status is changed to ‘ACTIVE’.

Amazon CloudSearch can continue to serve search requests during the indexing process, but the configuration changes are not reflected in the search results until it completes. The time the re-indexing process takes is directly proportional to the volume of data in your index.

Amazon CloudSearch also allows document uploads while indexing is in progress, but the updates can become slower if there is a large volume of document updates. In such a scenario, the uploads or updates can be throttled or paused until the Amazon CloudSearch domain returns to an ‘ACTIVE’ state.

Customers can initiate re-indexing by issuing the index-documents command using RESTful API, AWS command line interface (CLI), or AWS SDK. They can also initiate re-indexing from the CloudSearch management console.
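An upload script can use the domain status to decide whether to pause. The sketch below maps a domain status record to the three states named above; the `RequiresIndexDocuments` and `Processing` field names are an assumption about what the domain description API returns and should be confirmed against the SDK in use.

```python
def domain_state(status):
    """Map a domain status record to the console-style state name.

    Assumption: `status` is a dict with boolean `Processing` and
    `RequiresIndexDocuments` fields, as we believe the describe-domains
    call returns.
    """
    if status.get("Processing"):
        return "PROCESSING"
    if status.get("RequiresIndexDocuments"):
        return "NEEDS INDEXING"
    return "ACTIVE"


def should_pause_uploads(status):
    """Pause bulk uploads while the index is being rebuilt."""
    return domain_state(status) == "PROCESSING"


# Hypothetical status record right after a schema change.
print(domain_state({"RequiresIndexDocuments": True, "Processing": False}))
```

A batch uploader would poll the status between batches and sleep while `should_pause_uploads` returns true.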

Conclusion

Re-indexing in Apache Solr and Elasticsearch is mostly a manual process because it requires a decision that factors in data size, current request volume, and offline hours.

Amazon CloudSearch manages the re-indexing process inherently and leaves much less to administrators. The re-indexing time period is abstracted and not disclosed to administrators, but Amazon CloudSearch runs the re-indexing process based on the best practices mentioned above.





Feature 3: Monitoring

Monitoring server health is an essential daily task for operations and administration. In this section, we will describe the built-in monitoring capabilities of all three search engines.

Apache Solr

Apache Solr has a built-in web console for monitoring indexes, performance metrics, information about index distribution and replication, and information on all threads running in the Java Virtual Machine (JVM) at the time.

For more detailed monitoring, Java Management Extensions (JMX) can be configured with Solr to expose runtime statistics as MBeans. The Apache Solr JVM container has built-in instrumentation that enables monitoring using JMX.

Elasticsearch

Elasticsearch has a management and monitoring plugin called ‘Marvel’. Marvel has an interactive console called ‘Sense’ that helps users to interact easily with Elasticsearch nodes. Elasticsearch has a diverse set of built-in APIs that emit heap usage, garbage collection stats, file descriptors, and more. Marvel is tightly integrated with these APIs: it periodically polls them, collects statistics and stores the data back in Elasticsearch. Marvel’s interactive graph report dashboard allows administrators to query and aggregate historical stats data.

Amazon CloudSearch

Amazon CloudSearch recently introduced Amazon CloudWatch integration. The Amazon CloudSearch metrics can be used to make scaling decisions, troubleshoot issues, and manage clusters.

Amazon CloudSearch publishes four metrics into Amazon CloudWatch: SuccessfulRequests, SearchableDocuments, IndexUtilization, and Partitions (the partition count).

The CloudWatch metrics can be configured to set alarms, which can notify administrators through Amazon Simple Notification Service.
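An alarm definition for one of these metrics can be sketched as a parameter dictionary in Python. The keys are the parameter names of the CloudWatch PutMetricAlarm operation; the `AWS/CloudSearch` namespace, the `DomainName` and `ClientId` dimensions, the threshold, and the topic ARN are assumptions or hypothetical values for this example.

```python
def searchable_documents_alarm(domain_name, client_id, min_documents, sns_topic_arn):
    """Parameters for an alarm that fires when the number of searchable
    documents drops below a floor, notifying an SNS topic.

    Assumption: CloudSearch metrics live in the AWS/CloudSearch namespace
    and are keyed by DomainName and ClientId (the AWS account ID).
    """
    return {
        "AlarmName": f"{domain_name}-searchable-documents-low",
        "Namespace": "AWS/CloudSearch",
        "MetricName": "SearchableDocuments",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": client_id},
        ],
        "Statistic": "Minimum",
        "Period": 300,            # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": float(min_documents),
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical domain, account ID and topic.
params = searchable_documents_alarm(
    "movies", "123456789012", 1000,
    "arn:aws:sns:us-east-1:123456789012:search-alerts",
)
print(params["AlarmName"])
```

The dictionary can be passed as keyword arguments to the alarm-creation call of whichever AWS SDK the team uses.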

Conclusion

Apache Solr and Elasticsearch have integrations with in-built and external plugins. They can also support SaaS based monitoring plugins or custom plugins developed by the customers.

CloudSearch’s integration with CloudWatch shares some good metrics and it is expected to offer newer ones in the future.





Feature 4: Schema, Data types, Dynamic Fields and Data Import/Export

4.1 Schema management

Schema: A schema is a definition of fields and field types used by the search system to organize data within the document files it indexes.

Schema definition is the foremost task in the search data structure design. It is important that the schema definition caters to all business requirements and is designed to suit the application.

Apache Solr and Elasticsearch

Both Elasticsearch and Apache Solr can run the search application in ‘Schema-less’ and ‘Schema’ mode. Schema mode is suitable for application development and for any production environment.

Schema-less mode is a very good option for newcomers getting started. After server setup, users can start the application without a schema structure and have the field definitions created during indexing. However, to have a production-grade application running, a proper schema definition becomes mandatory.

Amazon CloudSearch

Amazon CloudSearch also allows users to set up search domains without any index fields. The index fields can be added anytime, but before any valid document indexing or any search request.

In addition, the CloudSearch management console integrates with Amazon Web Services such as S3 and DynamoDB, and can also read from a local machine, so that a schema can be imported directly into a CloudSearch domain from any of these sources. After the schema import, CloudSearch allows the user to edit the fields or add new fields. This is a convenient feature when a pre-built schema is to be migrated to a CloudSearch domain.

Conclusion

Apache Solr and Elasticsearch can be started without any schema, but a schema-less setup is not suitable for production use. Amazon CloudSearch allows creating domains without any index fields, but the schema must be created before any indexing or search requests can be served.

The general best practice in schema management is to rehearse and design the schema suiting application requirements before finalizing the search structure. The underlying schema concept of all three search engines is consistent with this practice.

4.2 Dynamic fields

Dynamic fields are like regular field definitions, except that they support wildcard matching. They allow the indexing of documents without knowing the type of fields they contain. A dynamic field is defined using a wildcard pattern (*) as the first, last, or only character of the field name. All undefined fields go through the dynamic field rules, which check for a pattern match and apply the indexing options configured for the matching dynamic field.
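The matching step can be illustrated with a small Python sketch. The rule names and field types are hypothetical, and the tie-break used here, preferring the longest matching pattern, is an assumption made for the example; each engine documents its own precedence rules.

```python
from fnmatch import fnmatchcase


def resolve_dynamic_field(field_name, rules):
    """Return (pattern, field_type) for the rule that matches field_name.

    rules: dict mapping a wildcard pattern such as "*_i" to a field type.
    Assumption: when several patterns match, the longest one wins.
    Returns None when no rule matches.
    """
    matches = [pattern for pattern in rules if fnmatchcase(field_name, pattern)]
    if not matches:
        return None
    best = max(matches, key=len)
    return best, rules[best]


# Hypothetical dynamic field rules: suffix, prefix, and catch-all patterns.
rules = {"*_i": "int", "attr_*": "literal", "*": "text"}
print(resolve_dynamic_field("price_i", rules))
print(resolve_dynamic_field("attr_color", rules))
print(resolve_dynamic_field("description", rules))
```

With a catch-all `*` rule present, every undeclared field is indexed with some type; leaving it out makes unknown fields an explicit error for the caller to handle.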





Apache Solr and Elasticsearch

Apache Solr and Elasticsearch allow end users to set up dynamic fields and rules using the RESTful API and schema configuration.

Amazon CloudSearch

In Amazon CloudSearch, dynamic fields can be configured using the indexing options in the CloudSearch management console, or using the CloudSearch RESTful API or the AWS SDK.

Conclusion

If you are unsure about the schema structure or exact field names, dynamic fields come in handy. Amazon CloudSearch, Apache Solr, and Elasticsearch all allow the flexibility to configure dynamic fields. This helps the application development team to describe any omitted field definitions in the schema document.

4.3 Data types

There are a variety of data types supported by these search engines. The table below illustrates the data field types supported by each search engine.

| Data type | Solr | Elasticsearch | CloudSearch |
|---|---|---|---|
| String / Text | Yes | Yes | Yes |
| Number types | integer, double, float, long | byte, short, integer, long, float, double | integer, double |
| Date types | Yes | Yes | Yes |
| Enum fields | Yes | Yes | No |
| Currency | Yes | No | No |
| Geo location / Latitude – Longitude | Yes | Yes | Yes |
| Boolean | Yes | Yes | No |
| Array types | Yes | Yes | Yes |

Conclusion

The most important data types like string, date, and number types are supported by all three search engines. Geo location data type, which is now regularly used by modern applications, is also supported by all search engines.

Engineers and developers may use an alternate data type if a particular data type is not supported by their chosen search engine. For example, the ‘currency’ data type supported in Solr is not available in Elasticsearch and CloudSearch. In such cases, engineers use a number type as an alternative data type for ‘Currency’.
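One common way to apply this workaround is to index the amount as an integer count of minor units (for example cents), which avoids floating-point rounding. The Python sketch below is only one possible convention; the function names and the two-decimal default are assumptions for the example.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_minor_units(amount, exponent=2):
    """Convert a decimal amount (given as a string) to integer minor units."""
    quantum = Decimal(1).scaleb(-exponent)  # 0.01 when exponent is 2
    rounded = Decimal(amount).quantize(quantum, rounding=ROUND_HALF_UP)
    return int(rounded * 10 ** exponent)


def from_minor_units(units, exponent=2):
    """Convert integer minor units back to a Decimal amount for display."""
    return Decimal(units) / (10 ** exponent)


# The integer value is what would be sent to an integer index field.
print(to_minor_units("19.99"))
print(from_minor_units(1999))
```

The currency code itself would be stored in a separate string field, since the integer field carries only the amount.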

4.4 Data import & export

The most important task in search application development is data migration from the origination source to the search engine. The origination data can come from a data source such as a database, a file system, or another persistent store. To set up a search data set, the full data set must be migrated or imported from its origin to the search engine.

Likewise, extracting data from a search engine and exporting it to a different destination is also a crucial task, though it is executed only occasionally.

Apache Solr

Apache Solr has a built-in handler called the Data Import Handler (DIH). The DIH provides a tool for migrating and/or importing data from the origin store. The DIH can index data from data sources such as:

• Relational Database Management System (RDBMS)
• Email
• HTTP URL end point
• Feeds like RSS and ATOM
• Structured XML files

The DIH has more advanced features like Apache Tika (https://tika.apache.org/) integration, delta import, and transformers to quickly migrate the data.

The Apache Solr export handler can export the query result data in JavaScript Object Notation (JSON) or comma-separated values (CSV) format. The export query expects sort and filter query parameters and returns only the stored fields. Users also have the option of developing a custom export handler and incorporating it with the Solr core libraries.

Elasticsearch

Elasticsearch ‘Rivers’ is an elegant pluggable service which runs inside the Elasticsearch cluster. This service can be configured to pull or push the data that is indexed into the cluster. Some of the popular Elasticsearch Rivers modules are CouchDB, Dropbox, DynamoDB, FileSystem, Java Database Connectivity (JDBC), Java Message Service (JMS), MongoDB, Neo4j, Redis, Solr, Twitter, and Wikipedia.

However, ‘Rivers’ will be deprecated in a newer release of Elasticsearch, which recommends using the official client libraries built for popular programming languages instead. Alternatively, a Logstash input plugin can be used to ship data into Elasticsearch.

For data export, an Elasticsearch snapshot can copy individual indices or an entire cluster into a remote repository. This is discussed in detail in the section ‘Operations and Management - Backup’.

Amazon CloudSearch

Amazon CloudSearch recommends uploading documents to a CloudSearch domain in batches. A batch is a collection of add, update, and delete operations described in JSON or XML format.

https://tika.apache.org/

Amazon CloudSearch limits a single batch upload to 5 MB, but allows batches to be uploaded in parallel to shorten a full data upload. The number of parallel uploads a domain can absorb depends on the CloudSearch instance type: larger instance types have a higher upload capacity, while smaller instance types have a lower one. Batch upload programs should therefore throttle their uploads according to the instance capacity.
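The batching recommendation can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the function name `make_batches` and the sample documents are hypothetical, and the operation shape (`type`, `id`, `fields`) follows the JSON batch format as we understand it. The sketch only groups operations so that each encoded batch stays within the 5 MB limit mentioned above; sending the batches and throttling parallel uploads are left to the caller.

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # 5 MB per-batch limit stated in the text


def make_batches(operations, max_bytes=MAX_BATCH_BYTES):
    """Group operations into JSON arrays whose encoded size stays within max_bytes."""
    batches, current, size = [], [], 2  # 2 bytes for the enclosing "[]"
    for op in operations:
        encoded = json.dumps(op)  # ASCII output, so characters == bytes
        if len(encoded) + 2 > max_bytes:
            raise ValueError("a single operation exceeds the batch limit")
        # +1 accounts for the comma that separates array elements
        extra = len(encoded) + (1 if current else 0)
        if size + extra > max_bytes:
            batches.append("[" + ",".join(current) + "]")
            current, size = [], 2
            extra = len(encoded)
        current.append(encoded)
        size += extra
    if current:
        batches.append("[" + ",".join(current) + "]")
    return batches


# Hypothetical sample data: ten small "add" operations, split with a tiny limit
sample_ops = [
    {"type": "add", "id": str(i), "fields": {"title": "x" * 50}}
    for i in range(10)
]
sample_batches = make_batches(sample_ops, max_bytes=300)
```

Each returned string is one upload payload. A real uploader would cap the number of concurrent uploads according to the capacity of the domain's instance type.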

Conclusion

Apache Solr has good handlers to export and import the data. In any case, if the options present are not viable, Apache Solr allows one to develop a new custom handler or customize an existing handler that can be used for data import and export.

Elasticsearch has integration with popular data sources in the form of ‘River’ modules or plugins. However, the future versions of Elasticsearch strongly recommend using Logstash input plugins or developing and contributing new Logstash input, as customization of a plugin is allowed in Elasticsearch.

Amazon CloudSearch does not have elaborate options like the other two search engines. However, by combining custom programs with the bulk upload recommendations for Amazon CloudSearch, customers can successfully migrate data into CloudSearch.





Feature 5: Search and Indexing features

In this section, we evaluate the ‘Search and Indexing’ features present in the three search engines. This is a very important feature set, as these features are widely used by search application engineers.

5.1 Analyzers, Tokenizers and Token filters

Generally speaking, a search engine prepares text strings for indexing and searching using analyzers, tokenizers, and filters. These components are provided as libraries that are configured for indexing and searching the data, and most of the time they are chained in a sequential series.

• During indexing and querying, the analyzer examines the field text and tokenizes each block of text into individual terms. Each token is a sub-sequence of the characters in the text.

• The token filter processes each token in the stream sequentially and applies its filter functionality.
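The analyzer chain described above can be illustrated with a small, engine-independent sketch. It is not how Solr, Elasticsearch, or CloudSearch implement analysis; the function names and the tiny stopword list are assumptions chosen for illustration. It only shows the sequence: a tokenizer produces a token stream, and each filter transforms that stream in order.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "and"}  # illustrative list only


def tokenize(text):
    """Tokenizer: split the field text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter: normalize every token to lower case."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Token filter: drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters=(lowercase_filter, stop_filter)):
    """Run the tokenizer, then pass the token stream through each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze("The Quick Brown Fox and the Hound")
```

Passing a different `filters` tuple at query time than at index time mirrors the option of using different chains for the two operations.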

Apache Solr and Elasticsearch

Apache Solr and Elasticsearch have extensive built-in libraries of analyzers, tokenizers, and token filters. These libraries are packaged with the search engine distribution and can be configured for indexing and searching.

Although the analyzers can be configured for indexing and querying, the same series of libraries doesn’t need to be used for both operations. The indexing and searching operations can be configured to have different tokenizers and filters, as their goals can be different.

Search engine | Tokenizers | Filters
Apache Solr | Standard, Classic, Keyword, Letter, Lower Case, N-Gram, Edge N-Gram, ICU, Path Hierarchy, Regular Expression Pattern, UAX29 URL Email, White Space | ASCII Folding, Beider-Morse, Classic, Common Grams, Collation Key, Daitch-Mokotoff Soundex, Double Metaphone, Edge N-Gram, English Minimal Stem, Hunspell Stem, Hyphenated Words, ICU Folding, ICU Normalizer 2, ICU Transform, Keep Words, KStem, Length, Lower Case, Managed Stop, Managed Synonym, N-Gram, Numeric Payload Token, Pattern Replace, Phonetic, Porter Stem, Remove Duplicates Token, Reversed Wildcard, Shingle, Snowball Porter, Stemmer, Standard, Stop, Suggest Stop, Synonym, Token Offset Payload, Trim, Type As Payload, Type Token, Word Delimiter
Elasticsearch | Standard, Edge NGram, Keyword, Letter, Lowercase, NGram, Whitespace, Pattern, UAX Email URL, Path Hierarchy, Classic, Thai | Standard Token, ASCII Folding Token, Length Token, Lowercase Token, Uppercase Token, NGram Token, Edge NGram Token, Porter Stem Token, Shingle Token, Stop Token, Word Delimiter Token, Stemmer Token, Stemmer Override Token, Keyword Marker Token, Keyword Repeat Token, KStem Token, Snowball Token, Phonetic Token, Synonym Token, Compound Word Token, Reverse Token, Elision Token, Truncate Token, Unique Token, Pattern Capture Token, Pattern Replace Token, Trim Token, Limit Token Count Token, Hunspell Token, Common Grams Token, Normalization Token, CJK Width Token, CJK Bigram Token, Delimited Payload Token, Keep Words Token, Keep Types Token, Classic Token, Apostrophe Token

Amazon CloudSearch

An Amazon CloudSearch analysis scheme configuration is used for analyzing text data during indexing. The analysis schemes control:

• Text field content processing
• Stemming
• Inclusion of stopwords and synonyms
• Tokenization (Japanese language)
• Bigrams (Chinese, Japanese, Korean languages)

The following analysis options are applied when text fields are configured with an analysis scheme:

1. Algorithmic stemming: The level of algorithmic stemming (minimal, light, or full) to perform. The stemming levels vary depending on the analysis scheme language.
2. Stemming dictionary: A dictionary to override the results of the algorithmic stemming.
3. Japanese Tokenization Dictionary: A dictionary which specifies how particular characters should be grouped into words (only for Japanese language).
4. Stopwords: A set of terms that should be ignored both during indexing and at search.
5. Synonyms: A dictionary of words that have the same meaning in the text data.

Before processing the analysis scheme, Amazon CloudSearch tokenizes and normalizes the text data. During tokenization, the text data is split into multiple tokens; this is common behavior in all search engine text processing. During normalization, upper case characters are converted to lower case, and more formatting is applied.

After the tokenization and normalization processes are completed, stemming, stopwords, and synonyms are applied.
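As an illustration of how these analysis options fit together, the sketch below assembles an analysis scheme description as a plain Python dictionary. The helper `build_analysis_scheme` is hypothetical. The key names (`AnalysisSchemeName`, `AnalysisSchemeLanguage`, `AnalysisOptions`, and the option names inside it) follow our understanding of the CloudSearch configuration API, where dictionaries are passed as JSON-encoded strings; verify them against the current API reference before use. Nothing is sent to AWS here.

```python
import json


def build_analysis_scheme(name, language, stemming="light",
                          stopwords=(), synonym_groups=(), stem_overrides=None):
    """Assemble an analysis scheme description as a plain dictionary."""
    options = {"AlgorithmicStemming": stemming}
    if stopwords:
        # Stopwords are expressed as a JSON array of terms
        options["Stopwords"] = json.dumps(list(stopwords))
    if synonym_groups:
        # Each group lists terms that should be treated as equivalent
        options["Synonyms"] = json.dumps(
            {"groups": [list(group) for group in synonym_groups]}
        )
    if stem_overrides:
        # Maps a term to the stem that should override algorithmic stemming
        options["StemmingDictionary"] = json.dumps(stem_overrides)
    return {
        "AnalysisSchemeName": name,
        "AnalysisSchemeLanguage": language,
        "AnalysisOptions": options,
    }


scheme = build_analysis_scheme(
    "en_custom",  # hypothetical scheme name
    "en",
    stopwords=["the", "a"],
    synonym_groups=[["laptop", "notebook"]],
)
```

A dictionary of this shape could then be handed to an SDK call that defines the scheme on a domain; the domain must be re-indexed before the new scheme takes effect.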

Conclusion

Apache Solr and Elasticsearch are packaged with varied libraries with distinct functions of analyzers, tokenizers, and filters. Also, these libraries are allowed to be customized which gives greater flexibility for the developers.

Amazon CloudSearch doesn’t carry sophisticated tokenizers or filter libraries like Apache Solr or Elasticsearch, but it has simplified the configuration. Amazon CloudSearch tokenizers and filters cover most common search requirements and use cases. This is ideal for developers who want to quickly integrate search functionality into their application stack.





5.2 Faceting

Faceting is the arrangement of search results into categories or groups, based on indexed terms. Faceting allows search results to be categorized into sub-groups, which can be used as the basis for filters or further searches. Search engines also compute result counts for each facet efficiently. For example, facets for ‘Laptop’ search results can be ‘Price’, ‘Operating System’, ‘RAM’, or ‘Shipping Method’.

Faceting is a popular function that helps consumers filter through search results easily and effectively.

Apache Solr

Apache Solr has extensive faceting options, ranging from simple to very advanced faceting behavior.

The table below details the parameters used during faceting. They can be grouped by field value, date, range, pivot, multi-select, and interval.

Facet grouping | Parameters
Field value | facet.field, facet.prefix, facet.sort, facet.limit, facet.offset, facet.mincount, facet.missing, facet.method, facet.enum.cache.minDf, facet.threads
Date | facet.date, facet.date.start, facet.date.end, facet.date.gap, facet.date.hardend, facet.date.other, facet.date.include
Range | facet.range, facet.range.start, facet.range.end, facet.range.gap, facet.range.hardend, facet.range.other, facet.range.include
Pivot | facet.pivot, facet.pivot.mincount
Interval | facet.interval, facet.interval.set
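To show how the field value parameters combine in practice, here is a small sketch that builds a Solr select URL with field faceting enabled. The host, port, core name (`products`), and field names are hypothetical; the parameters used (`facet`, `facet.field`, `facet.limit`, `facet.mincount`) are taken from the table above. The sketch only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def solr_facet_query(base_url, query, facet_fields, limit=10, mincount=1):
    """Build a select URL that returns facet counts for the given fields."""
    params = [
        ("q", query),
        ("rows", 0),              # facet counts only, no documents
        ("facet", "true"),
        ("facet.limit", limit),
        ("facet.mincount", mincount),
    ]
    # facet.field may be repeated, once per field to facet on
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


facet_url = solr_facet_query(
    "http://localhost:8983/solr/products/select",  # hypothetical endpoint
    "laptop",
    ["operating_system", "ram"],
)
```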

Elasticsearch

Elasticsearch has deprecated facets and announced that they will be removed in a future release. The Elasticsearch team felt that their facet implementation was not designed from the ground up to support complex aggregations. Elasticsearch will be replacing facets with aggregations in their next release.

Elasticsearch says “An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents. The context of the execution defines what this document set is (for example, a top-level aggregation executes within the context of the executed query/filters of the search request).”

Elasticsearch strongly recommends migrating from facets to aggregations. The aggregations are classified into two main families, Bucketing and Metric.

The following table lists the aggregations available in Elasticsearch.





Elasticsearch Aggregators

Min Aggregation, Max Aggregation, Sum Aggregation, Avg Aggregation, Stats Aggregation, Extended Stats Aggregation, Value Count Aggregation, Percentiles Aggregation, Percentile Ranks Aggregation, Cardinality Aggregation, Geo Bounds Aggregation, Top hits Aggregation, Scripted Metric Aggregation, Global Aggregation, Filter Aggregation, Filters Aggregation, Missing Aggregation, Nested Aggregation, Reverse nested Aggregation, Children Aggregation, Terms Aggregation, Significant Terms Aggregation, Range Aggregation, Date Range Aggregation, IPv4 Range Aggregation, Histogram Aggregation, Date Histogram Aggregation, Geo Distance Aggregation, GeoHash grid Aggregation
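The two aggregation families can be combined in one request. The sketch below builds a request body in which a bucketing aggregation (Terms) groups documents and a metric aggregation (Avg) is computed inside each bucket. The helper name and the field names `operating_system` and `price` are hypothetical; the body is only constructed and printed, not sent to a cluster.

```python
import json


def terms_with_avg(bucket_field, metric_field, size=10):
    """Bucket documents by one field and compute an average per bucket."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {
            "by_" + bucket_field: {
                "terms": {"field": bucket_field, "size": size},
                "aggs": {
                    "avg_" + metric_field: {"avg": {"field": metric_field}}
                },
            }
        },
    }


body = terms_with_avg("operating_system", "price")
print(json.dumps(body, indent=2))
```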

Amazon CloudSearch

Amazon CloudSearch simplifies facet configuration, which is done when defining indexing options during CloudSearch domain configuration. These facets are targeted at common use cases like e-commerce, online travel, and classifieds. Any field with a date, literal, or numeric data type can be facet-enabled. Amazon CloudSearch also allows buckets to be defined so that facet counts are calculated for particular subsets of the facet values.

The facet information can be retrieved in two ways:

• Sort: returns facet information sorted either by facet counts or facet values.
• Buckets: returns facet information for particular facet values or ranges.

During searching, facet information can be fetched for any facet-enabled field by specifying the “facet.FIELD” parameter in the search request (‘FIELD’ is the name of a facet-enabled field).

Amazon CloudSearch does allow multiple facets which help to refine search results further. See the below example.

Example: "q=poet&facet.genres={}&facet.rating={}&facet.year={}&return=_no_fields"
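The same kind of request can be assembled programmatically. The following sketch builds the query string for a search with several `facet.FIELD` parameters, one using the sort option and one using buckets. The helper name and the bucket ranges are assumptions for illustration; the field names come from the example above.

```python
import json
from urllib.parse import parse_qs, urlencode


def cloudsearch_facet_query(query, facets):
    """Build a search query string with one facet.FIELD parameter per facet."""
    params = [("q", query), ("return", "_no_fields")]
    for field, options in facets.items():
        # Each facet's options are passed as a compact JSON object
        params.append(("facet." + field, json.dumps(options, separators=(",", ":"))))
    return urlencode(params)


qs = cloudsearch_facet_query("poet", {
    "genres": {"sort": "count", "size": 5},
    "year": {"buckets": ["[1970,1979]", "[1980,1989]"]},
})
parsed = parse_qs(qs)  # decode again to inspect what would be sent
```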

Conclusion

All three search engines allow users to perform faceting with minimal effort. However, in terms of an advanced complex implementation, the approaches are different for each search engine.

5.3 Auto Suggestion

When a user types a search query, suggestions relevant to the query input are presented, and as the user types more characters, refined suggestions are presented. This feature is called auto-suggest. Auto-suggest is an appealing and useful feature employed in many search user interfaces.

This feature can be implemented at the search engine level or at the search application level. Below are some options available in these three search engines.





Apache Solr

Apache Solr has native support for the auto-suggest feature. It can be implemented using NGramFilterFactory, EdgeNGramFilterFactory, or TermsComponent. Usually, this Apache Solr feature is used in conjunction with jQuery or other asynchronous client libraries to create a powerful auto-suggestion user experience in front-end applications.

Elasticsearch

Elasticsearch supports edge n-grams, which are easy to set up, flexible, and fast. Elasticsearch also introduced a data structure for suggestions, the Finite State Transducer (FST), which resembles a large graph. This structure is held in memory, which makes lookups much faster than a term-based query could be. Elasticsearch recommends using edge n-grams when the query input and its word ordering are less predictable.
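The edge n-gram idea used by both Apache Solr and Elasticsearch can be sketched independently of either engine. Each indexed term is expanded into its leading prefixes, so that whatever the user has typed so far matches as an exact term. The function names and the in-memory dictionary index below are assumptions for illustration, not either engine's implementation.

```python
from collections import defaultdict


def edge_ngrams(term, min_gram=1, max_gram=10):
    """Return the prefixes of term with lengths from min_gram to max_gram."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_prefix_index(terms, min_gram=1, max_gram=10):
    """Index every edge n-gram of every term (index-time analysis)."""
    index = defaultdict(set)
    for term in terms:
        for gram in edge_ngrams(term.lower(), min_gram, max_gram):
            index[gram].add(term)
    return index


def suggest(index, typed):
    """Look up the typed prefix as an exact key (query-time analysis)."""
    return sorted(index.get(typed.lower(), ()))


prefix_index = build_prefix_index(["Laptop", "Lamp", "Phone"])
suggestions = suggest(prefix_index, "la")
```

The trade-off is index size: every term is stored once per prefix, in exchange for suggestions that need only a single exact lookup.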

Amazon CloudSearch

Amazon CloudSearch offers ‘Suggesters’ to achieve auto-suggest. CloudSearch Suggesters are configured on a particular text field. When a Suggester is queried with a search string, CloudSearch lists all documents in which the value of the Suggester field begins with that search string. Suggesters can be configured to find matches for the exact query, or to perform fuzzy matching to correct the query string. ‘Fuzzy Matching’ can be defined with a fuzziness level of Low, High, or Default.

Suggesters can also be configured with a SortExpression, which computes a score for each suggestion. It is important to re-index the domain when a new Suggester is configured; suggestions will not be returned until all of the documents are indexed.

Conclusion

Amazon CloudSearch provides simple yet powerful ‘Suggest’ implementation, which is sufficient for most of the applications. If you are looking for advanced options or any further customizations on ‘Suggestions’, Apache Solr and Elasticsearch offer some good options.





5.4 Highlighting

Highlighting is a way of giving formatting clues to end users in the search results. It is a valuable feature in which front-end search applications highlight snippets of text from each search result. This conveys to end users why the result document matched their query. In this section, we describe the options present in all three search engines.

Apache Solr

Apache Solr includes the document text fragments that matched the query in the query response. These fragments are returned in a highlighted section of the response, which search clients use as a cue for presentation. Apache Solr is packaged with a good collection of highlighting options which give control over the text fragments, fragment size, fragment formatting, and so on. These highlighters can be used with the Solr query parsers and request handlers.

Apache Solr comes with three highlighting utilities:

• Standard Highlighter
• FastVector Highlighter
• Postings Highlighter

Standard Highlighter is most commonly used by search engineers because it is a good choice for a wide variety of search use-cases. The FastVector Highlighter is ideal for large documents and highlighting text in a variety of languages. The Postings Highlighter works well for full-text keyword search.

Elasticsearch

Elasticsearch also allows for highlighting search results on one or more fields. The implementation uses a Lucene based highlighter, fast-vector-highlighter and postings-highlighter. In Elasticsearch, the highlighter can be configured in the query to force a specific highlighter type. This is a very flexible option for developers to choose a specific highlighter to suit their requirements.

The three highlighters present in Elasticsearch behave the same way as their Apache Solr counterparts, because both sets of highlighters are inherited from Lucene.

Amazon CloudSearch

Amazon CloudSearch simplifies the highlighting by specifying the highlight.FIELD parameter in the search request. Amazon CloudSearch returns excerpts with the search results to show where the search terms occur within a particular field of a matching document.

For example, the search terms ‘Smart Phone’ are highlighted in the description field:

"Highlights": {"description": "A *smartphone* is a mobile phone with an advanced mobile operating system. They typically combine the features of a cell phone with those of other popular mobile devices, such as personal digital assistant (PDA), media player and GPS navigation unit. A *smartphone* has a touchscreen user interface and can run third-party apps, and are camera phones."}

Amazon CloudSearch also provides controls like number of search term occurrences within an excerpt, how they should be highlighted, plain text or HTML and so on.
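The controls mentioned above (how many occurrences to mark and how to mark them) can be illustrated with a toy highlighter. This is only a sketch of the concept and not any engine's implementation; the parameter names `pre_tag`, `post_tag`, and `max_phrases` are chosen here for illustration and should be checked against the service documentation before being used in a real request.

```python
import re


def highlight(text, term, pre_tag="*", post_tag="*", max_phrases=None):
    """Wrap occurrences of term in pre_tag/post_tag, case-insensitively.

    max_phrases=None marks every occurrence; a positive number marks
    only that many occurrences, counted from the start of the text.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    count = max_phrases or 0  # re.sub treats 0 as "replace all"
    return pattern.sub(lambda m: pre_tag + m.group(0) + post_tag, text, count=count)


excerpt = highlight(
    "A smartphone is a mobile phone. A Smartphone has a touchscreen.",
    "smartphone",
)
html_excerpt = highlight("a smartphone", "smartphone", pre_tag="<em>", post_tag="</em>")
```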

Conclusion

From a development perspective, all three search engines provide easy and simple highlighting implementations. If you are looking for different and more advanced highlighting options, Apache Solr and Elasticsearch have some good features.





Feature 6: Multilingual support

Multilingualism is a very important feature for global applications which cater to non-English speaking geographies. A survey by a leading information measurement company reveals that search engines built with multilingual features are emerging and successful because of their native language support and their focus on the cultural background of the users.

Business Impact: A multilingual search is an effective marketing strategy to get the attention of consumers. In e-commerce, presenting the platform in the native language of the customer creates an opportunity to do more business.

Apache Solr

Apache Solr is packaged with multilingual support for the most common languages. Apache Solr carries many language-specific tokenizer and filter libraries which can be configured for indexing and querying.

Apache Solr engineering forums recommend using multi-core architecture where each core manages one language. Solr also supports language detection using Tika and LangDetect detection features. This helps to map the text data to language-specific fields during indexing.

Elasticsearch

Elasticsearch has incorporated a vast collection of language analyzers for most commonly spoken languages. The primary role of the language analyzer is to split, stem, filter, and apply required transformations specific to the language.

Elasticsearch also allows a user to define a custom analyzer that can be a base extension of another analyzer.

Amazon CloudSearch

Amazon CloudSearch has strong support for language-specific text processing. It provides predefined default analysis schemes for 34 languages. Amazon CloudSearch processes text and text-array fields based on the configured language-specific analysis scheme.

Amazon CloudSearch also allows a user to define a new analysis scheme that can be an extension of the default language analysis scheme.

Conclusion

All three search engines have ample and effective support features for widely spoken international languages.





Languages Support

The table below lists the languages supported by each search engine

Search engine | Languages supported
Apache Solr | Arabic, Brazilian Portuguese, Bulgarian, Catalan, Chinese, Simplified Chinese, CJK, Czech, Danish, Dutch, Finnish, French, Galician, German, Greek, Hebrew, Lao, Myanmar, Khmer, Hindi, Indonesian, Italian, Irish, Japanese, Latvian, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Scandinavian, Serbian, Spanish, Swedish, Thai and Turkish
Elasticsearch | Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai and Turkish
Amazon CloudSearch | Arabic, Armenian, Basque, Bulgarian, Catalan, Chinese - Simplified, Chinese - Traditional, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hebrew, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and Thai





Feature 7: Protocol & API Support

7.1 Request and Response formats

Search engine | Request formats | Response formats
Apache Solr | XML, JSON, CSV | JSON, XML, CSV
Elasticsearch | JSON | XML, JSON
Amazon CloudSearch | XML, JSON | XML, JSON

7.2 External Integrations

Search engine | Integrations available
Apache Solr | Drupal, Magento, Django, ColdFusion, Wordpress, OpenCMS, Plone, Typo3, ez Publish, Symfony2, Riak, DataStax Enterprise Search, Cloudera Search, Hortonworks Data Platform, MapR. See https://wiki.apache.org/solr/IntegratingSolr#Integrating_Solr_With_Other_.28Non_Search.29_Applications
Elasticsearch | Drupal, Django, Symfony2, Wordpress, CouchBase, SearchBlox, Hortonworks Data Platform, MapR. See http://www.elasticsearch.org/guide/en/elasticsearch/client/community/current/integrations.html
Amazon CloudSearch |

7.3 Protocols Support

Search engine | Protocols support
Apache Solr | HTTP, HTTPS
Elasticsearch | HTTP, HTTPS
Amazon CloudSearch | HTTP, HTTPS




http://www.elasticsearch.org/guide/en/elasticsearch/client/community/current/integrations.html





Feature 8: High Availability

All three search engines are architected for

• High availability (HA)

• Replication

• Scaling design principles

In this section, we discuss the high availability options present in these three search engines.

8.1 Replication

Replication is copying or synchronizing the search index from master nodes to slave nodes for managing the data efficiently.

Replication is a key design principle for highly available and scalable search. From a high availability perspective, replication enables failover from master nodes (shards or leaders) to slave nodes (replicas). From a scaling perspective, replication is used to add slave or replica nodes when the request traffic increases.

Apache Solr

Apache Solr supports two models of replication, namely legacy mode and SolrCloud. In legacy mode, the replication handler copies data from the master node index to the slave nodes. The master server manages all index updates and the slave nodes handle read queries. This separation of master and slave allows Solr clusters to scale to heavy query loads.

Apache SolrCloud is an advanced distributed cluster setup of Solr nodes designed for high availability and fault tolerance. Unlike legacy mode, there is no explicit concept of "master/slave" nodes. Instead, the search cluster is split into leaders and replicas. The leader is responsible for ensuring that the replicas are updated with the same data stored in the leader. Apache Solr has a configuration called ‘numShards’ which defines the number of shards (leaders). During start-up, the core index is split across ‘numShards’ shards, and the shards are represented by leaders. Nodes that join the Solr cluster after the initial ‘numShards’ nodes are automatically assigned as replicas for the leaders.
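The assignment rule described above can be modelled in a few lines. This is a simplified sketch, not SolrCloud's actual logic: the function `assign_roles` and the node names are hypothetical, and spreading later nodes over the shards in round-robin order is an assumption made for the illustration.

```python
def assign_roles(nodes, num_shards):
    """First num_shards nodes become leaders; later nodes become replicas."""
    shards = {}
    for position, node in enumerate(nodes):
        shard = "shard%d" % (position % num_shards + 1)
        if position < num_shards:
            shards[shard] = {"leader": node, "replicas": []}
        else:
            # Assumed policy: later nodes are spread over the shards in turn
            shards[shard]["replicas"].append(node)
    return shards


layout = assign_roles(["node1", "node2", "node3", "node4", "node5"], num_shards=2)
```

With two shards and five nodes, the first two nodes lead one shard each and the remaining three are attached as replicas.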

Elasticsearch

Elasticsearch follows a similar concept to SolrCloud. In brief, an Elasticsearch index can be split into multiple shards and each shard can be replicated into any number of nodes (0, 1, 2 …n). When replication is completed, the index will have primary shards and replica shards. During index creation, the number of shards and replicas are defined. The number of replicas can be dynamically changed, but the shards count cannot.
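The difference between the fixed shard count and the adjustable replica count shows up in the request bodies. The sketch below builds the two bodies as plain dictionaries: one for index creation, where both values are set, and one for a later settings update, where only the replica count changes. The helper names are hypothetical and nothing is sent to a cluster.

```python
def index_settings(shards, replicas):
    """Body for index creation: both counts are defined up front."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


def replica_update(replicas):
    """Body for a settings update: only the replica count can change."""
    return {"index": {"number_of_replicas": replicas}}


create_body = index_settings(shards=3, replicas=1)
update_body = replica_update(replicas=2)
```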

Apache Solr and Elasticsearch

Both Apache Solr and Elasticsearch support synchronous and asynchronous replication models. If replication is configured in ‘synchronous’ mode, the primary (leader) shard waits for successful responses from the replica shards before acknowledging the commit. If the model is ‘asynchronous’, the response is returned to the client as soon as the request is executed on the primary or leader shard, and the request is forwarded to the replicas asynchronously.
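The difference between the two modes can be shown with a toy model. The classes below are hypothetical and run in a single process; the `flush` method stands in for the background forwarding that a real cluster performs. The point is only what the replicas contain at the moment the client receives its response.

```python
class Replica:
    def __init__(self):
        self.docs = []

    def apply(self, doc):
        self.docs.append(doc)
        return True  # acknowledgement back to the primary


class Primary:
    def __init__(self, replicas, synchronous=True):
        self.docs = []
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # writes not yet forwarded (asynchronous mode)

    def index(self, doc):
        self.docs.append(doc)
        if self.synchronous:
            # Wait for every replica to acknowledge before answering the client
            return all(replica.apply(doc) for replica in self.replicas)
        self.pending.append(doc)  # forwarded later; the client is answered now
        return True

    def flush(self):
        """Forward queued writes to the replicas (stands in for background work)."""
        while self.pending:
            doc = self.pending.pop(0)
            for replica in self.replicas:
                replica.apply(doc)


sync_replica = Replica()
sync_ack = Primary([sync_replica], synchronous=True).index("doc-1")

async_replica = Replica()
async_primary = Primary([async_replica], synchronous=False)
async_ack = async_primary.index("doc-1")
docs_at_ack = list(async_replica.docs)  # replica state when the client was answered
async_primary.flush()
```

In synchronous mode the replica already holds the document when the acknowledgement is returned; in asynchronous mode it catches up afterwards.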

The diagram below depicts the replication concept which is followed in Solr and Elasticsearch.





Figure: Replication handling in Apache Solr and Elasticsearch

• S1: Node 1, Shard 1 of the cluster
• S2: Node 2, Shard 1 of the cluster
• R1: Node 3, Replica 1 of Shard 1




Amazon CloudSearch

Amazon CloudSearch is simple and refined when it comes to handling replication and streamlines the job of search engineers and administrators. During the configuration of scaling, Amazon CloudSearch prompts for the desired replication count which should be based on load requirements.

Amazon CloudSearch will automatically scale the replicas for a domain up and down based on the request traffic and data volume, but never below the desired replication count. In Amazon CloudSearch, the replication scaling option can be changed at any time. If the scale requirement is temporary (for example, anticipated spikes because of a seasonal sale), the desired replication count of the domain can be raised in advance and then reverted after the request volume returns to a steady state. Modifying the replication count does not require any index rebuilding, but the time taken for the replicas to sync depends on the size of the search index.

The benefits of the Amazon CloudSearch replication model are:

• The search instance capacity is automatically replicated and load is distributed, so the search layer is robust and highly available at all times.

• Improved fault tolerance. If any one of the replicas is down, the other replica(s) will continue to handle requests while the failed replica is in recovery mode.

• The entire process of scaling and distribution is automated and avoids manual intervention and support.





Conclusion

All three search engines have a good base to support the ‘replication’ feature. Apache Solr and Elasticsearch allow you to define your own replication topology, which can be configured for synchronous or asynchronous replication. They can be scaled manually, or automatically by writing custom programs, based on application requirements. However, substantial operational effort is required if cluster replication is set up at enterprise scale.

Amazon CloudSearch fully manages replication by handling scaling, load distribution, and fault tolerance. This simplicity saves operations costs for enterprises.

8.2 Failover

Failover is a back-end operation that switches to secondary or standby nodes in the event of primary server failure. Failover is an important fault tolerance function for systems with low or zero downtime requirements.

Apache Solr and Elasticsearch

When an Apache Solr or Elasticsearch cluster is built with shards and replicas, the cluster inherently becomes fault-tolerant and automatically supports failover.

During any failure, a cluster is expected to support the operations while the failed node is put into recovery state. Both the Apache Solr and Elasticsearch documentation strongly recommend a distributed cluster setup to protect user experience from application or infrastructure failure.

If all nodes storing a shard and its replicas fail, client requests will also fail. If the shards are set to a tolerant configuration, partial results can be returned from the available shards. This behavior is expected in both Apache Solr and Elasticsearch.

The representation below depicts how failover is handled in a cluster. This flow is applicable to both Solr and Elasticsearch.





Node | Replica 1 | Replica 2
SHARD 1 – Node Number 1 | SHARD 1 FIRST REPLICA – Node Number 3 | SHARD 1 SECOND REPLICA – Node Number 5
SHARD 2 – Node Number 2 | SHARD 2 FIRST REPLICA – Node Number 4 | SHARD 2 SECOND REPLICA – Node Number 6

The table below illustrates the failure scenarios in a search cluster.

Scenario | Outcome
Scenario A | If SHARD 1 fails, one of its replica nodes, either Node Number 3 or Node Number 5, is chosen as leader.
Scenario B | If SHARD 2 fails, one of its replica nodes, either Node Number 4 or Node Number 6, is chosen as leader.
Scenario C | If SHARD 1 REPLICA 1 fails, Shard 1 Replica 2 continues to support replication and to serve requests.
Scenario D | If SHARD 2 REPLICA 1 fails, Shard 2 Replica 2 continues to support replication and to serve requests.
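The scenarios can be replayed with a small model of the six-node cluster. The function below is a hypothetical sketch: it promotes the first remaining replica when a leader fails, which is an assumption made for simplicity. Real clusters choose the new leader through their own election mechanisms.

```python
def promote_on_failure(cluster, failed_node):
    """Remove a failed node; if it led a shard, promote one of its replicas."""
    for shard in cluster.values():
        if failed_node in shard["replicas"]:
            # Scenarios C and D: the surviving replica keeps serving requests
            shard["replicas"].remove(failed_node)
        elif shard["leader"] == failed_node:
            if shard["replicas"]:
                # Scenarios A and B: a replica takes over as leader
                shard["leader"] = shard["replicas"].pop(0)
            else:
                shard["leader"] = None  # no copy left, the shard is unavailable
    return cluster


cluster = {
    "shard1": {"leader": 1, "replicas": [3, 5]},
    "shard2": {"leader": 2, "replicas": [4, 6]},
}
promote_on_failure(cluster, 1)  # Scenario A: node 1 (SHARD 1) fails
promote_on_failure(cluster, 4)  # Scenario D: node 4 (SHARD 2 first replica) fails
```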

Elasticsearch uses its internal Zen Discovery to detect failures. If the node holding a primary shard dies, a replica is promoted to the role of primary. Apache Solr uses Apache ZooKeeper for coordination, failure detection, and leader voting. ZooKeeper initiates a leader election process among the replicas when a leader/primary shard fails.

Amazon CloudSearch

Amazon CloudSearch has built-in failover support. Amazon CloudSearch recommends scaling options and availability options to increase fault tolerance in the event of a service disruption or node failures.

When Multi-AZ is turned on, Amazon CloudSearch provisions the same number of instances in your search domain in the second availability zone within that region. The instances in the primary and secondary zones are capable of handling a full load in the event of any failure.

In the event of a service disruption or failure in one availability zone, the traffic requests are automatically redirected to the secondary availability zone. In parallel, Amazon CloudSearch self-heals the cluster in failure, and Multi-AZ restores the nodes without any administrative intervention. During this switch, in-flight queries might fail and will need to be retried from the front-end application side.
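A client-side retry for such in-flight failures might look like the sketch below. It is a generic pattern rather than part of any SDK: the function name, the choice of `ConnectionError` as the retryable error, and the delay values are assumptions. The `sleep` argument is injectable so the behavior can be exercised without waiting.

```python
import time


def search_with_retry(send_request, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call send_request, retrying with exponential backoff on connection errors."""
    for attempt in range(attempts):
        try:
            return send_request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Usage with a stand-in request that fails twice before succeeding
calls = {"count": 0}
delays = []


def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("request interrupted by failover")
    return {"hits": 0}


result = search_with_retry(flaky_search, sleep=delays.append)
```

Only idempotent requests such as search queries should be retried blindly in this way.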

By increasing the partitions and replicas in the Amazon CloudSearch scaling options, failover support can be improved. If there's a failure in one of the replicas or partitions, the other nodes (replica or partition) will handle requests and support while it is being recovered.

Amazon CloudSearch is very sophisticated in terms of handling failure, as the node health is continuously monitored. In the event of infrastructure failures, the nodes are automatically recovered or replaced.

Conclusion

Failover can be architected by applying techniques like replication, sharding, service discovery, and failure-detection services. Apache Solr and Elasticsearch advocate building your search system in ‘Cluster mode’ to address failover. They undertake that responsibility by employing service discovery which can detect unhealthy nodes. The service discovery maintains the cluster information and rebalances the search cluster when node failures are detected.

Amazon CloudSearch supports failover for single node as well as for cluster mode. Behind the scenes, CloudSearch continuously monitors the health of the search instances and they are automatically managed during failures.





Feature 9: Scaling

The ability to scale in terms of computing power, memory, or data volume is essential in any data- and traffic-bound application. Scaling is a significant design principle employed to improve performance, load balancing, and high availability.

Over time, the search cluster is expected to be scaled horizontally (scale out) or vertically (scale up) depending upon the needs.

Scale-up is the process of moving from a small server to a large server. Scale-out is the process of adding multiple servers to handle the load. The scaling strategy should be selected based on application requirements.

Apache Solr and Elasticsearch

Scaling an Apache Solr or Elasticsearch application involves manual processes. These can include a simple server addition task or advanced tasks like cluster topology changes, storage changes, or infrastructure upgrades.

If vertical scaling takes place, the search cluster needs to follow processes like new setup and configuration, downtime, node restarts, etc. If scaling is horizontal, the process may involve re-sharding, rebalancing, or cache warming.

While a search cluster system can benefit from powerful hardware, vertical scaling has its own limitations. Upgrading or increasing the infrastructure specifications on the same server can involve tasks like:

• New setup
• Backup
• Down time
• Application re-testing

Scaling out is generally considered an easier task than scaling up.

An expert search administrator (Apache Solr or Elasticsearch) is usually assigned to keep a close watch on the performance of the search servers. Infrastructure and search metrics play a key role in the administrator's decision making.

When these metrics exceed the threshold of a particular server and start affecting overall performance, new server(s) have to be spawned manually. The scaling task can also extend to index partitioning, auto-warming, caching, and re-routing/distribution of the search queries to the new instances. It requires a Solr or Elasticsearch expert on your team to identify and execute this activity periodically.

Sharding and Replication

Though scaling up, scaling out, and scaling down involve manual work, technology-driven companies automate this process by developing custom programs. These programs continuously monitor the cluster and make elastic scaling decisions. The result is quite similar to what AWS Auto Scaling offers.
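As a rough illustration of such a custom program, the sketch below makes one scale-out or scale-in decision from sampled metrics. The `NodeMetrics` fields, the `decide` function, and every threshold are hypothetical values chosen for illustration; they are not part of Apache Solr, Elasticsearch, or any AWS API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class NodeMetrics:
    cpu_percent: float       # average CPU utilization of the node
    query_latency_ms: float  # average search latency observed on the node


def decide(samples, node_count, min_nodes=2, max_nodes=8,
           high_cpu=75.0, low_cpu=25.0, high_latency_ms=500.0):
    """Return the desired node count for the next monitoring cycle."""
    avg_cpu = mean(s.cpu_percent for s in samples)
    avg_latency = mean(s.query_latency_ms for s in samples)

    # Scale out when the cluster is hot or slow, but never beyond max_nodes.
    if avg_cpu > high_cpu or avg_latency > high_latency_ms:
        return min(node_count + 1, max_nodes)
    # Scale in when the cluster is idle, but never below min_nodes.
    if avg_cpu < low_cpu:
        return max(node_count - 1, min_nodes)
    # Otherwise keep the current size.
    return node_count
```

A real program would run this decision in a loop and follow it with the provisioning, rebalancing, and cache-warming steps described above.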

In terms of administration functionality, both Apache Solr and Elasticsearch offer scaling techniques called Sharding and Replication.

Sharding (which means partitioning) is a method in which a search index is split into multiple logical units called "shards". Administrators recommend sharding when the indexed documents outgrow the physical capacity available to the collection. When sharding is enabled, search requests are distributed to every shard in the collection; the results are collected from each shard and then merged.
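The distribute-then-merge flow can be sketched in a few lines. This is a simplified illustration, not how Solr or Elasticsearch score documents: each shard is modeled as a plain dictionary, and the scoring (a term count) and the function names are assumptions made for the example.

```python
import heapq


def search_shard(shard, term):
    """Score documents in one shard by how often the term occurs in the text."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term.lower())
        if score > 0:
            hits.append((score, doc_id))
    return hits


def distributed_search(shards, term, top_n=3):
    """Send the query to every shard, then merge the partial results."""
    partial_results = [search_shard(shard, term) for shard in shards]
    merged = [hit for hits in partial_results for hit in hits]
    # Keep only the top_n highest-scoring hits across all shards.
    best = heapq.nlargest(top_n, merged)
    return [doc_id for _score, doc_id in best]
```

For example, with two shards `{"a": "solr search search", "b": "cloud"}` and `{"c": "search search search", "d": "search engine"}`, a query for "search" visits both shards and merges their hits into a single ranked list.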

Another scaling technique, replication (discussed in detail in 8.1 Replication), allows adding new servers with redundant copies of your index data to handle higher concurrent query loads by distributing the requests across multiple nodes.
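A minimal sketch of distributing requests across replicas follows, assuming a simple round-robin policy. The `ReplicaRouter` class and the node names are hypothetical; real deployments typically rely on a load balancer or the cluster's own routing.

```python
from itertools import cycle


class ReplicaRouter:
    """Distribute queries across replicas that hold identical index copies."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self._replicas = cycle(replicas)

    def route(self, query):
        # Each call hands the query to the next replica in turn.
        return next(self._replicas), query
```

Adding a replica to the list raises the number of concurrent queries the tier can serve without changing the index itself.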

Amazon CloudSearch

Amazon CloudSearch is a fully managed search service; it scales up and down seamlessly as the amount of data or query volume changes. CloudSearch can be scaled based on the data volume or on the request traffic. When the search data volume increases, CloudSearch can be scaled from a smaller search instance type to a larger one. If the capacity of the largest search instance type is also exceeded, CloudSearch partitions the search index across multiple search instances (the sharding technique).

When traffic and concurrency grow, Amazon CloudSearch deploys additional search instances (replicas) to support the traffic load. This automation reduces the complexity and manual labour required in the scaling out process. Conversely, when the traffic drops, Amazon CloudSearch scales down your search domain by removing the additional search instances in order to minimize costs.

The Amazon CloudSearch management console allows users to configure the desired partition count and the desired replication count. The AWS console also allows changing the instance type (scaling up) at any time. This inherent elastic scaling behavior is one of the most important points in favor of Amazon CloudSearch.
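The same settings can also be applied programmatically. The sketch below builds the arguments for the CloudSearch `UpdateScalingParameters` configuration call as exposed by the AWS SDK for Python (boto3). The helper name `build_scaling_request`, the domain name, and the chosen values are assumptions for illustration; the commented usage requires boto3 and configured AWS credentials.

```python
def build_scaling_request(domain_name, instance_type=None,
                          replication_count=None, partition_count=None):
    """Build keyword arguments for the CloudSearch UpdateScalingParameters call."""
    params = {}
    if instance_type is not None:
        params["DesiredInstanceType"] = instance_type
    if replication_count is not None:
        params["DesiredReplicationCount"] = replication_count
    if partition_count is not None:
        params["DesiredPartitionCount"] = partition_count
    return {"DomainName": domain_name, "ScalingParameters": params}


# Usage (assumes boto3 is installed and AWS credentials are configured;
# "my-domain" is a hypothetical domain name):
#   import boto3
#   client = boto3.client("cloudsearch")
#   client.update_scaling_parameters(**build_scaling_request(
#       "my-domain", instance_type="search.m3.large", replication_count=2))
```

Only the options that are explicitly passed are sent, so the remaining scaling decisions stay with the service.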

Conclusion

Scaling in search is implemented in the form of Sharding and Replication. All three search engines have strong scaling support for setting up the search tier in ‘cluster mode’.

Scaling in Apache Solr and Elasticsearch often requires administration, as there is no hard and fast rule. Techniques like elastic scaling can be implemented only up to a limit; when the cluster grows further, manual intervention and careful planning are required. Vertical scaling in Apache Solr and Elasticsearch is even more delicate. It requires individual management of the nodes in the cluster and is executed using techniques like ‘rolling restarts’ and custom scripts.

Amazon CloudSearch takes away all the operational intricacies from the administrators. With the desired partition count and desired replication count options, CloudSearch automatically scales up and down based on the volume of data and request traffic. This saves a lot of effort and cost on operations and management.





Feature 10: Customization

At times, the search system or its software may not have support for a specific feature or built-in integration with other systems. In such cases, most open source software allows developers to customize and extend the desired features as plugins, extensions, or modules. Often, the developer community shares extension libraries that serve practical needs. These libraries can be customized and integrated with the system.

Apache Solr and Elasticsearch

Apache Solr and Elasticsearch come from the same open source lineage (both are built on Apache Lucene) and allow customizations on:

• Analyzers
• Filters
• Tokenizers
• Language analysis
• Field types
• Validators
• Fall back query analysis
• Alternate query custom handlers

Since both products are open source, developers can customize or extend the libraries to fit the required feature modifications through plugins and libraries. Build and deployment become the developer’s responsibility after extending the code base.

Apache Solr and Elasticsearch have many plugin extensions that allow developers to add custom functionality for a variety of purposes. These plugins are configured as special libraries and are referenced by the application through configuration mapping.
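To make the customization points concrete, the sketch below shows how a tokenizer and a chain of filters compose into an analyzer. It is a conceptual illustration in plain Python, not the Solr or Elasticsearch plugin API; all function names are hypothetical.

```python
import re


def whitespace_tokenizer(text):
    """Tokenizer: split raw text into tokens on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Filter: normalize every token to lower case."""
    return [t.lower() for t in tokens]


def punctuation_filter(tokens):
    """Filter: strip characters that are not word characters, drop empty tokens."""
    cleaned = (re.sub(r"[^\w]", "", t) for t in tokens)
    return [t for t in cleaned if t]


def make_analyzer(tokenizer, filters):
    """Compose one tokenizer and an ordered list of filters into an analyzer."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze
```

A custom filter written by a developer slots into the `filters` list in the same way the built-in ones do, which is the kind of extension a fully managed service does not expose.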

Amazon CloudSearch

Amazon CloudSearch does not allow for any customizations. The search features in Amazon CloudSearch are offered by AWS after much careful thought and collective feedback from the customers. The Amazon CloudSearch team continually evaluates new features and rolls them out proactively.

Conclusion

Amazon CloudSearch has a highly capable feature set for developing search systems. However, if you anticipate heavy customization of your search functionality, Apache Solr or Elasticsearch are better choices, as their core search libraries are open source. It is also important to note that any customization of the core libraries leaves responsibility for the build and deployment process with the developer. The customization also needs to be maintained for every version upgrade or newer release of your search engine.





Feature 11: More

11.1 Client libraries

Client libraries are required for communicating with search engines. They are essential for developers, as they supply the connecting search engine with the information it needs and allow applications to interact with it easily through high-level libraries.

Apache Solr

Apache Solr has open source API clients to interact with Solr using simple high-level methods. The client libraries are available for PHP, Ruby, Rails, AJAX, Perl, Scala, Python, .NET, and JavaScript.

Elasticsearch

Elasticsearch provides official clients for Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby. There are other community-provided client libraries that can be integrated with Elasticsearch.

Open source

Other than the official and open source client APIs, Elasticsearch and Apache Solr can be integrated using the RESTful API. A REST client can be a typical web client developed in the preferred programming language, or the API can even be called from a plain command line.
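As a small illustration of the REST approach, the sketch below builds search URLs for Solr's `select` handler and Elasticsearch's `_search` endpoint using only the Python standard library. The host names, ports, and the collection/index name are assumptions for the example.

```python
from urllib.parse import urlencode


def solr_search_url(base_url, collection, query, rows=10):
    """Build a Solr select URL of the form <base>/solr/<collection>/select."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"


def elasticsearch_search_url(base_url, index, query, size=10):
    """Build an Elasticsearch URI search URL of the form <base>/<index>/_search."""
    params = urlencode({"q": query, "size": size})
    return f"{base_url}/{index}/_search?{params}"
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen` in Python or `curl` from the command line.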

Amazon CloudSearch

Amazon CloudSearch exposes a RESTful API for configuration, document service and search.

• The configuration API can be used for CloudSearch domain creation, its configuration and end to end management.

• The document service API enables the user to add, replace, or delete documents in an Amazon CloudSearch domain.

• The search API is used for search or suggestion requests to your Amazon CloudSearch domain.
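For the document service API, updates are submitted as a batch of add and delete operations. The sketch below assembles such a batch in the JSON layout used by the CloudSearch 2013-01-01 API, where an add with an existing id replaces that document. The helper names and the sample fields are hypothetical, and uploading the batch to the domain's document endpoint is left out.

```python
import json


def add_operation(doc_id, fields):
    """An 'add' operation adds a document or replaces one with the same id."""
    return {"type": "add", "id": doc_id, "fields": fields}


def delete_operation(doc_id):
    """A 'delete' operation removes the document with the given id."""
    return {"type": "delete", "id": doc_id}


def build_batch(operations):
    """Serialize a list of operations into a JSON document batch."""
    return json.dumps(list(operations))
```

For instance, `build_batch([add_operation("1", {"title": "Scaling search"}), delete_operation("2")])` yields one batch that adds or replaces document 1 and deletes document 2.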

Alternatively, AWS provides a downloadable SDK package, which simplifies coding. The SDK is available for popular languages like Java, .NET, PHP, Python, and more. The SDK APIs are built for most Amazon Web Services, including Amazon S3, Amazon EC2, CloudSearch, DynamoDB, and more. The SDK package includes the AWS library, code samples, and documentation.


https://wiki.apache.org/solr/IntegratingSolr#Solr_Client_Libraries_.2F_Language_Bindings




Feature 12: Cost

From an overall perspective, cost is a very important factor, and companies always look for ways to reduce Total Cost of Ownership (TCO). In this section, we look at the cost components of these three search engines.

Apache Solr and Elasticsearch

The cost factor in Apache Solr and Elasticsearch includes infrastructure cost, managed services cost, and people cost. For any type of deployment, server costs and engineer costs are essential. The commitment to continuous admin operations depends on the application’s requirements and criticality.

Amazon CloudSearch

The Amazon CloudSearch cost components include server costs and engineer costs, which are essential for any search deployment, as with the two engines above. Being a fully managed service, Amazon CloudSearch covers managed services as part of the server costs. Also, Amazon CloudSearch does not charge upfront at the start of service usage; it charges at the end of the month based on usage.

Conclusion

The net operating costs are essentially the same across all three search engines, but people costs will be 30% more for self-managed Apache Solr or Elasticsearch compared to Amazon CloudSearch.

For example, a highly important and critical search application will require 24 × 7 support and managed services. This cost is incurred as managed services, which is an additional cost in Apache Solr and Elasticsearch deployments.
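The 30% figure above translates into simple arithmetic. The sketch below is only a worked illustration: the function name is hypothetical, and any input amount is a placeholder rather than a measured cost.

```python
def people_cost_self_managed(cloudsearch_people_cost, premium=0.30):
    """People cost for self-managed Solr/Elasticsearch, given the CloudSearch figure."""
    return cloudsearch_people_cost * (1 + premium)
```

Under this assumption, a placeholder people cost of 10,000 per month on Amazon CloudSearch corresponds to roughly 13,000 per month for a self-managed deployment.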

A detailed TCO Analysis between Apache Solr, Elasticsearch and Amazon CloudSearch can be read here.

Link:





Conclusion

Search is an indispensable feature in most business applications.

Apache Solr and Elasticsearch are time-proven solutions. Many larger organizations have used Apache Solr and Elasticsearch for years, but are now looking for greater operational efficiency and cost effectiveness. Other companies are looking for innovative ways to grow their businesses and provide value. In recent years, a huge number of technology companies have started to take advantage of cloud-based search services, mainly in terms of getting started quickly and then accommodating growth without the need to switch vendors. When scalability, cost, and speed-to-market are primary concerns, we recommend using some form of cloud service. And if you want to enjoy the benefits of a cloud solution built on the architecture of Apache Solr, we recommend Amazon CloudSearch.





About the Authors

Dwarak is a Principal Architect at 8KMiles with more than a decade of hands-on experience in Cloud Computing, Big Data, Web technologies and Product Management. He has varied and progressive experience in architecting distributed Web and Enterprise systems and products. He also has deep domain knowledge in the banking, finance, retail, and e-commerce industries. At present, he oversees technology consulting, architecture, delivery and end-to-end customer transformational programs at 8KMiles.

Dwarakanath Ramachandran

Harish is the Chief Technology Officer (CTO) and Co-Founder of 8KMiles. Harish has more than a decade of experience in architecting and developing cloud computing, e-commerce and mobile application systems. He has also built large Internet banking solutions that catered to the needs of millions of users, where security and authentication were critical factors. He is responsible for the overall technology direction of the 8KMiles products and services in the Cloud, Big Data and Mobility space. Harish is a thought leader in Cloud-related technologies, an advisor, and has many followers for his blogs.

Harish Ganesan





About 8KMiles

8KMiles is a solutions company that is focused on helping organizations of all sizes to integrate Cloud, Identity, and Big Data into their IT and business strategies. 8KMiles’ team of experts, located in North America and India, offers a host of services and solutions such as Cloud, Federated Identity Consulting, Cloud Engineering, Migration, Big Data services, and Managed Services on Amazon Web Services. 8KMiles offers specialized expertise in mature verticals such as Pharma, Retail, Media, Travel, and Healthcare. Visit us at www.8kmiles.com


Date post:	04-Apr-2020