
Every language, including English, presents unique and difficult challenges for search applications that must deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to effectively search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-in-class natural language processing, improving both speed and accuracy.

As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.

40 Supported Languages

Search many languages with high accuracy

KEY FEATURES

- Simple API

- Fast and scalable

- Industrial-strength support

- Easy installation

- Flexible and customizable

- Java or C++

- Component of the Rosette SDK

- Customizable features such as user dictionaries, orthographic normalization, and script conversion

- Built to work with Apache Solr™ and Elasticsearch

- Cloudera certified partner


www.basistech.com

+1 617-386-2090

Start using RBL today. Try our free product evaluation at www.basistech.com.

Improve the speed and accuracy of your search application with advanced linguistic analysis.

Improve/Verb the/Determiner speed/Noun and/Conjunction accuracy/Noun of/Preposition your/Determiner search/Noun application/Noun with/Preposition advanced/Adjective linguistic/Adjective analysis/Noun ./Punctuation

Rosette® BIG TEXT ANALYTICS

- RLI, Rosette Language Identifier: Identify languages and encodings
- RBL, Rosette Base Linguistics: Search many languages with high accuracy
- REX, Rosette Entity Extractor: Tag names of people, places, and organizations
- RNI, Rosette Name Indexer: Match names between many variations
- RNT, Rosette Name Translator: Translate foreign names into English
- RCA, Rosette Categorizer: Categorize everything in sight
- RSA, Rosette Sentiment Analyzer: Detect the sentiments of your text
- RES, Rosette Entity Resolver: Make real-world connections in your data

Results: Better Search, Tagged Entities, Real Identities, Matched Names, Sorted Languages, Translated Names, Sorted Content, Actionable Insights

TOKENIZATION

Many search tools use bigrams to understand languages written without spaces between words. This results in a larger index size and a reduction in relevancy. RBL, in contrast, accurately identifies and separates each word through advanced statistical modeling. The resulting token output (also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.
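The difference between bigram indexing and word segmentation can be sketched in a few lines of Python. This is a minimal illustration, not RBL's method: RBL uses statistical modeling, whereas the sketch substitutes a greedy longest-match lookup against a tiny hypothetical lexicon (`TOY_LEXICON`). The function names `bigrams` and `segment` are assumptions made for this example.

```python
# Toy comparison of bigram indexing and dictionary-based segmentation.
# TOY_LEXICON is a hypothetical word list; greedy longest-match is a
# stand-in for statistical segmentation, not RBL's actual algorithm.

TOY_LEXICON = {"北京", "大学", "生物", "系", "学生"}
PHRASE = "北京大学生物系"  # "Beijing University Biology Department"


def bigrams(text):
    """Return every overlapping two-character window of the text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def segment(text, lexicon):
    """Split text into the longest lexicon entries, scanning left to right.

    A single character is emitted when no longer entry matches.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


bigram_terms = bigrams(PHRASE)
word_terms = segment(PHRASE, TOY_LEXICON)

print(len(bigram_terms), "bigram terms:", bigram_terms)
print(len(word_terms), "word terms:", word_terms)
# A query for "学生" (student) matches a bigram that straddles two words,
# but matches none of the segmented tokens.
print("bigram hit:", "学生" in bigram_terms)
print("segmented hit:", "学生" in word_terms)
```

In this toy case the bigram index holds six terms, two of which are not words, while the segmented index holds four, which illustrates both the index-size and the relevancy argument.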

DECOMPOUNDING

RBL breaks down compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.
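As a rough illustration of the idea, the sketch below splits a compound into known parts using a small hypothetical lexicon (`TOY_LEXICON`) and a recursive longest-match search. The `decompound` function and its `min_len` parameter are assumptions for this example; real decompounding also has to handle linking elements and ambiguous splits, which this sketch ignores.

```python
# Toy decompounder: split a word into parts found in a hypothetical lexicon.
# This is an illustrative stand-in, not RBL's implementation.

TOY_LEXICON = {"samstag", "morgen", "haus", "tür"}


def decompound(word, lexicon, min_len=3):
    """Return the lexicon parts of a compound, or [word] if it cannot be split."""
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        # Try the longest candidate head first, then shorter ones.
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = split(word)
    return parts if parts else [word]


def index_terms(word, lexicon):
    """Index the full compound together with each of its components."""
    terms = [word.lower()]
    for part in decompound(word, lexicon):
        if part not in terms:
            terms.append(part)
    return terms


print(decompound("Samstagmorgen", TOY_LEXICON))
print(index_terms("Samstagmorgen", TOY_LEXICON))
```

Indexing the components alongside the full compound is one possible design; it lets a query for "Samstag" match a document that only contains "Samstagmorgen".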

Available Languages

WESTERN EUROPE
- Catalan*
- Czech
- Danish
- Dutch
- English
- Finnish*
- French
- German
- Greek
- Italian
- Norwegian
- Portuguese
- Spanish
- Swedish

EASTERN EUROPE
- Albanian*
- Bulgarian*
- Croatian*
- Estonian*
- Hungarian
- Latvian*
- Polish
- Romanian
- Russian
- Serbian*
- Slovak*
- Slovenian*
- Turkish
- Ukrainian*

Advanced Morphological Features

LEMMATIZATION

Most search engines utilize a crude method of chopping off characters at the end of a word in the hopes of removing unimportant differences. This method, called stemming, often results in extra recall and poor precision. Instead, RBL finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the root form increases search relevancy and slims the search index by not indexing all inflected forms. Alternative lemmas are also made available to supplement indexing.
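The contrast between stemming and lemmatization can be made concrete with a small sketch. Everything here is a toy under stated assumptions: `TOY_SUFFIXES` is a hypothetical rule list for a crude suffix stripper, and `TOY_LEMMAS` is a hand-written lookup keyed by word and part of speech that stands in for real vocabulary and morphological analysis. Neither reflects how RBL works internally.

```python
# Toy comparison of suffix-stripping stems and dictionary lemmas.
# TOY_SUFFIXES and TOY_LEMMAS are hypothetical data for illustration only.

TOY_SUFFIXES = ("ated", "als", "al", "ed", "s")

TOY_LEMMAS = {
    ("animals", "NOUN"): "animal",
    ("animated", "VERB"): "animate",
    ("several", "ADJ"): "several",
    ("spoke", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",
}


def crude_stem(word, min_stem=3):
    """Chop the first matching suffix off the end of the word."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def lemmatize(word, pos):
    """Look up the dictionary form for a word in a given part of speech."""
    return TOY_LEMMAS.get((word, pos), word)


for word, pos in TOY_LEMMAS:
    print(f"{word:10} stem={crude_stem(word):8} lemma({pos})={lemmatize(word, pos)}")
```

The stemmer collapses "animals" and "animated" onto one index term and turns "several" into "sever", while the lookup keeps them apart and resolves "spoke" differently for the verb and the noun.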

PART OF SPEECH TAGGING

As part of the lemmatization process, statistical modeling determines the correct part of speech, even for ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, part-of-speech tags differ between languages.

Rosette supports the Universal POS Tag standard, from which the developer can map to Penn Treebank or other POS tag systems.
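A mapping from universal tags to Penn Treebank tags might look like the sketch below. It assumes tag names as used in Universal Dependencies (for example `NOUN`, `VERB`, `ADP`); the table `UPOS_TO_PTB` is a partial, illustrative mapping written for this example, not one shipped with Rosette. Because the universal tags are coarser, each one maps to a set of candidate Penn Treebank tags rather than to a single tag.

```python
# Partial, illustrative mapping from universal POS tags to Penn Treebank tags.
# The table is an assumption for this example and is deliberately incomplete.

UPOS_TO_PTB = {
    "NOUN": ("NN", "NNS"),
    "PROPN": ("NNP", "NNPS"),
    "VERB": ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"),
    "ADJ": ("JJ", "JJR", "JJS"),
    "ADV": ("RB", "RBR", "RBS"),
    "ADP": ("IN",),
    "DET": ("DT",),
    "CCONJ": ("CC",),
    "NUM": ("CD",),
}


def ptb_candidates(upos):
    """Return the Penn Treebank tags a universal tag may correspond to."""
    return UPOS_TO_PTB.get(upos.upper(), ())


def to_universal(ptb_tag):
    """Map a Penn Treebank tag back to its universal tag, if it is in the table."""
    for upos, ptb_tags in UPOS_TO_PTB.items():
        if ptb_tag in ptb_tags:
            return upos
    return None


print(ptb_candidates("NOUN"))
print(to_universal("VBD"))
```

Choosing one fine-grained tag from the candidate set needs extra information, such as number or tense, that the coarse tag alone does not carry.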


MIDDLE EAST
- Arabic
- Hebrew
- Pashto
- Persian
- Urdu

ASIA
- Chinese, Simplified
- Chinese, Traditional
- Indonesian
- Japanese
- Korean
- Malay*
- Thai

Example: German
Samstagmorgen is a compound word formed with Samstag (Saturday) and morgen (morning). Decompounding allows for an appropriate match when searching for "Samstag".

Example: English
Linguistic analysis is useful for every language; lemmatization for English improves recall and precision.

NOUN PHRASE EXTRACTION

Certain nouns, especially proper names, can be very tricky to identify as a single entity. RBL groups nouns with their modifiers, which is useful in document clustering and concept extraction.
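One simple way to group nouns with their modifiers, given part-of-speech tags, is a pattern of optional adjectives followed by one or more nouns. The sketch below applies that pattern to a hand-tagged sentence. The function `noun_phrases`, the tag names, and the pattern itself are assumptions for illustration; they are not a description of RBL's extractor.

```python
# Toy noun phrase chunker: adjectives followed by one or more nouns.
# The tagged sentence is hand-written input, not the output of a real tagger.

TAGGED = [
    ("Improve", "Verb"), ("the", "Determiner"), ("speed", "Noun"),
    ("and", "Conjunction"), ("accuracy", "Noun"), ("of", "Preposition"),
    ("your", "Determiner"), ("search", "Noun"), ("application", "Noun"),
    ("with", "Preposition"), ("advanced", "Adjective"),
    ("linguistic", "Adjective"), ("analysis", "Noun"), (".", "Punctuation"),
]


def noun_phrases(tagged):
    """Collect runs matching Adjective* Noun+ from (token, tag) pairs."""
    phrases = []
    current = []
    has_noun = False
    for token, tag in tagged:
        if tag == "Noun":
            current.append(token)
            has_noun = True
        elif tag == "Adjective" and not has_noun:
            current.append(token)
        else:
            # Any other tag closes the current run; keep it only if it has a noun.
            if has_noun:
                phrases.append(" ".join(current))
            current = [token] if tag == "Adjective" else []
            has_noun = False
    if has_noun:
        phrases.append(" ".join(current))
    return phrases


print(noun_phrases(TAGGED))
```

On this sentence the pattern keeps "search application" and "advanced linguistic analysis" together as single units instead of indexing their words separately.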

SENTENCE DETECTION

The start and end of each sentence are automatically identified, even where punctuation use is ambiguous.
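The ambiguity usually comes from periods that do not end a sentence, as in abbreviations. A minimal rule-based sketch, assuming a small hypothetical abbreviation list (`ABBREVIATIONS`) and English capitalization conventions, shows the kind of decision involved. It is far simpler than a production sentence detector and is not RBL's method.

```python
# Toy sentence splitter: break after ., ! or ? when followed by whitespace and
# an uppercase letter, unless the preceding word is a known abbreviation.
# ABBREVIATIONS is a hypothetical, deliberately short list.
import re

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "st.", "e.g.", "i.e."}
BOUNDARY = re.compile(r"[.!?]+(?=\s+[A-Z])")


def split_sentences(text):
    """Return the sentences found in text as a list of stripped strings."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr." followed by a name is not a boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


TEXT = "Dr. Lee moved to Montgomery St. in 2015. She works on search."
for sentence in split_sentences(TEXT):
    print(sentence)
```

The period after "Dr." is skipped because of the abbreviation list, and the one after "St." is skipped because a lowercase word follows, so only the period after "2015" is treated as a boundary.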

CHALLENGE                                      QUERY               STEM    LEMMA
Two unrelated words may share a stem.          animals, animated   anim    animal, animate
Stemming may deliver unintended results.       several             sever   several
Irregular verbs and nouns stump the stemmer.   spoke               spoke   speak (v.), spoke (n.)

WEST COAST
1700 Montgomery St.
San Francisco, CA 94111

FEDERAL
2553 Dulles View Dr.
Suite 450
Herndon, VA 20171

HEADQUARTERS
One Alewife Center
Cambridge, MA 02140

EUROPE
Furzeground Way
Middlesex UB11 1BD, UK

ASIA
9-6 Nibancho, Chiyoda-ku
Tokyo 102-0084, Japan


Example: Chinese
Consider the problem of indexing “Beijing University Biology Department” (北京大学生物系) and a subsequent search for “student” (学生):

INDEX

BIGRAMMING: Beijing / (non-word) / University / (Student) / Biology / (non-word)

RBL MORPHOLOGICAL TOKENIZATION: Beijing / University / Biology / Dept.

SEARCH for “Student”

BIGRAMMING: Incorrectly hits “Beijing University Biology Department”

RBL MORPHOLOGICAL TOKENIZATION: Correctly misses “Beijing University Biology Department”

* Limited Support

© 2015 Basis Technology Corporation. “Basis Technology Corporation”, “Rosette”, and “Highlight” are registered trademarks of Basis Technology Corporation. “Big Text Analytics” is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-29-RBL)

