Every language, including English, presents unique challenges that make it difficult for search applications to deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to effectively search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-in-class natural language processing, improving both speed and accuracy.
As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.
40 Supported Languages
Search many languages with high accuracy
KEY FEATURES
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Component of the Rosette SDK
- Customizable features such as user dictionaries, orthographic normalization, and script conversion
- Built to work with Apache Solr™ and Elasticsearch
- Cloudera certified partner
Start using RBL today. Try our free product evaluation.
www.basistech.com
[email protected]
+1 617-386-2090
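Improve the speed and accuracy of your search application with advanced linguistic analysis.
Improve (Verb) the (Determiner) speed (Noun) and (Conjunction) accuracy (Noun) of (Preposition) your (Determiner) search (Noun) application (Noun) with (Preposition) advanced (Adjective) linguistic (Adjective) analysis (Noun) . (Punctuation)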
Rosette®
BIG TEXT ANALYTICS
- RLI, Rosette Language Identifier: Identify languages and encodings (Sorted Languages)
- RBL, Rosette Base Linguistics: Search many languages with high accuracy (Better Search)
- REX, Rosette Entity Extractor: Tag names of people, places, and organizations (Tagged Entities)
- RNI, Rosette Name Indexer: Match names between many variations (Matched Names)
- RNT, Rosette Name Translator: Translate foreign names into English (Translated Names)
- RCA, Rosette Categorizer: Categorize everything in sight (Sorted Content)
- RSA, Rosette Sentiment Analyzer: Detect the sentiments of your text (Actionable Insights)
- RES, Rosette Entity Resolver: Make real-world connections in your data (Real Identities)
TOKENIZATION
Many search tools use bigrams to understand languages written without spaces between words. This results in a larger index size and a reduction in relevancy. RBL, in contrast, accurately identifies and separates each word through advanced statistical modeling. The resulting token output (also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.
DECOMPOUNDING
RBL breaks down compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.
Available Languages
WESTERN EUROPE
- Catalan*
- Czech
- Danish
- Dutch
- English
- Finnish*
- French
- German
- Greek
- Italian
- Norwegian
- Portuguese
- Spanish
- Swedish
EASTERN EUROPE
- Albanian*
- Bulgarian*
- Croatian*
- Estonian*
- Hungarian
- Latvian*
- Polish
- Romanian
- Russian
- Serbian*
- Slovak*
- Slovenian*
- Turkish
- Ukrainian*
Advanced Morphological Features
LEMMATIZATION
Most search engines utilize a crude method of chopping off characters at the end of a word in the hopes of removing unimportant differences. This method, called stemming, often results in extra recall and poor precision. Instead, RBL finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the root form increases search relevancy and slims the search index by not indexing all inflected forms. Alternative lemmas are also made available to supplement indexing.
PART OF SPEECH TAGGING
As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even for ambiguous words.
Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, part-of-speech tags differ from language to language.
Rosette supports the Universal POS tag standard, which developers can map to Penn Treebank or other POS tag systems.
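As a rough illustration of such a mapping, the sketch below relates a handful of Universal POS tags to the Penn Treebank tags they most commonly correspond to. It is not part of the RBL API: the table `UNIVERSAL_TO_PENN` and the function `penn_candidates` are hypothetical names, the tag names follow the current Universal Dependencies inventory, and the table is deliberately partial. Because the Universal tags are coarser than the Penn tags, each one maps to several candidates rather than to a single tag.

```python
# Partial, illustrative table: coarse Universal POS tag -> Penn Treebank tags
# it commonly covers. Hypothetical helper, not an RBL data structure.
UNIVERSAL_TO_PENN = {
    "NOUN": ("NN", "NNS"),
    "PROPN": ("NNP", "NNPS"),
    "VERB": ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"),
    "ADJ": ("JJ", "JJR", "JJS"),
    "ADV": ("RB", "RBR", "RBS", "WRB"),
    "ADP": ("IN",),
    "DET": ("DT", "PDT", "WDT"),
    "CCONJ": ("CC",),
    "NUM": ("CD",),
}


def penn_candidates(universal_tag):
    """Return the Penn Treebank tags a Universal tag may correspond to.

    Unknown tags yield an empty tuple so callers can decide how to handle them.
    """
    return UNIVERSAL_TO_PENN.get(universal_tag.upper(), ())


if __name__ == "__main__":
    for token, tag in [("speed", "NOUN"), ("of", "ADP"), ("advanced", "ADJ")]:
        print(token, tag, penn_candidates(tag))
```

Choosing one Penn tag from the candidates (for example NN versus NNS) would need extra information such as number or tense, which this sketch does not model.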
MIDDLE EAST
- Arabic
- Hebrew
- Pashto
- Persian
- Urdu
ASIA
- Chinese, Simplified
- Chinese, Traditional
- Indonesian
- Japanese
- Korean
- Malay*
- Thai
Example: German
Samstagmorgen is a compound word formed from Samstag (Saturday) and Morgen (morning). Decompounding allows for an appropriate match when searching for "Samstag".
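The Samstagmorgen case can be sketched with a small dictionary-based splitter. This is only an illustration of the idea of decompounding, assuming a tiny hand-written word list; it does not reflect how RBL analyzes compounds, and the names `decompound` and `LEXICON` are hypothetical.

```python
def decompound(word, lexicon, min_part=3):
    """Split a word into lexicon entries, preferring the longest leading part.

    Returns the lowercased word unchanged (as a one-element list) when it
    cannot be fully covered by lexicon entries.
    """
    lowered = word.lower()

    def split(rest):
        if not rest:
            return []  # everything consumed: a complete split was found
        # Try the longest possible head first, then shorter ones.
        for end in range(len(rest), min_part - 1, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no lexicon entry fits here

    parts = split(lowered)
    return parts if parts else [lowered]


# Hypothetical two-word lexicon, just enough for the example.
LEXICON = {"samstag", "morgen"}

if __name__ == "__main__":
    parts = decompound("Samstagmorgen", LEXICON)
    # Index the compound itself plus its parts so a query for "Samstag" matches.
    index_terms = ["samstagmorgen"] + parts
    print(index_terms)
```

Indexing both the full compound and its parts keeps exact matches on the compound while letting a query for either component find the document.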
Example: English
Linguistic analysis is useful for every language; lemmatization for English improves recall and precision.
NOUN PHRASE EXTRACTION
Certain nouns, especially proper names, can be very tricky to identify as a single entity. RBL groups nouns with their modifiers, which is useful in document clustering and concept extraction.
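One simple way to picture the grouping of nouns and modifiers is a chunker that runs over part-of-speech tags and collects each run of adjectives followed by nouns. The sketch below assumes tokens are already tagged with the labels "Adjective" and "Noun"; the function name `noun_phrases` and the rule itself are illustrative assumptions, not RBL's method.

```python
def noun_phrases(tagged):
    """Collect maximal runs of the form Adjective* Noun+ from (word, tag) pairs."""
    phrases, current, has_noun = [], [], False
    # A sentinel entry flushes the final phrase at the end of the input.
    for word, tag in list(tagged) + [("", "END")]:
        if tag == "Noun" or (tag == "Adjective" and not has_noun):
            current.append(word)
            has_noun = has_noun or tag == "Noun"
        else:
            if has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
            if tag == "Adjective":
                current.append(word)  # an adjective after a noun starts a new run
    return phrases


TAGGED = [
    ("Improve", "Verb"), ("the", "Determiner"), ("speed", "Noun"),
    ("and", "Conjunction"), ("accuracy", "Noun"), ("of", "Preposition"),
    ("your", "Determiner"), ("search", "Noun"), ("application", "Noun"),
    ("with", "Preposition"), ("advanced", "Adjective"),
    ("linguistic", "Adjective"), ("analysis", "Noun"), (".", "Punctuation"),
]

if __name__ == "__main__":
    print(noun_phrases(TAGGED))
```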
SENTENCE DETECTION
The start and end of each sentence are identified automatically, even where punctuation use is ambiguous.
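To see why punctuation is ambiguous, consider a period that ends an abbreviation rather than a sentence. The sketch below is a deliberately simple rule-based splitter, assuming English text and a small hand-picked abbreviation list; `split_sentences` and `ABBREVIATIONS` are hypothetical names, and a production detector would rely on far richer evidence.

```python
import re

# Small, illustrative list of abbreviations whose period does not end a sentence.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "st.", "e.g.", "i.e."}

# Candidate boundary: closing punctuation followed by whitespace and a capital
# letter, or by the end of the text.
BOUNDARY = re.compile(r"[.!?]+(?=\s+[A-Z]|\s*$)")


def split_sentences(text):
    """Split text at candidate boundaries that do not follow an abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr." before a name is not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith moved to 1700 Montgomery St. in May. He likes it there!"
    for sentence in split_sentences(sample):
        print(sentence)
```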
CHALLENGE                                      QUERY               STEM    LEMMA
Two unrelated words may share a stem.          animals, animated   anim    animal, animate
Stemming may deliver unintended results.       several             sever   several
Irregular verbs and nouns stump the stemmer.   spoke               spoke   speak (v.), spoke (n.)
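The contrast in the table can be reproduced with a toy example. The suffix rules and the lemma dictionary below are hypothetical and cover only the words in the table; they are meant to show the difference between chopping suffixes and looking up a dictionary form by part of speech, not to stand in for a real stemmer or for RBL's morphological analysis.

```python
# Toy suffix-stripping rules, checked in order (longest first).
SUFFIXES = ("ated", "als", "al", "s")


def toy_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Tiny illustrative lemma dictionary keyed by (surface form, part of speech).
LEMMAS = {
    ("animals", "NOUN"): "animal",
    ("animated", "VERB"): "animate",
    ("several", "ADJ"): "several",
    ("spoke", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",
}


def lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself if unknown."""
    return LEMMAS.get((word, pos), word)


if __name__ == "__main__":
    for word, pos in LEMMAS:
        print(f"{word:10} {pos:5} stem={toy_stem(word):6} lemma={lemmatize(word, pos)}")
```

The stemmer collapses "animals" and "animated" into the same index term and turns "several" into "sever", while the part-of-speech-aware lookup keeps them apart and resolves "spoke" differently as a verb and as a noun.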
WEST COAST
1700 Montgomery St.
San Francisco, CA 94111
FEDERAL
2553 Dulles View Dr.
Suite 450
Herndon, VA 20171
HEADQUARTERS
One Alewife Center
Cambridge, MA 02140
EUROPE
Furzeground Way
Middlesex UB11 1BD, UK
ASIA
9-6 Nibancho, Chiyoda-ku
Tokyo 102-0084, Japan
Example: Chinese
Consider the problem of indexing “Beijing University Biology Department” and a subsequent search for “student”:
北京大学生物系 (Beijing University Biology Department; characters numbered 1 to 7)
学生 (Student)
INDEX
BIGRAMMING
- 1-2 北京: Beijing
- 2-3 京大: (non-word)
- 3-4 大学: University
- 4-5 学生: (Student)
- 5-6 生物: Biology
- 6-7 物系: (non-word)
RBL MORPHOLOGICAL TOKENIZATION
- 1-4 北京大学: Beijing University
- 5-7 生物系: Biology Department
SEARCH
- BIGRAMMING: “Student” incorrectly hits “Beijing University Biology Department”
- RBL MORPHOLOGICAL TOKENIZATION: “Student” correctly misses “Beijing University Biology Department”
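The Chinese example can be reproduced in a few lines. The sketch below compares a bigram index with a word index built by greedy longest-match segmentation against a small word list. Longest-match is used here only as a simple stand-in; RBL is described as using statistical modeling, which this sketch does not attempt. The names `bigrams`, `segment`, and `LEXICON` are hypothetical.

```python
def bigrams(text):
    """Return every overlapping two-character window of the text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def segment(text, lexicon):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    max_len = max(len(word) for word in lexicon)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical word list, just large enough for the example.
LEXICON = {"北京大学", "北京", "大学", "学生", "生物系", "生物"}
DOC = "北京大学生物系"   # Beijing University Biology Department
QUERY = "学生"           # student

bigram_index = set(bigrams(DOC))
word_index = set(segment(DOC, LEXICON))

if __name__ == "__main__":
    print("bigram terms:", bigrams(DOC))
    print("word terms:  ", segment(DOC, LEXICON))
    print("bigram index matches query:", QUERY in bigram_index)
    print("word index matches query:  ", QUERY in word_index)
```

The bigram index holds six terms, two of which are non-words and one of which is the spurious "student"; the word index holds two terms and does not match the query.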
* Limited Support
© 2015 Basis Technology Corporation. “Basis Technology Corporation”, “Rosette”, and “Highlight” are registered trademarks of Basis Technology Corporation. “Big Text Analytics” is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-29-RBL)