Text Analytics with Ambiverse


Text to Knowledge

www.ambiverse.com

Version 1.2, November 2016


Contents

1 Ambiverse: Text to Knowledge

1.1 Text is all Around

1.2 Ambiverse: Leading research to industry

1.3 Text to Knowledge

2 Named Entity Disambiguation

2.1 What is it?

2.2 Why is it Important?

2.3 Why is it Challenging?

2.4 Ambiverse Gives Meaning to Text

2.5 Ambiverse & YAGO, a Powerful Combination

2.6 Integrating Domain-specific Knowledge

2.7 Ambiverse Text Analytics in Facts

3 Applications

3.1 Ambiverse Search

3.2 Ambiverse Analyze

3.3 Ambiverse Write

3.4 Personalized Text Analytics

1. Ambiverse: Text to Knowledge

1.1 Text is all Around

Most of the information produced by persons, organizations, and public institutions is in the form of text. In 2014, 300 million new websites were created.[1] Every year, 2 million blog posts are written,[2] thousands of news sites around the globe publish articles, and millions of new updates in social networks are generated. In fact, most human interaction is performed via unstructured data (e.g., articles, reports, social network posts, ads, comments, reviews, etc.). Companies and public institutions also tend to produce, on a regular basis, large quantities of internal documents.

This vast amount of text goes beyond what is commonly understood as “big data”. Textual information is not easy to interpret because it lacks a well-defined structure. To make use of it, machines need certain “text understanding” capabilities so that these huge collections of documents can be computationally analyzed and transformed into useful data. It is increasingly understood that text analytics gives significant leverage to companies, persons, and public institutions.

The text analytics market is expected to grow at an average rate of 25% per year.[3] In 2013, only 1% of companies were processing their textual information; by 2021, 65% are expected to do so (Figure 1.1).[4] In domains such as news, advertising, finance, and insurance, among others, companies are starting to make sense of their textual data as a means of adding value to their businesses.

[1] http://www.internetlivestats.com/total-number-of-websites/

[2] http://www.digitalbuzzblog.com/wp-content/uploads/2012/03/A-Day-In-The-Internet.jpg

[3] http://www.digitalreasoning.com/resources/Text-Analytics-2014-Digital-Reasoning.pdf

[4] http://www.federalnewsradio.com/wp-content/uploads/pdfs/031115_gartner_co_branded_newsletter_turning_dark_data_into_smart_data.pdf


[Bar chart. Y-axis: % of companies using text analytics. Values: 2013: 1%, 2016: 25%, 2021: 65%]

Figure 1.1: The use of text analytics will increase dramatically in the coming years

1.2 Ambiverse: Leading research to industry

Ambiverse, a spin-off of the Max Planck Institute for Informatics, joins the new world of text analytics. Ambiverse develops technology to automatically understand, analyze, and manage big collections of textual data. Ambiverse is built on years of state-of-the-art research in text analytics. In 2015, Ambiverse received an EXIST Transfer of Research grant from the German Federal Ministry for Economic Affairs and the European Union.

1.3 Text to Knowledge

Our technology is focused on the recognition and disambiguation of named entities in text. It relies on years of experience in scientific developments by the Max Planck Institute for Informatics, a world leading institution in automatic text understanding. Our technology for named entity disambiguation was named the best named entity disambiguation system by IBM[5] and our corresponding scientific publications are among the most cited in the international automatic text understanding community.[6][7]

This cutting edge technology gives Ambiverse an advantage in the text analytics world, allowing the development of a new generation of text analytics tools to transform textual information into machine-understandable knowledge.

[5] D. A. Ferrucci (2012). Introduction to ‘This is Watson’. IBM Journal of Research and Development.

[6] J. Hoffart et al. (2011). Robust Disambiguation of Named Entities in Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

[7] J. Hoffart et al. (2013). YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence.

2. Named Entity Disambiguation

2.1 What is it?

A named entity, or simply entity, is a real-world object such as a person, an organization, a location, or a product. Named entity disambiguation is the task of automatically recognizing the names of these objects in text and identifying the real-world object each name refers to. For instance, in the sentence “Page played the hit Kashmir on his uniquely tuned Les Paul”, our disambiguation system recognizes that the mention “Page” refers to the famous rock guitarist Jimmy Page and not to Larry Page, founder of Google, and that “Les Paul” refers to the guitar and not its designer (see Figure 2.1).

Figure 2.1: Selecting the correct entity for each mention: Jimmy Page, the song Kashmir, and a Les Paul guitar


2.2 Why is it Important?

Ambiguous entities are all around us. The variety of names is much smaller than one may think; there are more entities than names. Places are named after people, and people after people. Places also tend to share names, as do people and products. In this context, knowing which real-world object a name refers to produces significant gains in text understanding capabilities.

If one wants to select or analyze documents mentioning the city of Paris in France, one first has to make sure that the mentions of “Paris” refer to the entity of interest and not, for instance, to the city of Paris in Texas. To search efficiently for information about Larry Page, one has to exclude documents about Jimmy Page, another famous “Page”. Likewise, if companies want to analyze customer opinions about cars, they need to know whether a tweet refers to the Jeep Wrangler or to Wrangler jeans (“I bought a Wrangler, and it is very comfortable”, “I sell my brand new Wranglers”, Figure 2.2).

Knowing the correct meaning of a name makes it possible to analyze and search large text collections more efficiently. Ambiverse developed a state-of-the-art technology to disambiguate entities and a set of applications around it for smart text analytics.

Image from flickr (zombieite) - CC-BY 2.0

Figure 2.2: Ambiverse Text Analytics helps to identify the real enthusiastic fans.

2.3 Why is it Challenging?

Named entity mentions can be very ambiguous. The name “Page” alone can refer to hundreds of entities; for more ambiguous names like “John”, the potential candidates likely number in the thousands.

A machine needs to resolve the meanings of all names in a single text while ensuring coherence among the entities (e.g., it is reasonable that “Paris” and “France” are simultaneously assigned to the French capital and the European country). Naive approaches that simply enumerate all possible combinations quickly hit a brick wall. Even for a single sentence with three or four moderately ambiguous names, the number of combinations exceeds 100,000. For full documents, enumeration becomes infeasible for even the fastest machines. Solving such a problem requires smart technologies such as the one we provide in Ambiverse Text Analytics.


[Diagram: “Page played the hit Kashmir on his uniquely tuned Les Paul.” Candidate counts: Page 500 x Kashmir 50 x Les Paul 5 = 125,000 possible candidate combinations]

Figure 2.3: There are 500 possible “Pages”, 50 possible “Kashmirs”, and 5 possible “Les Pauls”, leading to 125,000 possible entity combinations.
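The arithmetic behind Figure 2.3 can be checked with a few lines of Python. This is only a sketch of the counting argument, not of the disambiguation method itself; the function name `joint_assignments` is made up for this illustration, and the 20-mention document is a hypothetical case added to show how quickly the number grows.

```python
from math import prod

# Candidate counts per mention, taken from Figure 2.3.
candidates = {"Page": 500, "Kashmir": 50, "Les Paul": 5}


def joint_assignments(counts):
    """Number of ways to pick exactly one candidate entity for every mention."""
    return prod(counts.values())


sentence_total = joint_assignments(candidates)

# Hypothetical full document: 20 mentions with 50 candidates each (assumed numbers).
document = {f"mention_{i}": 50 for i in range(20)}
document_total = joint_assignments(document)

print(sentence_total)
print(f"{document_total:.2e}")
```

Because the count is a product over all mentions, every additional ambiguous name multiplies the search space, which is why plain enumeration stops being an option for full documents.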

2.4 Ambiverse Gives Meaning to Text

Ambiverse Text Analytics opens up a wide range of possibilities to manage and understand big text collections. Its main characteristic is the capability to understand the meaning of the objects, detaching them from their textual representations. For instance, in the sentences “Page played Kashmir.”, “Jimmy rocked the show at Knebworth!” and “James Patrick Page is one of the greatest guitarists of all time.”, Ambiverse Text Analytics understands that “Jimmy”, “Page”, and “James Patrick Page” all refer to the same person (Figure 2.4). It understands real world concepts in text regardless of how they are actually mentioned. This allows Ambiverse to develop a set of applications around the named entity disambiguation technology, changing the way in which text is stored, searched, analyzed and produced.

[Diagram: the sentences “Page played Kashmir.”, “Jimmy rocked the show at Knebworth!”, and “James Patrick Page is one of the greatest guitarists of all time.”, all linked to the entity Jimmy Page]

Figure 2.4: Ambiverse Text Analytics understands that all sentences refer to the same Jimmy Page.
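One way to picture the result is as a list of annotations in which different surface forms carry the same entity identifier. The sketch below assumes such an output format; the tuple layout and the identifier "Jimmy_Page" are illustrative assumptions, not the actual Ambiverse output or a real YAGO identifier.

```python
from collections import defaultdict

# Hypothetical disambiguation output: (sentence, mention text, entity id).
# The id "Jimmy_Page" is an illustrative placeholder.
annotations = [
    ("Page played Kashmir.", "Page", "Jimmy_Page"),
    ("Jimmy rocked the show at Knebworth!", "Jimmy", "Jimmy_Page"),
    (
        "James Patrick Page is one of the greatest guitarists of all time.",
        "James Patrick Page",
        "Jimmy_Page",
    ),
]


def surface_forms_by_entity(rows):
    """Group the different mention strings under the entity they refer to."""
    forms = defaultdict(set)
    for _sentence, mention, entity in rows:
        forms[entity].add(mention)
    return dict(forms)


forms = surface_forms_by_entity(annotations)
print(sorted(forms["Jimmy_Page"]))
```

Once text is keyed by entity rather than by string, storage, search, and analysis can all operate on the identifier and ignore how the entity happened to be written.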

2.5 Ambiverse & YAGO, a Powerful Combination

All entities such as Jimmy Page, Larry Page, Les Paul (the person), and the guitar named after him are present in our YAGO knowledge graph [Hof+13]. YAGO, which is derived from Wikipedia, can be thought of as a very large collection of entities.

YAGO also contains accurate characterizations of all entities. It knows that Larry Page is a computer scientist, a corporate director, and a billionaire, that Google is a U.S. company, and that Jimmy Page is a guitarist and a musician. These characteristics of the entities are called categories or classes and are the key to developing useful applications around named entity disambiguation technology. An example of YAGO is shown in Figure 2.5.

[Diagram of a knowledge graph with a layer of classes and a layer of entities. Labels visible in the figure: type, subclass, created, plays, played at, was played at, happened in, musician, song, artifact, guitar, 1975]

Figure 2.5: Example of the knowledge stored in YAGO: The entities, their classes, and the relations between them.
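The structure in Figure 2.5 can be approximated as a list of subject-relation-object triples. The following is a minimal sketch under that assumption: the identifiers, the relation names "type" and "subclass" as strings, and the assertion that guitarist is a subclass of musician are all illustrative choices, not the real YAGO schema or data.

```python
# Illustrative triples (subject, relation, object); all names are assumptions.
TRIPLES = [
    ("Jimmy_Page", "type", "guitarist"),
    ("guitarist", "subclass", "musician"),
    ("Larry_Page", "type", "computer_scientist"),
    ("Kashmir_(song)", "type", "song"),
    ("Jimmy_Page", "created", "Kashmir_(song)"),
]


def classes_of(entity, triples):
    """Return the direct classes of an entity plus all of their superclasses."""
    found = set()
    frontier = [o for s, p, o in triples if s == entity and p == "type"]
    while frontier:
        cls = frontier.pop()
        if cls in found:
            continue  # already visited, avoids loops in the hierarchy
        found.add(cls)
        frontier.extend(o for s, p, o in triples if s == cls and p == "subclass")
    return found


print(sorted(classes_of("Jimmy_Page", TRIPLES)))
```

Following "subclass" edges upward is what lets a query for a general class also reach entities that are only tagged with a more specific one.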

2.6 Integrating Domain-specific Knowledge

The flexible architecture of Ambiverse Text Analytics allows the use of additional domain-specific entities. Other knowledge graphs (e.g., a company-specific knowledge graph or a product catalog) can be easily integrated into our system, or a user can concentrate on a specific slice of YAGO. This enables companies to focus on the entities of importance to them, like their products or customers. Ambiverse Text Analytics can thus be fully customized to the specific needs of our customers.
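Conceptually, both kinds of customization act on the set of candidate entities that a name may refer to. The sketch below illustrates that idea with plain dictionaries; the data layout, the catalog identifiers, and the function names are hypothetical and do not describe the actual integration interface.

```python
# Hypothetical name dictionaries: surface name -> set of candidate entity ids.
base_dictionary = {
    "Wrangler": {"Jeep_Wrangler", "Wrangler_(jeans)"},
    "Page": {"Jimmy_Page", "Larry_Page"},
}

# Hypothetical company catalog with its own, made-up product ids.
product_catalog = {
    "Wrangler": {"catalog:wrangler-sport"},
    "Trailhawk": {"catalog:trailhawk"},
}


def merge_dictionaries(base, extra):
    """Add domain-specific candidates without modifying the base dictionary."""
    merged = {name: set(ids) for name, ids in base.items()}
    for name, ids in extra.items():
        merged.setdefault(name, set()).update(ids)
    return merged


def restrict_to(dictionary, allowed_ids):
    """Keep only candidates from an allowed slice; drop names left empty."""
    sliced = {name: ids & allowed_ids for name, ids in dictionary.items()}
    return {name: ids for name, ids in sliced.items() if ids}


merged = merge_dictionaries(base_dictionary, product_catalog)
cars_only = restrict_to(merged, {"Jeep_Wrangler", "catalog:wrangler-sport"})
print(sorted(merged["Wrangler"]))
print(cars_only)
```

Restricting the candidate set also shrinks the ambiguity of each name, so a focused slice makes the disambiguation problem itself easier.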

2.7 Ambiverse Text Analytics in Facts

2.7.1 Performance

The following numbers correspond to average-length news articles processed on a compute instance with 16 CPU cores and 32 GB of memory.

• Documents per hour with high accuracy: 20,000

• Documents per hour with highest accuracy: 6,000

The exact accuracy depends on the nature of the documents. An experimental evaluation on a large set of newswire documents [Hof+11] showed 80% accuracy for the high accuracy setting and 83% accuracy for the highest accuracy setting.
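To relate the two throughput figures above to a concrete workload, the small calculation below converts them into seconds per document and into total processing time. The collection size of one million documents is an assumed example, and the estimate presumes the same hardware and average-length news articles.

```python
# Throughput figures from the performance list above (documents per hour).
DOCS_PER_HOUR = {"high": 20_000, "highest": 6_000}


def seconds_per_document(setting):
    """Average wall-clock seconds spent on one document."""
    return 3600 / DOCS_PER_HOUR[setting]


def hours_for(n_documents, setting):
    """Hours needed to process a collection on a single such instance."""
    return n_documents / DOCS_PER_HOUR[setting]


for setting in DOCS_PER_HOUR:
    print(
        setting,
        round(seconds_per_document(setting), 2),
        round(hours_for(1_000_000, setting), 1),  # assumed collection size
    )
```

The highest accuracy setting costs roughly 3.3 times the compute of the high accuracy setting for a gain of three percentage points in the reported evaluation, which is the trade-off to weigh for large collections.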


2.7.2 Languages

We currently support English, Spanish, Chinese, and German.

2.7.3 Knowledge Graph

A brief comparison of the size of YAGO and other prominent openly available knowledge graphs shows that YAGO is among the most comprehensive and precise ones. YAGO’s distinct advantages are the clear semantic modelling of entities and especially the specific class hierarchy, ranging from very general categories like “person” to highly specific ones like “British rhythm and blues boom musicians”. Also, YAGO is the only knowledge graph that has been evaluated in terms of accuracy [Hof+13].

Knowledge graph                  Entities       Classes        Accuracy
English YAGO3                    3.5 million    550 thousand   > 95%
Combined YAGO3 (10 languages)    4.6 million    570 thousand   > 95%
English DBpedia                  4.8 million    735            not evaluated
Combined DBpedia                 38.3 million   735            not evaluated

Table 2.1: Facts about the YAGO knowledge graph

! More details about YAGO are available at: http://www.yago-knowledge.org

3. Applications

Ambiverse’s cutting edge text analysis technology allows the development of a whole range of next-generation applications to manage, search, analyze and produce text.

3.1 Ambiverse Search

3.1.1 Searching for Entities

Traditional search engines take words or phrases as input and return a set of documents in which these words or phrases are likely to be relevant. They have a limited understanding of the user’s intent because they do not give meaning to the input words; they only understand their form. For instance, they cannot tell whether the input word “Paris” refers to the city in France, to Paris Hilton, or to the character from Greek mythology. Searching for “Paris” in a regular search engine returns documents where the word “Paris” appears, without distinguishing which Paris is meant. Documents referring to the city of Paris in France will probably be ranked at the top, since it is the most popular entity. Users searching for less common “Paris” references have to refine their input (e.g., “Paris Greece Troy”), which forces them to express their intention by adding extra knowledge, which they may not have, to the input.

However, if the documents are first processed via Ambiverse Text Analytics (meaning that all entities in all documents have been previously identified), the user can search for the entities themselves, independently of how they are mentioned in the text and without any additional background knowledge. The user’s intent is fully described by the input entity itself. For instance, the user can directly search for Paris Hilton and, no matter how she is referred to (e.g., “Paris”, “Paris Hilton”, “Hilton’s granddaughter”, etc.), all documents in which she is mentioned will be retrieved (and properly ranked). All other documents, in which other “Paris” occurrences appear (Paris, France; the Greek character; Paris, Texas), will be excluded. This type of ambiguity is more common than one may think, and in a regular search engine it results in highly imprecise search results.
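The difference between the two search modes can be sketched with a toy collection. The example assumes documents that already carry entity annotations; the document texts, entity identifiers, and function names are invented for illustration and do not reflect the Ambiverse Search implementation, which also ranks results.

```python
from collections import defaultdict

# Hypothetical pre-annotated documents: text plus (mention, entity id) pairs.
DOCS = {
    "d1": ("Paris presented her new perfume.", [("Paris", "Paris_Hilton")]),
    "d2": ("Paris is the capital of France.", [("Paris", "Paris"), ("France", "France")]),
    "d3": ("Hilton's granddaughter attended the gala.",
           [("Hilton's granddaughter", "Paris_Hilton")]),
    "d4": ("Paris, Texas hosted the county fair.", [("Paris, Texas", "Paris,_Texas")]),
}


def keyword_search(word):
    """String matching: returns every document containing the word."""
    return sorted(doc_id for doc_id, (text, _) in DOCS.items() if word in text)


def build_entity_index(docs):
    """Inverted index from entity id to the documents that mention it."""
    index = defaultdict(set)
    for doc_id, (_text, annotations) in docs.items():
        for _mention, entity in annotations:
            index[entity].add(doc_id)
    return index


ENTITY_INDEX = build_entity_index(DOCS)


def entity_search(entity):
    """Entity matching: returns only documents annotated with that entity."""
    return sorted(ENTITY_INDEX.get(entity, set()))


print(keyword_search("Paris"))
print(entity_search("Paris_Hilton"))
```

In this toy data the keyword query returns the documents about the French capital and the Texan city while missing the document that never uses the word “Paris”, whereas the entity query returns exactly the two documents about Paris Hilton.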

Ambiverse Search gives the user the capability to search for meaning or concepts on huge text collections, reaching more precise results by better interpreting the user’s intent, abstracting meaning from textual forms. Out of the box, we provide search for 4.6 million entities, to which, in addition, customer-specific entities can easily be added (see Section 2.6). Figures 3.1 and 3.2 provide an example of regular and smart search.

Figure 3.1: Searching for the word “Prada” is imprecise due to its ambiguity.

Figure 3.2: Searching for the company Prada gives precise results: Ambiguities have been resolved.

! Contact us for a demonstration of the prototype.


3.1.2 Searching for Categories: the Power of the YAGO Knowledge Graph

As mentioned before, YAGO contains information about the categories of each entity. This allows us to add a new abstraction layer to our search, something impossible in traditional search engines. Instead of searching for a given entity, we can directly search for a category, so that the search covers the whole set of entities in that category.

For instance, we can directly search for fashion labels, and all documents mentioning a fashion label (e.g., Prada, Gucci, Chanel, etc.) will be retrieved. We can also search for documents containing German soccer players (e.g., Schweinsteiger, Thomas Müller, Mesut Özil, etc.), Harvard alumni (e.g., Barack Obama, Ban Ki-Moon, Natalie Portman, Robert Solow, etc.), or any other category available in our knowledge graph. The secret here is that Ambiverse Text Analytics is capable of identifying the entities in the text and our knowledge graph knows the categories of those entities. Our knowledge graph contains more than 570k categories.

Figure 3.3: Searching for the category high fashion brands finds documents on all fashion labels.
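Category search can be thought of as two lookups in sequence: expand the category into its member entities, then retrieve the documents that mention any of them. The sketch below assumes small in-memory tables; the class names, document ids, and annotations are made up for illustration.

```python
# Hypothetical class assignments from the knowledge graph.
ENTITY_CLASSES = {
    "Prada": {"fashion_label", "company"},
    "Gucci": {"fashion_label", "company"},
    "Chanel": {"fashion_label", "company"},
    "Thomas_Müller": {"German_soccer_player", "person"},
}

# Hypothetical result of entity disambiguation: document id -> entities found.
DOC_ENTITIES = {
    "d1": {"Prada"},
    "d2": {"Gucci", "Thomas_Müller"},
    "d3": {"Thomas_Müller"},
    "d4": {"Chanel"},
}


def category_search(category):
    """Return documents mentioning at least one entity of the given category."""
    members = {e for e, classes in ENTITY_CLASSES.items() if category in classes}
    return sorted(d for d, entities in DOC_ENTITIES.items() if entities & members)


print(category_search("fashion_label"))
print(category_search("German_soccer_player"))
```

The query never mentions Prada, Gucci, or Chanel by name; the knowledge graph supplies the members, and the entity annotations supply the documents.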

3.2 Ambiverse Analyze

Understanding entities in text enables a whole new range of text analytics tools. For instance, one can visualize the correlation over time between two companies or even the correlation between a company and its sector. Ambiverse Analyze helps you understand how mentions of the fashion label Prada correlate with mentions of all fashion labels (Figure 3.4).



Figure 3.4: Ambiverse Analyze plots the trends of Prada against all other fashion labels.
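A trend comparison of this kind can be reduced to comparing two time series of mention counts. The sketch below uses the Pearson correlation coefficient as one possible measure; the source does not say which statistic Ambiverse Analyze uses, and the weekly counts are invented numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly mention counts derived from entity annotations.
prada_mentions = [12, 15, 30, 28, 14, 11]
all_labels_mentions = [100, 110, 190, 180, 105, 95]

r = pearson(prada_mentions, all_labels_mentions)
print(round(r, 3))
```

The counts are only meaningful because they are taken per entity and per category: a string count of “Prada” would also include unrelated uses of the name.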

3.3 Ambiverse Write

Understanding entities is also a key element in the production of intelligent texts. We developed Ambiverse Write, a smart authoring platform for intelligent text production: while typing, entities are automatically recognized, relevant entities are suggested, and background information is provided to the author on the fly. An author writing about fashion topics will get suggestions about fashion brands or designers, and background information about them, directly while typing.

Figure 3.5: Ambiverse Write allows authors to write texts and link entities at the same time.

Once the writing process has been completed, the text is ready for smart publishing: it is annotated with the correct entities and can be immediately integrated into Ambiverse Search and Analyze. This integration also enables Ambiverse to continuously improve the quality of its technology by incorporating user-specific annotations.


In the example shown in Figure 3.5, authors can get a deeper understanding of the entities they are writing about without ever leaving the editor. Additionally, the links improve the reading experience for all readers, adding value to the article and encouraging readers to stay longer and to use the article as a prominent reference.


3.4 Personalized Text Analytics

Companies or even individual users often have their own knowledge graph or want to customize YAGO (e.g., they may be interested in only a part of it or want to modify some entities or categories). We developed a framework that allows users to add their own entities to their specific knowledge graph, making our disambiguation technology fully customizable for each user or organization. Ambiverse Text Analytics will then focus on the entities of interest to the user or adapt to the setting that the user considers most appropriate.

The tool for augmenting an existing knowledge graph is intuitive and simple to use. Users have several ways to generate their customized knowledge graph without specific knowledge of our technology.


References

[Fer12] David A. Ferrucci. “Introduction to ‘This is Watson’”. In: IBM Journal of Research and Development 56.3.4 (2012), pages 1–15.

[Hof+11] Johannes Hoffart et al. “Robust Disambiguation of Named Entities in Text”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2011, pages 782–792.

[Hof+13] Johannes Hoffart et al. “YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia”. In: Artificial Intelligence 194 (2013), pages 28–61.


Ambiverse GmbH
Campus E1 4
66123 Saarbrücken
Germany

Phone: +49 681 9325-5024
Fax: +49 681 9325-5099
E-Mail: [email protected]
