II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of...

Date post: 22-Jan-2018
Upload: dr-haxel-congress-and-event-management-gmbh
Page 1: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Patrick Beaucamp

Founder of the Vanilla, AklaBox & Data4Citizen Projects

Mail : [email protected]

Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

II-PIC, Bangalore, 2nd November 2017

Page 3: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Presentation Agenda

Open Source Search Engine & Search Platform
• Features expected for Search Platforms (Interface)

Open Source Platform at French Ministry
• Project Context
• Platform Architecture
• WebSite Powered by a Search Engine

Personal Experience of Search – Search Ideas

Page 4: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Do you know Solr?

Page 5: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Part 1 – Search Concepts and Ideas

« Sharing and awakening your mind »

Page 6: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

How many times per day do you Google? (search, maps, translate …)

Tribute to Open Source at II-PIC … thanks Christoph!

Search is the first step: collecting information

Page 7: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching ???

Search is a subject in itself:

• Using a search engine (and being influenced by SEO)
• Registering for news feeds and alerts: « Push Mode »
• « Artificial Intelligence » facts: an algorithm is working for you (Facebook suggestions, Gmail reminders …)

« Minority Report » is here!

Page 8: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Browsing behavior: Before

User behavior analysis for the Sales & Marketing team and the Web Design team

WebSite as a showcase (« vitrine »):

• Which menus and sub-menus are visited?
• Where are the dead branches?

No real « Search Approach »

Page 9: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Browsing behavior: Now

User behavior analysis for the Sales & Marketing team and the Web Design team

WebSite as a Search Interface:

• What are people looking for?
• How are they searching?

Review your SEO

Page 10: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !


Page 11: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

We have all become private investigators at one time or another

Page 12: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !


Page 13: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

Different search engines lead to different results

Page 14: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

Different search engines by country

Page 15: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

Funny word: SEO … it's more « how to be found on the Internet » … and you need to pay for it!

Page 16: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

My personal experience

I tried to find a person for 23 years, roughly from 1993 to 2016

From 1993 to 1998: no search engine available … only a private investigator?

From 1999 to 2015: regular searches – no results

I found this person on Facebook, not on Google

From a browser: « f + tab » … « g + tab », « y + tab » …

Some years: no search; other years: multiple searches

Page 17: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

The person I was looking for published on Facebook using his/her real name – it is his/her decision to be visible or not

Where do we stand with the « Right to be Forgotten »?

Page 18: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

Companies like Facebook have tons of data: they need to provide search infrastructure (indexing + search interface)

I was lucky to try the Facebook search interface

Page 19: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

Discovery of the source of a cholera outbreak – 1854 (John Snow)

http://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak

Page 20: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

Bicycle accidents in the street: who is taking care of traffic management?

Example in Boston: http://www.boston.com/bostonglobe/editorial_opinion/blogs/the_angle/2010/12/bike_crash_map.html

Open Data

Page 21: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Searching … and finding !

LION – 2016 (Garth Davis)

Mistake 1: Ganesh Talai – Mistake 2: Saroo

Page 22: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

« Internal » Searching Strategy

It's easy to add a « search » feature to a WebSite (Drupal hosting)

Companies don't want to live through this again!

You need a strategy for your internal data: it is your digital asset

Page 23: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Part 2 – Search Components

The « Recipe »

Page 24: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

OpenSource LandScape

Crawling

Indexing

Storing

WebSite

Reference

WebSite

Accessibility

Update Management

Search Interface

Result Visualization

Auto Completion

Natural Language

Voice Recognition

Maps

Ads

Unstructured data

Access Management


Page 25: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Search Platform Objectives

Crawling: being able to reach WebSites and content:

• Internal WebSites (Intranet) & External WebSites
• Internal Document Repositories

Indexing: being able to index WebSite content (and page updates)

Storing: being able to store unstructured data

Page 26: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Search Platform Objectives

Search Interface: provide usable search interfaces (semantic search, multi-language search …)

Result Visualization: provide usable search results (auto-classification, visualization)

Don't forget why and what you search:

• You search in existing documents
• You need visualization tools
• It's not a crystal ball: search reflects the past

Page 27: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Nutch

Before indexing your document base, you need to access it!

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Reference: http://nutch.apache.org/
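The "access before indexing" step can be illustrated with a toy breadth-first crawler. This is only a sketch of the general idea, not Nutch's API: the `crawl` function, the `fetch` callable and the in-memory `SITE` dictionary are assumptions made so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Hypothetical in-memory "site" standing in for real HTTP fetches.
SITE = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b": "no links here",
}
pages = crawl("http://example.org/", SITE.get)
```

The returned `pages` mapping is what an indexing step would then consume; a real crawler such as Nutch additionally handles politeness, robots rules, scheduling and re-crawling of updated pages.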

Page 28: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Solr

• What is Solr
– Indexing and search engine
  • Promoted by the Apache Foundation
  • Built on top of Apache Lucene (Java search library)
– Major engine characteristics
  • Scalable, fault tolerant, distributed indexing, dynamic workload balancing, centralized configuration
– Technical environment
  • Java
  • Embedded Jetty server for platform administration

Page 29: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Solr

Main characteristics

Admin Interface

Flexible and scalable configuration

Modular

Multiple index management with a single instance

Page 30: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Solr

Main characteristics

Standard communication interfaces (HTTP, XML, JSON)

Configuration can be done with or without schema

Real-time indexing
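As an illustration of the JSON interface and near-real-time indexing, the sketch below builds (but does not send) an update request for Solr's `/update` handler with the `commitWithin` parameter. The base URL, the collection name `ministry` and the document fields are assumptions for the example, not details of the Ministry platform.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(base_url, collection, docs, commit_within_ms=1000):
    """Build a JSON POST request that adds docs to a Solr collection."""
    query = urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url}/{collection}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_update_request(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "ministry",                    # hypothetical collection name
    [{"id": "doc-1", "title": "Open data portal"}],
)
```

Against a running Solr instance, the request could be sent with `urllib.request.urlopen(req)`; `commitWithin` asks Solr to make the documents searchable within the given number of milliseconds.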

Page 31: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Solr

Main characteristics

Customizable full-text analysis

Rich document indexing (using Tika)

Page 32: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Solr

Main characteristics

Search by facets and filters

Term suggestion and spelling correction

Geospatial search
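Faceting and filtering are driven by request parameters on Solr's `/select` handler (`q`, `fq`, `facet`, `facet.field`). The following sketch only assembles such a URL; the collection name and the field names `type` and `year` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_facet_query(base_url, collection, text, filters=(), facet_fields=()):
    """Build a /select URL with filter queries and field facets."""
    params = [("q", text), ("wt", "json")]
    params += [("fq", f) for f in filters]  # each fq narrows the result set
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", name) for name in facet_fields]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = build_facet_query(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "ministry",                    # hypothetical collection name
    "biodiversity",
    filters=["type:report"],
    facet_fields=["type", "year"],
)
# Decode the query string again to inspect what would be sent.
params = parse_qs(urlsplit(url).query)
```

In the JSON response, Solr returns the matching documents together with per-field value counts, which is what a faceted navigation panel displays.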

Page 33: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Solr

Solr behavior

Page 34: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Search Engine Basics (1/2)

- Synonyms

- It is possible to extend the search to synonyms if they are listed in a glossary. For example, a search for "TV" also finds articles containing synonyms of "TV".

- Metadata

- Dictionary for the list of searchable keywords
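The synonym mechanism can be sketched as a query-time expansion against a glossary. The glossary content and the `expand_query` helper below are illustrative assumptions; in Solr such lists are conventionally kept in a `synonyms.txt` file and applied by an analysis filter.

```python
# Hypothetical glossary: each group lists terms treated as equivalent.
SYNONYM_GROUPS = [
    {"tv", "television", "telly"},
    {"car", "automobile"},
]


def expand_query(terms, groups=SYNONYM_GROUPS):
    """Return the query terms plus every synonym listed in the glossary."""
    expanded = []
    for term in terms:
        key = term.lower()
        candidates = {key}
        for group in groups:
            if key in group:
                candidates |= group
        for word in sorted(candidates):
            if word not in expanded:  # keep each term only once
                expanded.append(word)
    return expanded


expanded = expand_query(["TV", "news"])
```

A search for "TV news" would then match documents that only mention "television", because the expanded term list is what gets sent to the index.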

Page 35: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Search Engine Basics (2/2)

- Reserved Words, Protected Words

- Indexing usually uses stemming, which reduces words to their root, for example "develop", so that a search for "develop" also finds items containing the word "development". However, stemming sometimes goes wrong and indexes two unrelated words under the same root. It is possible to prevent the stemming of words by listing them in a protwords.txt file.

- StopWords

- Stopwords are words that carry little meaning. A word considered insignificant will be ignored. Note that some words are insignificant in some contexts, while others have meaningful homonyms. For example, the French word "été" can refer to the summer season (rather meaningful) or to the past participle of the verb "to be" (relatively insignificant). Stopwords are listed in a stopwords.txt file.
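A minimal sketch of how stopwords and protected words interact during analysis is given below. The word lists and the suffix-stripping `stem` function are toy assumptions for illustration; they are not Solr's stemmer nor the actual content of the project's stopwords.txt or protwords.txt.

```python
STOPWORDS = {"the", "a", "of", "and"}  # illustrative stopwords.txt content
PROTECTED = {"news"}                   # illustrative protwords.txt content
SUFFIXES = ("ment", "ing", "s")        # toy suffix list, not a real stemmer


def stem(word):
    """Strip one known suffix unless the word is protected."""
    if word in PROTECTED:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, drop stopwords, then stem the remaining tokens."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOPWORDS]


tokens = analyze("The development of the news")
```

Without the protected-word list, the toy stemmer would reduce "news" to "new", which is exactly the kind of unwanted merge of unrelated words that a protected-word list is meant to prevent.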

Page 36: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Search Engine Current Limits

- Multi-language support (this is where commercial search engines still have more to offer customers), even though there is now support for Asian languages (Hindi, Thai, Chinese, …)

- Elision:

- Elisions are a feature of French: words like "le" or "la" are contracted when they are followed by a vowel. Example: "le" + "avion" (aircraft) gives "l'avion" (the aircraft). It is possible to remove these elisions using a lexicon.

- Limits solved over the past 3 years

• Full-text search interface (language with search engine)
• SubQuery support: now OK, starting with Solr 4.7 (we are at v6)
• Scalability (this is where Solr is taking a technical advantage)
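Elision removal with a lexicon can be sketched as follows. The `ELIDED` set is a small illustrative list, not a complete French lexicon, and `strip_elision` is a hypothetical helper rather than Solr's own filter.

```python
# Illustrative lexicon of elided French articles and pronouns (an assumption).
ELIDED = {"l", "d", "j", "m", "t", "s", "n", "c", "qu"}


def strip_elision(token):
    """Remove a leading elided article such as l' or d' from a token."""
    normalized = token.replace("\u2019", "'")  # treat the typographic apostrophe like '
    prefix, sep, rest = normalized.partition("'")
    if sep and rest and prefix.lower() in ELIDED:
        return rest
    return token  # no known elision: leave the token untouched


words = [strip_elision(w) for w in "l'avion d'Air France aujourd'hui".split()]
```

Using a lexicon rather than stripping everything before an apostrophe matters: "aujourd'hui" keeps its apostrophe because "aujourd" is not an elided article.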

Page 37: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Search Interface Expectations (1/3)

- Advanced indexing and querying tools.

- Distributed searching capabilities, to prevent a bottleneck on any particular server.

- Document excerpt (snippet) generation, providing a summary of the search results.

- Relevance ranking, with display of extracts from the documents based on the query.

Page 38: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Search Interface Expectations (2/3)

- Duplicate document detection, including fuzzy near-duplicates

- Rich document parsing and indexing without using database indexing.

- Ranking control: carry out a targeted ranking of individual documents.

- Search grouping by Type / Tag / Category (general pages, documents, images)
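One common way to detect fuzzy near-duplicates is to compare word shingles with the Jaccard similarity. The sketch below uses that technique as an illustration; it is not necessarily the method used in the project, and the shingle size and threshold are arbitrary assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams of the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.5):
    """Flag two texts as near-duplicates above an assumed threshold."""
    return jaccard(a, b) >= threshold


doc_a = "the ministry publishes a new report on water quality"
doc_b = "the ministry publishes a new report on air quality"
doc_c = "solr is built on top of lucene"
```

Here `doc_a` and `doc_b` differ by a single word and share most of their shingles, whereas `doc_c` shares none with either.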

Page 39: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Search Interface Expectations (3/3)

- Multi-criteria support

- Ranking

- Natural language support

- App support (Android, iPad)

Page 40: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Part 3 – A Real Project


Page 41: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry

Initial decisions and guidelines from the Ministry

The new WebSite will be built using Drupal CMS 8.2

The WebSite should be powered by a « Google-like search toolbar »

The WebSite infrastructure should connect with multiple other WebSites

All infrastructure (software) must be Open Source components

Page 42: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry

http://www.developpement-durable.gouv.fr/

https://www.ecologique-solidaire.gouv.fr/

Page 43: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry

http://www.developpement-durable.gouv.fr/

Page 44: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry - Architecture


Page 45: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry - Architecture


Page 46: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry - Technical

Project Steps

Nutch crawler for various WebSites

• Facebook, LinkedIn, Twitter, YouTube …
• Internal WebSites, previous WebSite

Drupal forms for metadata & indexing

• Specific forms for different kinds of documents
• Drupal CMS process to add new content

Drupal 8 module for Solr: custom search, monitoring, reporting

• The existing Drupal Solr module is limited to a single instance of Drupal
• Not possible to use the Solr Admin interface

Page 47: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry - Technical

Additional PHP libraries

Curl: Drupal-Solr communication (HTTP GET, HTTP POST & attached files)

Ssh2: server administration commands

Zookeeper: Drupal-ZooKeeper communication

MemCached: Drupal-Memcached communication

Solarium: Drupal-Solr communication (abstraction layer)

GoogleApi: YouTube content indexing

Paragraph: news and content editing

Piwik: statistics (like Google Analytics)

Page 48: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry – Admin Interface

Drupal 8 add-on to set up the global infrastructure (ZooKeeper, Solr)

Page 49: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry – Admin Interface

Drupal 8 add-on to monitor the global infrastructure: statistics

Page 50: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry - Validation

Project Validation & Deployment

No problems with ZooKeeper, Solr or Nutch

Stress tests for the global platform: initial slowdown with 10,000 simultaneous connections

Sub-project: addressing the Single Point of Failure

Problems with Drupal & MySQL → solution: Memcached
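The slide does not say how Memcached was wired in. A common way to relieve a database under load is the cache-aside pattern, sketched below under that assumption; `TTLCache` is a tiny in-process stand-in for a Memcached client and `slow_database_query` is a hypothetical placeholder for a Drupal/MySQL lookup.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a Memcached client (hypothetical)."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.store[key]  # entry expired: drop it
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)


def cached_lookup(cache, key, load):
    """Cache-aside: try the cache first, fall back to load(key) on a miss."""
    value = cache.get(key)
    if value is None:
        value = load(key)
        cache.set(key, value)
    return value


calls = []


def slow_database_query(key):
    """Placeholder for an expensive database query; records each call."""
    calls.append(key)
    return f"page content for {key}"


cache = TTLCache()
first = cached_lookup(cache, "node/1", slow_database_query)
second = cached_lookup(cache, "node/1", slow_database_query)
```

The second lookup is served from the cache, so the database placeholder is called only once; under many simultaneous connections this is what keeps repeated page requests away from the database.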

Page 51: II-PIC 2017: Custom Open Source Search Engine with Drupal 8 and Solr at French Ministry of Environment

Project at Ministry - Next

Next Steps

Review of WebSite content … new Ministry

New content to be indexed:

• Other WebSites and social content
• New set of documents to be added to the repository
