Creating an Open Source Genealogical Search Engine with Apache Solr

Brooke Schreier Ganz

[email protected]

Twitter: @LeafSeek

www.LeafSeek.com

Hi, I'm Brooke

• I make web stuff for fun, and (sometimes) for profit

• Web Developer at IBM.com and Disney Consumer Products

• Lead Programmer at TMZ.com (yikes, sorry about that)

• Senior Web Producer at Bravo cable TV network and its spin-off websites

• Big dork

• Big genealogy dork

• #BigData dork

Meet Gesher Galicia

• Non-profit 501(c)(3) genealogy society

• Founded in 1993

• Hundreds of members, worldwide

• E-mail discussion group

• New website development in progress (existing website is fugly)

• Needs a search engine…for data

The Old Problem

The New Problem

• Diverse Data Languages

(German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…)

• Diverse Data Types

(births, marriages, deaths, divorces, tax lists, landsmanschaften lists, industrial permit lists, school yearbooks, governmental yearbooks…)

Diverse Data Shapes

Existing solutions

• They're okay...for small numbers of databases, with small amounts of data

– Steve Morse's One-Step Tool Creator

– Roll-your-own solution with PHP and MySQL

• Both get more difficult to manage as data sets increase in number and complexity

In space, no one can hear your data scream

To Sum Up

• There are lots of ways to publish your tree

• …but not so many ways to publish your data

• Surely there must be a way to deal with this?

So I Made A Thing

But “That Thing I Made With The Database And Stuff” was kind of an awkward name, so I called it LeafSeek

This is the part where I show you all the shiny new All Galicia Database

http://search.geshergalicia.org/

Meet Apache Solr

• Highly functional open source search platform

• Based on Apache Lucene (Java)…

• …plus a web wrapper/API

• Not the prettiest or simplest tool

• FREE and open source

Saves Time, and Heartache

Saves Time, and Stomachache

File Structure: Back-End

Welcome to /conf

The Important Stuff

solrconfig.xml

Make sure this part is configured, so you can import data:

How to get your data into Solr

• Step 1: Make a properly-formatted spreadsheet

• Step 2: Save spreadsheet as a .CSV file

• Step 3: Create a MySQL database + table

• Step 4: Import CSV into that new table

• Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT)

• Step 6: Add this table's information to db-data-config.xml
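
Steps 2 through 5 can be sketched in a few lines of Python. This is only an illustration, not the LeafSeek import process: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the table name births, the column names, and the sample rows are all made up. In MySQL the id column would be declared as INT AUTO_INCREMENT PRIMARY KEY instead.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; in practice this is the .CSV saved from the spreadsheet.
SAMPLE_CSV = """Surname,GivenName,Year
Kowalski,Anna,1890
Weiss,Chaim,1875
"""


def load_csv_into_table(conn, csv_text, table):
    """Create `table` with an auto-incrementing integer `id` and load the CSV rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first CSV row holds the column names

    # Every spreadsheet column becomes a text column; `id` is added on top.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})'
    )

    column_list = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
        reader,  # remaining rows, one list of values per record
    )
    conn.commit()
    return header


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
load_csv_into_table(conn, SAMPLE_CSV, "births")
rows = conn.execute("SELECT id, Surname, Year FROM births ORDER BY id").fetchall()
print(rows)
```

Each record gets its id assigned by the database, which is what gives Solr a stable unique key to work with later.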

db-data-config.xml

• Basic XML file that tells Solr how to grab data from your MySQL database(s)

• Add new <dataSource> for new databases

• Add new <entity> for new tables within the databases

• You need to make sure your MySQL connector .jar is installed for this to work
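
For a rough idea of the shape of this file, here is a sketch that generates one with Python. The element and attribute names (dataConfig, dataSource, document, entity, and the JdbcDataSource type) follow Solr's DataImportHandler conventions; the driver class shown is the classic MySQL Connector/J name and may differ for newer connector versions. The database name, login, and table names are placeholders, and the real LeafSeek file may look different.

```python
import xml.etree.ElementTree as ET


def build_data_config(jdbc_url, user, password, tables):
    """Return db-data-config.xml content with one <entity> per table."""
    root = ET.Element("dataConfig")

    # One <dataSource> describes how to reach one database.
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",  # assumption: classic Connector/J class name
        "url": jdbc_url,
        "user": user,
        "password": password,
    })

    # One <entity> per table; the query decides which rows get imported.
    document = ET.SubElement(root, "document")
    for table in tables:
        ET.SubElement(document, "entity", {
            "name": table,
            "query": f"SELECT * FROM {table}",
        })

    return ET.tostring(root, encoding="unicode")


# Placeholder connection details and table names, for illustration only.
config_xml = build_data_config(
    "jdbc:mysql://localhost/example_db", "solr_user", "secret", ["births", "deaths"]
)
print(config_xml)
```

Adding another table later means adding one more name to the list, which mirrors adding one more <entity> by hand.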

Import!

schema.xml

• FieldTypes, Fields, and CopyFields

• FieldTypes give indexing and querying instructions to “buckets”

• Fields say what's what and whether to make something facetable or not

• CopyFields copy several Fields together into one extra Field, which can use its own FieldType

schema.xml - FieldTypes

• 5 Custom FieldTypes (so far):

– givenname

– surname

– surname_bmpm (phonetic)

– place (note: not merely town)

– year (which we're treating as text right now)



schema.xml - Fields

• Uppercase field names come straight from the MySQL column names

• Examples:

– Year

– SchoolYear

– Surname

– FathersTown

– MothersFathersGivenName

– MaternalGrandfathersGivenName

schema.xml - Fields

• Lowercase fields are added while the data is being imported into Solr, and start with the prefix record_

• Examples:

– record_type (birth, death, tax, whatever)

– record_source (name of repository)

– record_latlong (latitude,longitude)

– record_id (required!)

schema.xml - Fields

• You do not have to explicitly define every Field.

• If something is imported that is not named and defined in schema.xml it will just be indexed as a straight-up text string, with nothing done to it.

• Which is fine.

• But IMHO it's better to define everything anyway so you can remember what's what and what you are doing to it.

schema.xml - CopyFields

Add-ons and nice-to-have's

(for the back-end)

• Wildcards, and lots of 'em

• Non-name words handled through stopwords.txt

• Nicknames and name synonyms handled through synonyms.txt

• Two files included:

– synonyms_-_american-anglo-saxon.txt

– synonyms_-_polish-ukrainian-jewish.txt

• Should be based on your data and your historical/ethnic community standards
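
To make the synonyms idea concrete, here is a small Python sketch of how a Solr-style synonyms file maps one name to its variants. It only imitates the file format (comma-separated equivalents, "=>" for one-way mappings, "#" comment lines); Solr does this itself at index or query time. The sample entries are invented and are not taken from the two files named above.

```python
# Hypothetical entries in Solr's synonyms.txt format.
SAMPLE_SYNONYMS = """
# comma-separated names are treated as equivalent
william, bill, will
# "=>" maps the left side to the right side only
chaim => chaim, hyman
"""


def parse_synonyms(text):
    """Return a dict mapping each lowercase name to the set of names it expands to."""
    expansions = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if "=>" in line:
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",") if t.strip()]
            targets = {t.strip().lower() for t in right.split(",") if t.strip()}
            for source in sources:
                expansions.setdefault(source, set()).update(targets)
        else:
            group = {t.strip().lower() for t in line.split(",") if t.strip()}
            for name in group:
                expansions.setdefault(name, set()).update(group)
    return expansions


def expand_name(name, expansions):
    """Return the sorted list of names a search for `name` should also match."""
    key = name.strip().lower()
    return sorted(expansions.get(key, {key}))


synonyms = parse_synonyms(SAMPLE_SYNONYMS)
print(expand_name("Bill", synonyms))
print(expand_name("Chaim", synonyms))
```

Note the one-way mapping: a search for the left-hand name picks up its variants, but the variants do not automatically map back.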

More add-ons and nice-to-have's

(for the back-end)

• Translate your site into different languages – multi-lingual content deserves a real multi-lingual website

– Pass user preferences through GET value or through accept-language header or read from a cookie or whatever you want

• Built-in performance monitoring hooks for New Relic

• Soundalike searches for surname variants

– Levenshtein distance

– “Regular” Soundex, Metaphone, Caverphone, etc.
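
As a reminder of what Levenshtein distance measures, here is a plain Python sketch: the number of single-letter insertions, deletions, or substitutions needed to turn one spelling into another. It is a textbook implementation for illustration, not the code Solr or LeafSeek runs, and the helper similar_surnames, its threshold, and the sample surname list are made up.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similar_surnames(query, candidates, max_distance=1):
    """Return the candidates within max_distance edits of query, ignoring case."""
    query = query.lower()
    return [c for c in candidates if levenshtein(query, c.lower()) <= max_distance]


print(levenshtein("kitten", "sitting"))
print(similar_surnames("Schreier", ["Schreyer", "Schrier", "Ganz"]))
```

Edit distance catches typos and small spelling shifts; phonetic codes such as Soundex or BMPM are aimed at names that sound alike but are spelled quite differently.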

This is the part where I tell the story about THE SAGA of Beider-Morse Phonetic Matching (BMPM)

Relevancy

• Right now, we're using exact matches

• (Of course, “exact” includes wildcards, alternate names / synonyms, etc.)

• Like “Old Search” on Ancestry.com

• DisMax! Boosting fields! Scoring!

• (…but not yet)

• Problems with records that contain multiple people's names

Lots of Front-End Options

• Ruby: Sunspot, RSolr, Tanning Bed, acts-as-solr

• Django/Python: Haystack, Sunburnt, solrpy, pysolr

• Older PHP options: PECL, solr-php-client

• Plugins for blog/CMS systems: Drupal, WordPress

Meet Solarium

• http://www.solarium-project.org/

• New, open source PHP wrapper for Solr

• Very active development

• Version 2.4 coming soon

File Structure: Front-End

Meet Solarium: The Config

Meet Solarium: The Guts

• You choose the parts of your data to facet

• Data is submitted to the front-end by POST, not by GET, so the URL never changes

• You can (and should) paginate results listings

• You can't actually see the Solr server's URL from the front-end, not even in view-source
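
Paginating results mostly comes down to two Solr request parameters: start (the offset of the first result) and rows (how many results to return). Here is a minimal Python sketch of the arithmetic. The helper names and the page size of 25 are made up, and the LeafSeek front-end itself does this in PHP through Solarium.

```python
import math


def page_params(page, per_page=25):
    """Return the Solr start/rows parameters for a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be at least 1")
    return {"start": (page - 1) * per_page, "rows": per_page}


def total_pages(num_found, per_page=25):
    """Return how many pages are needed to show num_found results."""
    return math.ceil(num_found / per_page)


print(page_params(3))
print(total_pages(101))
```

The total result count comes back with every Solr response, so the page links can be built without a second query.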


Add-ons and nice-to-have's

(for the front-end)

• A welcome screen with information about the database's contents

• Instructions (maybe twice)

• How many records in the database?

• How many datasets?

• What features are coming next?

• What datasets are coming next?


More add-ons and nice-to-have's

(for the front-end)

• Make good UI choices

• Pop-Up Google Maps

• Tooltips to reduce UI clutter

• Cross-browser compatibility

• Still stuck with IE 7 and 8

• CSS and code that degrades gracefully

• No small text

Bird's-Eye View of Your Data

• What (surnames, towns, etc.) do I have in my data?

• What are the TOP (surnames, towns, etc.) in my data?

• Finding incorrect data

– Outlying years and dates

– Figure out that hard-to-read surname

• Make charts and graphs from your data
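
Facet counts are one way to get this kind of overview. The Python sketch below builds a query string from Solr's standard facet parameters (facet, facet.field, facet.limit, facet.mincount) and pairs up the flat term/count list that Solr's JSON response uses by default. The field name Surname comes from the examples earlier; the sample counts are invented, the helper names are hypothetical, and nothing is actually sent to a server.

```python
from urllib.parse import urlencode


def top_values_query(field, limit=10):
    """Return a query string asking Solr for the most common values of `field`."""
    params = [
        ("q", "*:*"),            # match every record
        ("rows", 0),             # only the facet counts are wanted, not the records
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),  # how many top values to return
        ("facet.mincount", 1),   # leave out values that never occur
        ("wt", "json"),
    ]
    return urlencode(params)


def pair_facet_counts(flat_list):
    """Turn Solr's flat [value, count, value, count, ...] list into (value, count) pairs."""
    return list(zip(flat_list[0::2], flat_list[1::2]))


print(top_values_query("Surname", limit=5))

# Invented sample of what facet_counts.facet_fields.Surname might contain.
sample_counts = ["Weiss", 12, "Kowalski", 7, "Wiess", 1]
print(pair_facet_counts(sample_counts))
```

Values with a count of 1 at the bottom of such a list are often the interesting ones: they tend to be typos, misread surnames, or outlying years.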

The (Back-End) Future! (Maybe.)

• Date ranges, instead of just years

• Auto-complete as you type

• “Did you mean...?” (based on data frequency)

• “More Like This” (would have to do scoring)

• Record bookmarking system (hashes?)

The (Front-End) Future! (Maybe.)

• Hierarchical facets for locations

• Disambiguating locations

• Social sharing of individual records

• New genealogy data schema (http://historical-data.org/)

• Membership login system

Please Do Not Build That Wall

• Password protect some of the databases

• Password protect some of the data

• Open data, but pay for record or surname bookmarking system

• Open data, but pay for API access

• Open data, but sell online ads

• Open data, but give people guilt trips

Presenting LeafSeek!

• Free and Open Source

• Code is all on GitHub

• Please add, edit, fix, change, tinker

• …and use it!

Why is this FREE?

And why is this important?

Thank you! :-)

Date post:	02-Jul-2015