Date posted: 02-Jul-2015
Category: Technology
Uploaded by: brooke-ganz
Creating an Open Source
Genealogical Search Engine
With Apache Solr
Brooke Schreier Ganz
Twitter: @LeafSeek
www.LeafSeek.com
Hi, I'm Brooke
• I make web stuff for fun, and (sometimes) for profit
• Web Developer at IBM.com and Disney Consumer Products
• Lead Programmer at TMZ.com (yikes, sorry about that)
• Senior Web Producer at Bravo cable TV network and its spin-off websites
• Big dork
• Big genealogy dork
• #BigData dork
Meet Gesher Galicia
• Non-profit 501(c)(3) genealogy society
• Founded in 1993
• Hundreds of members, worldwide
• E-mail discussion group
• New website development in progress
(existing website is fugly)
• Needs a search engine…for data
The Old Problem
The New Problem
• Diverse Data Languages
(German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…)
• Diverse Data Types
(births, marriages, deaths, divorces, tax
lists, landsmanschaften lists, industrial
permit lists, school
yearbooks, governmental yearbooks…)
Diverse Data Shapes
Existing solutions
• They're okay...for small numbers of
databases, with small amounts of data
– Steve Morse's One-Step Tool Creator
– Roll-your-own solution with PHP and MySQL
• Both get more difficult to manage as data
sets increase in number and complexity
In space, no one can hear your data scream
To Sum Up
• There are lots of ways to publish your tree
• …but not so many ways to publish your
data
• Surely there must be a way to deal with
this?
So I Made A Thing
But “That Thing I Made With The Database And Stuff”
was kind of an awkward name, so I called it
LeafSeek
This is the part where I show you all
the shiny new All Galicia Database
http://search.geshergalicia.org/
Meet Apache Solr
• Highly functional open source search
platform
• Based on Apache Lucene (Java)…
• …plus a web wrapper/API
• Not the prettiest or simplest tool
• FREE and open source
Saves Time, and Heartache
Saves Time, and Stomachache
File Structure: Back-End
Welcome to /conf
The Important Stuff
solrconfig.xml
Make sure this part is configured, so you can
import data:
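The screenshot from this slide is not reproduced here. Assuming it refers to Solr's DataImportHandler (the later slides point at db-data-config.xml), a quick sanity check might look like the Python sketch below. The sample solrconfig.xml excerpt, the "/dataimport" handler name, and the function name are illustrative assumptions, not the project's actual config.

```python
import xml.etree.ElementTree as ET

# Hypothetical solrconfig.xml excerpt; handler name and config file name are assumptions.
SAMPLE_SOLRCONFIG = """
<config>
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>
</config>
"""


def find_dataimport_config(solrconfig_xml):
    """Return the import config file name, or None if no import handler is set up."""
    root = ET.fromstring(solrconfig_xml.strip())
    for handler in root.iter("requestHandler"):
        # Match on the class name so the handler can be registered under any path.
        if handler.get("class", "").endswith("DataImportHandler"):
            node = handler.find("./lst[@name='defaults']/str[@name='config']")
            return node.text if node is not None else None
    return None


print(find_dataimport_config(SAMPLE_SOLRCONFIG))
```

If the function returns None for your own solrconfig.xml, the import handler is probably missing or commented out.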
How to get your data into Solr
• Step 1: Make a properly-formatted spreadsheet
• Step 2: Save spreadsheet as a .CSV file
• Step 3: Create a MySQL database + table
• Step 4: Import CSV into that new table
• Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT)
• Step 6: Add this table's information to db-data-config.xml
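Steps 2 through 5 can be sketched in a few lines of Python. To keep the example self-contained it uses the standard-library sqlite3 module as a stand-in for MySQL, and the CSV contents, table name, and function name are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; real column names come from your own spreadsheet.
SAMPLE_CSV = """Surname,GivenName,Year,Town
Schreier,Moses,1872,Lemberg
Ganz,Rivka,1890,Tarnow
"""


def load_csv(conn, table, csv_text):
    """Create `table` with an auto-incrementing "id" key and import the CSV rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    # Steps 3 and 5: new table plus a unique auto-incrementing primary key called "id".
    # (In MySQL the id column would be declared INT AUTO_INCREMENT PRIMARY KEY.)
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})'
    )
    # Step 4: import every non-empty CSV row into the new table.
    column_list = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
print(load_csv(conn, "births", SAMPLE_CSV), "rows imported")
```

Every column is loaded as text here, which keeps the import forgiving of messy historical data.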
db-data-config.xml
• Basic XML file that tells Solr how to grab
data from your MySQL database(s)
• Add new <dataSource> for new databases
• Add new <entity> for new tables within the
databases
• You need to make sure your MySQL
connector .jar is installed for this to work
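As a rough idea of what ends up in db-data-config.xml, here is a Python sketch that generates a minimal file for one database with one <entity> per table. The database name, table names, credentials, and the function name are placeholders; the element and attribute names follow Solr's DataImportHandler conventions as I understand them, so check them against your Solr version.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def build_dih_config(db_name, tables, user, password, host="localhost"):
    """Return a minimal data-import config: one <dataSource>, one <entity> per table."""
    root = ET.Element("dataConfig")
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",  # needs the MySQL connector .jar installed
        "url": f"jdbc:mysql://{host}/{db_name}",
        "user": user,
        "password": password,
    })
    document = ET.SubElement(root, "document")
    for table in tables:
        # Each table becomes its own entity that pulls every column.
        ET.SubElement(document, "entity", {
            "name": table,
            "query": f"SELECT * FROM {table}",
        })
    raw = ET.tostring(root, encoding="unicode")
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Placeholder names, not the real Gesher Galicia databases.
print(build_dih_config("genealogy", ["births", "deaths"], "solr_user", "secret"))
```

A second database would need its own <dataSource>, which this sketch leaves out.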
Import!
schema.xml
• FieldTypes, Fields, and CopyFields
• FieldTypes are “buckets” that carry the
indexing and querying instructions
• Fields say what's what (including which
FieldType each one uses) and whether to
make something facetable or not
• CopyFields collect several Fields together
into one extra Field, which can use a
different FieldType
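To make the three pieces concrete, the Python sketch below parses a small, hypothetical schema.xml fragment and lists which source fields feed each copyField destination. The analyzer chain, the "MothersSurname" field, and the "all_surnames" destination are assumptions for illustration, not the project's real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema.xml fragment showing a fieldType, three fields, and two copyFields.
SCHEMA_FRAGMENT = """
<schema name="example" version="1.5">
  <types>
    <fieldType name="surname" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="Surname" type="surname" indexed="true" stored="true"/>
    <field name="MothersSurname" type="surname" indexed="true" stored="true"/>
    <field name="all_surnames" type="surname" indexed="true" stored="false"
           multiValued="true"/>
  </fields>
  <copyField source="Surname" dest="all_surnames"/>
  <copyField source="MothersSurname" dest="all_surnames"/>
</schema>
"""


def copy_targets(schema_xml):
    """Map each copyField destination to the list of fields copied into it."""
    root = ET.fromstring(schema_xml.strip())
    targets = {}
    for rule in root.iter("copyField"):
        targets.setdefault(rule.get("dest"), []).append(rule.get("source"))
    return targets


print(copy_targets(SCHEMA_FRAGMENT))
```

Searching the combined destination field then matches a surname no matter which person in the record it belongs to.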
schema.xml - FieldTypes
• 5 Custom FieldTypes (so far):
– givenname
– surname
– surname_bmpm (phonetic)
– place (note: not merely town)
– year (which we're treating as text right now)
schema.xml - Fields
• Capitalized fields take their names from the MySQL column names
• Examples:
– Year
– SchoolYear
– Surname
– FathersTown
– MothersFathersGivenName
– MaternalGrandfathersGivenName
schema.xml - Fields
• Lowercase fields are added while the
data is being imported into Solr, and start
with the prefix record_
• Examples:
– record_type (birth, death, tax, whatever)
– record_source (name of repository)
– record_latlong (latitude,longitude)
– record_id (required!)
schema.xml - Fields
• You do not have to explicitly define every Field.
• If something is imported that is not named and defined in schema.xml it will just be indexed as a straight-up text string, with nothing done to it.
• Which is fine.
• But IMHO it's better to define everything anyway so you can remember what's what and what you are doing to it.
schema.xml - CopyFields
Add-ons and nice-to-haves
(for the back-end)
• Wildcards, and lots of 'em
• Non-name words handled through stopwords.txt
• Nicknames and name synonyms handled through synonyms.txt
• Two files included:
– synonyms_-_american-anglo-saxon.txt
– synonyms_-_polish-ukrainian-jewish.txt
• Should be based on your data and your historical/ethnic community standards
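To show what a synonyms file buys you, here is a small Python sketch that reads comma-separated synonym groups and expands a searched name into its variants. Solr does this inside its analysis chain; this standalone version only imitates the effect. The sample names and function names are illustrative, and explicit "=>" mappings are skipped for brevity.

```python
# Hypothetical synonyms file contents: one comma-separated group of equivalents per line.
SAMPLE_SYNONYMS = """
# given-name variants (illustrative only)
Yitzchak, Isaac, Izak
Rivka, Rebecca, Regina
"""


def load_synonym_groups(text):
    """Parse comma-separated synonym groups, ignoring comments and blank lines."""
    groups = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=>" in line:
            continue  # this sketch skips explicit "a => b" mappings
        names = {name.strip().lower() for name in line.split(",") if name.strip()}
        if len(names) > 1:
            groups.append(names)
    return groups


def expand(name, groups):
    """Return the name plus every synonym that shares a group with it."""
    wanted = name.lower()
    result = {wanted}
    for group in groups:
        if wanted in group:
            result |= group
    return sorted(result)


groups = load_synonym_groups(SAMPLE_SYNONYMS)
print(expand("Isaac", groups))
```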
More add-ons and nice-to-haves
(for the back-end)
• Translate your site into different languages
– Multi-lingual content deserves a real multi-lingual website
– Pass user preferences through a GET value, the Accept-Language header, a cookie, or whatever you want
• Built-in performance monitoring hooks for New Relic
• Soundalike searches for surname variants
– Levenshtein distance
– “Regular” Soundex, Metaphone, Caverphone, etc.
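For a feel of how edit-distance matching finds surname variants, here is a plain-Python Levenshtein distance with a tiny helper on top. The sample surnames, the helper name, and the one-edit threshold are arbitrary choices for illustration; inside Solr you would normally rely on its built-in fuzzy and phonetic matching instead.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def close_surnames(query, candidates, max_distance=1):
    """Return candidates within max_distance edits of the query, ignoring case."""
    return [
        name for name in candidates
        if levenshtein(query.lower(), name.lower()) <= max_distance
    ]


print(close_surnames("Ganz", ["Gantz", "Gans", "Schreier"]))
```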
This is the part where I tell
the story about
THE SAGA
of Beider-Morse Phonetic Matching
(BMPM)
Relevancy
• Right now, we're using exact matches
• (Of course, “exact” includes
wildcards, alternate names /
synonyms, etc.)
• Like “Old Search” on Ancestry.com
• DisMax! Boosting fields! Scoring!
• (…but not yet)
• Problems with records with multiple
people's names in the record
Lots of Front-End Options
• Ruby:
Sunspot, RSolr, Tanning Bed, acts-as-solr
• Django/Python:
Haystack, Sunburnt, solrpy, pysolr
• Older PHP options:
PECL, solr-php-client
• Plugins for blog/CMS systems:
Drupal, WordPress
Meet Solarium
• http://www.solarium-project.org/
• New, open source PHP wrapper for Solr
• Very active development
• Version 2.4 coming soon
File Structure: Front-End
Meet Solarium: The Config
Meet Solarium: The Guts
• You choose the parts of your data to facet
• Data is submitted to the front-end by POST, not by GET, so the URL never changes
• You can (and should) paginate results listings
• You can't actually see the Solr server's URL from the front-end, not even in view-source
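Under the hood, the front-end turns the submitted form into a Solr request on the server side, which is why the browser never sees the Solr URL. The Python sketch below builds such a request using Solr's standard select parameters (q, start, rows, facet, facet.field). It is not Solarium code, and the host, path, and function name are assumptions.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint, reachable only from the web server.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(query, page=1, per_page=20, facet_fields=()):
    """Build a Solr select URL with pagination and optional field facets."""
    params = [
        ("q", query),
        ("start", (page - 1) * per_page),  # offset of the first result on this page
        ("rows", per_page),                # page size
        ("wt", "json"),
    ]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_search_url("Surname:Ganz", page=3, facet_fields=["record_type", "Year"]))
```

Because only the server makes this call, the visitor's POST to the front-end is the only request the browser ever sees.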
Add-ons and nice-to-haves
(for the front-end)
• A welcome screen with information about
the database's contents
• Instructions (maybe twice)
• How many records in the database?
• How many datasets?
• What features are coming next?
• What datasets are coming next?
Add-ons and nice-to-haves
(for the front-end)
• Make good UI choices
• Pop-Up Google Maps
• Tooltips to reduce UI clutter
• Cross-browser compatibility
• Still stuck with IE 7 and 8
• CSS and code that degrades gracefully
• No small text
Bird's Eye View of Your Data
• What (surnames, towns, etc.) do I have in
my data?
• What are the TOP (surnames, towns, etc.)
in my data?
• Finding incorrect data
– Outlying years and dates
– Figure out that hard-to-read surname
• Make charts and graphs from your data
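The questions above boil down to counting values and flagging oddities. Solr's facet counts do this for you, but a standalone Python sketch shows the idea. The sample records, the function names, and the 1700 to 1950 year range are invented for illustration.

```python
from collections import Counter

# Hypothetical records; Year is kept as text, as in the index.
SAMPLE_RECORDS = [
    {"Surname": "Ganz", "Town": "Tarnow", "Year": "1890"},
    {"Surname": "Ganz", "Town": "Lemberg", "Year": "1872"},
    {"Surname": "Schreier", "Town": "Lemberg", "Year": "2890"},  # likely a typo
]


def top_values(records, field, n=3):
    """Return the n most common values of a field with their counts."""
    counts = Counter(r[field] for r in records if r.get(field))
    return counts.most_common(n)


def outlying_years(records, earliest, latest):
    """Return records whose Year is missing, non-numeric, or outside the range."""
    flagged = []
    for record in records:
        year = str(record.get("Year", ""))
        if not year.isdigit() or not earliest <= int(year) <= latest:
            flagged.append(record)
    return flagged


print(top_values(SAMPLE_RECORDS, "Surname", n=1))
print(outlying_years(SAMPLE_RECORDS, 1700, 1950))
```

The same counts can feed a charting library to produce the graphs mentioned in the last bullet.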
The (Back-End) Future! (Maybe.)
• Date ranges, instead of just years
• Auto-complete as you type
• “Did you mean...?”
(based on data frequency)
• “More Like This”
(would have to do scoring)
• Record bookmarking system (hashes?)
The (Front-End) Future! (Maybe.)
• Hierarchical facets for locations
• Disambiguating locations
• Social sharing of individual records
• New genealogy data schema
http://historical-data.org/
• Membership login system
Please Do Not Build That Wall
• Password protect some of the databases
• Password protect some of the data
• Open data, but pay for record or surname
bookmarking system
• Open data, but pay for API access
• Open data, but sell online ads
• Open data, but give people guilt trips
Presenting LeafSeek!
• Free and Open Source
• Code is all on GitHub
• Please add, edit, fix, change, tinker
• …and use it!
Why is this FREE?
And why is this important?
Thank you! :-)