Date post: | 12-Apr-2017 |
Category: | Technology |
Upload: | nakul-jeirath |
A Journey From Relational to Graph
Trials and Tribulations on the Path to Graph
Introduction
● Nakul Jeirath
● Senior security engineer at WellAware (wellaware.us)
● WellAware: Oil & gas startup building a SaaS monitoring & analytics platform
Wikipedia List of Graph DBs
https://en.wikipedia.org/wiki/Graph_database
We use Titan+Cassandra
Transitioned ~2 years ago
Why Switch?
Graph model allowed modeling of well pad and derived calculations
Visualization built with http://js.cytoscape.org/
Overview
● Quick graph overview + toy example
● Our journey
○ Episode I: Development
○ Episode II: Migration
○ Episode III: Operation
Property Graph
Vertex - label: employee, name: Nakul
Vertex - label: company, name: WellAware
Edge (employee --> company) - label: works for, hired: 9/13
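A minimal sketch of the property graph above as plain Python data structures. The `Vertex` and `Edge` class names and their fields are illustrative assumptions, not Titan's API; the labels and property values come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: Vertex  # vertex the edge leaves
    in_v: Vertex   # vertex the edge points to
    properties: dict = field(default_factory=dict)


# Values taken from the slide; the structure itself is a sketch.
nakul = Vertex("employee", {"name": "Nakul"})
wellaware = Vertex("company", {"name": "WellAware"})
works_for = Edge("works for", nakul, wellaware, {"hired": "9/13"})
```

Both vertices and edges carry a label plus arbitrary key/value properties, which is what distinguishes a property graph from a bare node/edge list.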
A Toy Example
http://coachesbythenumbers.com/sportsource-college-football-data-packages/
2005 College Football Data
● Team names & conferences
● Game record with dates and scores
● Interesting questions:
○ Records for all teams in conference X
○ Top 25 ranking using record + strength of opponents
○ Three team loop (A beat B beat C beat A)
● Source code: https://github.com/njeirath/titan-perf-tester
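The three team loop question above can be sketched in plain Python, assuming the games are held as a mapping from each team to the set of teams it beat. The function name and the sample teams are hypothetical and are not taken from the linked repository.

```python
def three_team_loops(beat):
    """Return loops (a, b, c) where a beat b, b beat c and c beat a.

    `beat` maps each team to the set of teams it defeated.
    """
    loops = set()
    for a, a_wins in beat.items():
        for b in a_wins:
            for c in beat.get(b, ()):
                if a in beat.get(c, ()) and len({a, b, c}) == 3:
                    # Rotate so the smallest name comes first; this keeps
                    # the same loop from being reported three times.
                    cycle = (a, b, c)
                    start = cycle.index(min(cycle))
                    loops.add(cycle[start:] + cycle[:start])
    return loops


# Made-up sample data: A beat B, B beat C, C beat A, D beat A.
sample = {"A": {"B"}, "B": {"C"}, "C": {"A"}, "D": {"A"}}
loops = three_team_loops(sample)
```

This is the same shape as a two-hop traversal along `beat` edges followed by a check for an edge back to the start.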
Toy Models
Graph
Vertex - label: team, name: Purdue, conf: Big 10
Vertex - label: team, name: IU, conf: Big 10
Edge (Purdue --> IU) - label: beat, date: 11/19/05, score: 41-14
SQL
Teams table: team_id, conference, name
Beat table: winner, loser, win_score, lose_score
Episode I: Development
SQL vs Gremlin
Developer Opinion
Example: Get Big 10 Records
SQL
SELECT win_record.NAME, win_record.wins, Count(l)
FROM (SELECT teams.team_id, teams.NAME AS NAME, Count(w) AS wins
      FROM teams JOIN beat AS w ON teams.team_id = w.winner
      WHERE conference = 'Big Ten Conference'
      GROUP BY teams.NAME, teams.team_id) AS win_record
JOIN beat AS l ON team_id = l.loser
GROUP BY win_record.NAME, win_record.wins
ORDER BY win_record.wins DESC;
Gremlin
g.V().order().by(__.outE().count(), decr).
  has('conference', 'Big Ten Conference').as('team', 'wins', 'losses').
  select('team', 'wins', 'losses').
    by('name').by(__.outE().count()).by(__.inE().count())
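To try the records query without a database server, here is a small sketch using Python's built-in `sqlite3` module and the Teams/Beat tables from the toy model. It is a simplified variant rather than the query from the slide: it uses correlated subqueries instead of joins. Only the Purdue over Indiana 41-14 game comes from the deck; "Team X" and the other rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teams (team_id INTEGER PRIMARY KEY, conference TEXT, name TEXT);
CREATE TABLE beat (winner INTEGER, loser INTEGER,
                   win_score INTEGER, lose_score INTEGER);
""")
conn.executemany("INSERT INTO teams VALUES (?, ?, ?)", [
    (559, "Big Ten Conference", "Purdue"),
    (306, "Big Ten Conference", "Indiana"),
    (1, "Hypothetical Conference", "Team X"),  # made-up team
])
conn.executemany("INSERT INTO beat VALUES (?, ?, ?, ?)", [
    (559, 306, 41, 14),  # from the slides
    (306, 1, 21, 7),     # made-up game
    (1, 559, 10, 3),     # made-up game
    (559, 1, 28, 0),     # made-up game
])

# One row per conference team: name, wins, losses, best record first.
records = conn.execute("""
    SELECT name,
           (SELECT COUNT(*) FROM beat WHERE winner = teams.team_id) AS wins,
           (SELECT COUNT(*) FROM beat WHERE loser = teams.team_id) AS losses
    FROM teams
    WHERE conference = 'Big Ten Conference'
    ORDER BY wins DESC
""").fetchall()
```

One design difference worth noting: the correlated subqueries keep teams with zero wins or zero losses, whereas inner joins against `beat` drop them.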
Example: Top 25 Ranking
SQL
SELECT teams.name, ranks.rank
FROM (SELECT beat.winner, Sum(rec.wins) AS rank
      FROM (SELECT teams.team_id, Count(w) AS wins
            FROM teams JOIN beat AS w ON w.winner = teams.team_id
            GROUP BY teams.team_id) AS rec
      JOIN beat ON beat.loser = rec.team_id
      GROUP BY beat.winner
      ORDER BY rank DESC
      LIMIT 25) AS ranks
JOIN teams ON teams.team_id = ranks.winner
ORDER BY ranks.rank DESC;
Gremlin
g.V().order().by(__.out().out().count(), decr).
  as('team', 'score', 'wins', 'losses').
  select('team', 'score', 'wins', 'losses').
    by('name').by(__.out().out().count()).
    by(__.outE().count()).by(__.inE().count()).
  limit(25)
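The ranking score in both queries is the number of two-hop `beat` paths from a team: for every opponent a team beat, add that opponent's own win count. A minimal Python sketch of that calculation follows, assuming games are stored as a mapping from team to a list of defeated teams; the `rank` function and the sample teams are hypothetical.

```python
def rank(beat, limit=25):
    """Score each team by the total wins of the teams it defeated.

    `beat` maps a team to a list of teams it beat, one entry per game.
    Teams that never won a game should still appear as keys with an
    empty list, otherwise they are left out of the result.
    """
    def wins(team):
        return len(beat.get(team, []))

    scores = {team: sum(wins(loser) for loser in losers)
              for team, losers in beat.items()}
    # Highest score first; ties keep their original order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:limit]


# Made-up sample: D only beat A, but A has two wins, so D scores highest.
sample_games = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["A"]}
top_two = rank(sample_games, limit=2)
```

The sample shows why this counts as "record + strength of opponents": one win over a strong team can outscore two wins over weak ones.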
/r/mildlyinteresting/
1. Texas
2. USC
3. Penn State
4. Ohio State
5. Virginia Tech
6. TCU
7. West Virginia
8. Louisiana State
9. Alabama
10. Oregon
11. Louisville
12. Georgia
13. UCLA
14. Miami (FL)

1. Texas
2. USC
3. Penn State
4. Virginia Tech
5. LSU
6. Ohio State
7. Georgia
8. TCU
9. West Virginia
10. Alabama
11. Boston College
12. Oklahoma
13. Florida
14. UCLA
http://www.collegefootballpoll.com/2005_archive_computer_rankings.html
2005 End of Season Computer Rankings
Our Query Results
Developer Opinion
● ORMs
○ Move to graph, lost Django ORM
○ ORM/OGM option at the time was Totorom
● Query Language
○ Gremlin seems more intuitive
Episode II: Migration
Essentially an ETL operation:
1. Export tables (table name --> vertex label, columns --> vertex properties)
2. Export FK/Join tables (FK/Join table name --> edge label)
team_id  conference  name
559      Big 10      Purdue
306      Big 10      Indiana
...

winner  loser  win_score  lose_score
559     306    41         14
...
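The two export steps can be sketched in Python using the sample rows above. This is an illustration under assumptions, not the migration code used at WellAware: the function names and the dictionary layout for vertices and edges are hypothetical, and a real load would write to the graph database instead of building in-memory structures.

```python
def rows_to_vertices(label, rows, id_column):
    """Step 1: each table row becomes a vertex; columns become properties."""
    return {(label, row[id_column]): {"label": label, "properties": dict(row)}
            for row in rows}


def join_rows_to_edges(edge_label, rows, out_column, in_column, vertex_label):
    """Step 2: each FK/join row becomes an edge between two vertices."""
    edges = []
    for row in rows:
        # Columns that are not endpoints are kept as edge properties.
        extra = {k: v for k, v in row.items() if k not in (out_column, in_column)}
        edges.append({
            "label": edge_label,
            "out": (vertex_label, row[out_column]),
            "in": (vertex_label, row[in_column]),
            "properties": extra,
        })
    return edges


team_rows = [
    {"team_id": 559, "conference": "Big 10", "name": "Purdue"},
    {"team_id": 306, "conference": "Big 10", "name": "Indiana"},
]
beat_rows = [{"winner": 559, "loser": 306, "win_score": 41, "lose_score": 14}]

vertices = rows_to_vertices("team", team_rows, "team_id")
edges = join_rows_to_edges("beat", beat_rows, "winner", "loser", "team")
```

Keying vertices by `(label, id)` is a deliberate choice here, since relational IDs are only unique within their own table.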
Challenges:
● Dealing with indices
● Migrating a production DB
Challenges with Index
Relational DB indices are local per table, graph IDs are global

ID  Name   Teacher
1   Kyle   1
2   Stan   1
3   Kenny  1
...

ID  Teacher
1   Garrison
...

Vertex - label: student, pg_id: 1
Vertex - label: teacher, pg_id: 1
Unique key is vertex label + pg_id
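A small sketch of why the composite key matters: student 1 and teacher 1 share the same relational ID, so a lookup by `pg_id` alone would be ambiguous. The `IdRegistry` class below is a hypothetical in-memory stand-in; in a real graph database the global IDs are assigned by the database and the lookup would go through an index on label + `pg_id`.

```python
import itertools


class IdRegistry:
    """Maps (vertex label, pg_id) pairs to globally unique graph IDs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._by_key = {}

    def graph_id(self, label, pg_id):
        key = (label, pg_id)
        if key not in self._by_key:
            # First time this row is seen: hand out a fresh global ID.
            self._by_key[key] = next(self._next_id)
        return self._by_key[key]


registry = IdRegistry()
student_1 = registry.graph_id("student", 1)
teacher_1 = registry.graph_id("teacher", 1)
```

Asking for `("student", 1)` a second time returns the same graph ID, which is what lets the edge export step find the vertices created by the table export step.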
Migrating a Production DB
Potentially large amounts of data - batch loading optimizations
Two kinds of data: static and time series
Step 1: Move static
Step 2: Reroute requests and data
Step 3: Move old time series (TS)
Episode III: Operating Graph
Usual benefits of NoSQL
● Designed for scalability - built in sharding, redundancy, etc.
○ Ex: Titan pluggable with Cassandra/HBase
● Usually allows on the fly schema changes
○ Flexible migrations avoid DB downtime
Underlying DB technology requires expertise, tuning, monitoring, etc.
Performance
If not considered early, OLTP performance can potentially be an issue
Consider Titan architecture:
Server
Titan JVM
Storage Backend
Gremlin evaluated here
g.V().has('name', 'Purdue').out('beat').values('name')
has('name', 'Purdue') --> index retrieval
out('beat') --> edge traversal
values('name') --> vertex property retrieval
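To make the three retrieval steps concrete, here is a purely illustrative cost model in Python: a toy in-memory backend that counts one read per step. The `ToyBackend` class and its methods are hypothetical; Titan's real read path, caching and batching are not modeled here.

```python
class ToyBackend:
    """In-memory stand-in for a storage backend that counts every read."""

    def __init__(self):
        self.reads = 0
        self._name_index = {"Purdue": [1]}    # name -> vertex ids
        self._out_edges = {1: {"beat": [2]}}  # vertex id -> label -> neighbor ids
        self._properties = {1: {"name": "Purdue"}, 2: {"name": "IU"}}

    def index_lookup(self, name):
        self.reads += 1
        return self._name_index.get(name, [])

    def out_neighbors(self, vertex_id, label):
        self.reads += 1
        return self._out_edges.get(vertex_id, {}).get(label, [])

    def property_value(self, vertex_id, key):
        self.reads += 1
        return self._properties[vertex_id][key]


def teams_beaten_by(backend, name):
    results = []
    for vertex_id in backend.index_lookup(name):                      # has('name', ...)
        for neighbor in backend.out_neighbors(vertex_id, "beat"):     # out('beat')
            results.append(backend.property_value(neighbor, "name"))  # values('name')
    return results


backend = ToyBackend()
beaten = teams_beaten_by(backend, "Purdue")
```

In this toy model the read count grows with the number of matched vertices and traversed neighbors, which is the kind of cost that is easy to overlook when a query looks like a single line.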
Dealing with Performance
● Understand storage structures
● Understand Cassandra characteristics
○ Ex: Deletes are generally costly (they leave tombstones behind)
● Talks on Titan+Cassandra tuning:
○ Ted Wilmes - Cassandra Summit 2015:
■ Slides: http://www.slideshare.net/twilmes/modeling-the-iot-with-titandb-and-cassandra
■ Video: https://vimeopro.com/user35188327/cassandra-summit-2015/video/143695770
○ Nakul Jeirath - Graph Day TX:
Titan data model documentation: http://s3.thinkaurelius.com/docs/titan/1.0.0/data-model.html
Our Approach
Lots of real-time data, tiny bit of relatively static data
● Static: some optimization, mostly caching of static data (code optimization + caching)
● Time series: heavily optimized real-time (model changes + code optimization)
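Caching relatively static data can be sketched as a small time-to-live cache in front of the graph lookup. Everything here is an assumption for illustration: the `TTLCache` class, the loader callback and the expiry policy are not described in the talk.

```python
import time


class TTLCache:
    """Caches loader results per key and reloads them after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader  # function that actually queries the graph
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # key -> (time loaded, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]    # still fresh: skip the graph query
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

A cache like this fits data that changes rarely and is read often; the time series side of the workload would not benefit from it in the same way.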
Maturity of Graph
● Query languages
○ SQL allows relative ease of switching relational DB vendors
○ TinkerPop for graph but not universally supported today
● Version upgrades
○ Currently on Titan 0.4.4
○ 0.4.4 --> 0.5.*: not storage compatible (require ETL to upgrade)
○ 0.4.4 --> 1.*: not storage compatible, query code rewrite
Summary
● Development
○ Gremlin easier to work with than SQL (opinion)
○ Tools for SQL more mature and varied but graph is catching up
● Migration
○ Relational --> Graph generally requires ETL
● Operation
○ NoSQL benefits of distributed, scalable, schemaless DBs
○ Performance can be an issue if not considered early
○ Graph vendor/version coupling but will improve with maturity
Thanks For Watching
Questions
Nakul Jeirath
@njeirath
Senior Security Engineer - WellAware