Stop Worrying & Love the SQL - A Case Study

Post on 15-Jul-2015

Stop Worrying & Love the SQL! (the Quepid story)

OpenSource Connections

Me/Us

Doug Turnbull @softwaredoug

Likes: Solr, Elasticsearch, Cassandra, Postgres

OpenSource Connections @o19s

Search, Discovery and Analytics

Let us introduce you to freelancing!

Most importantly we do...

Make my search results more relevant!

“Search Relevancy”

What database works best for problem X?

“(No)SQL Architect/Trusted Advisor”

How products actually get built

Rena: Doug, John can you come by this afternoon?

One of our Solr-based products needs some urgent relevancy work

It’s Friday, it needs to get done today!

Us: Sure!

The Client (Rena!): smart cookie!

A few hours later

Us: we’ve made a bit of progress!

image frustration-1081 by jseliger2

Rena: but every time we fix something, we break an existing search!

Us: yeah! we’re stuck in a whack-a-mole game

other image: whack a mole by jencu

Whack-a-Mole

What search relevancy work actually looks like

I HAVE AN IDEA

● Middle of the afternoon, I stop doing search work and start throwing together some Python

from flask import Flask
app = Flask(__name__)

Everyone: Doug, stop that, you have important search work to do!

Me: We’re not making any progress!

WE NEED A WAY TO REGRESSION TEST OUR RELEVANCY AS WE TUNE!

Everyone: You’re nuts!

What did I make?

Focus on gathering stakeholder (i.e. Rena) feedback on search, coupled with workbench tuning against that feedback

Today we have customers...

… forget that, tell me about your failures!

Our war story

My mistakes:

● Building a product

● Selling a product

● As a user experience engineer

● As an Angular developer

● At choosing databases

Quepid 0.0.0.0.0.0.1

Track multiple user searches

for this query (hdmi cables) Rena rates this document as a good/bad search result

need to store:

<search> -> <id for search result> -> <rating 1-10>

“hdmi cables” -> “doc1234” -> “10”

*Actual UI may have been much uglier

Data structure selection under duress

● What’s simple, easy, and will persist our data?

● What plays well with python?

● What can I get working now in Rena’s office?

Redis

● In memory “Data Structure Server”

○ hashes, lists, simple key -> value storage

● Persistent -- write to disk every X minutes

Redis

Easy to install and go!

$ pip install redis

from redis import Redis
redis = Redis()
redis.set("foo", "bar")
redis.get("foo")  # gets 'bar' (b'bar' on Python 3 unless decode_responses=True)

Specific to our problem:

from redis import Redis
redis = Redis()

ratings = {"doc1234": "10", "doc532": "5"}
searchQuery = "hdmi cables"

# redis-py has no hsetall(); hset(..., mapping=...) needs redis-py 3.5+, older versions use hmset()
redis.hset(searchQuery, mapping=ratings)

Store a hash table at “hdmi cables” with:

“doc1234” -> “10”
“doc532” -> “5”

Success!

● My insanity paid off that afternoon

● Now we’re left with a pile of hacked together (terrible) code -- now what?

Adding some features

● Would like to add multiple “cases” (different search projects that solve different problems)

● Would like to add user accounts

● Still a one-off for Silverchair

Cases

Tuning a cable shopping site... … vs state laws

Cases in Redis?

Recall our existing implementation “data model”

from redis import Redis
redis = Redis()

ratings = {"doc1234": "10", "doc532": "5"}
searchQuery = "hdmi cables"

# one hash per search query (mapping= needs redis-py 3.5+; older versions use hmset())
redis.hset(searchQuery, mapping=ratings)

Out of the box, redis can deal with 2 levels deep:

{
    "hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "ethernet cables": ...
}

Can’t add extra layer (redis hash only one layer)

{
    "cable site": {
        "hdmi cables": {...},
        "ethernet cables": {...}
    },
    "laws site": {...}
}

Time to give up Redis?

“All problems in computer science can be solved by another level of indirection” -- David Wheeler

Crazy Idea: Add dynamic prefix to query keys to indicate case, i.e.:

{
    "case_cablestore_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "case_cablestore_ethernet cables": {...},
    "case_statelaws_car tax": {...}
}

The first two keys are queries for the “Cable Store” case; the last one is a query for the “State Laws” case.

To Fetch:

redis.keys("case_cablestore*")
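The prefix trick can be tried without a running Redis server. In this minimal sketch a plain dict stands in for the Redis keyspace, and the helper names `case_key` and `keys_for_case` are hypothetical, not part of redis-py. `fnmatchcase` only approximates the glob matching that `redis.keys(...)` performs, and the pattern ends in `_*` (an assumption that goes slightly beyond the slide) so that a case named `cablestore2` would not match.

```python
from fnmatch import fnmatchcase

# A plain dict stands in for the Redis keyspace (assumption: no server needed).
store = {}

def case_key(case_name, query):
    """Build the flat key that encodes the case as a prefix (hypothetical helper)."""
    return "case_%s_%s" % (case_name, query)

def keys_for_case(case_name):
    """Mimic redis.keys("case_<name>_*") with a glob match over all keys."""
    pattern = "case_%s_*" % case_name
    return sorted(k for k in store if fnmatchcase(k, pattern))

store[case_key("cablestore", "hdmi cables")] = {"doc1234": "10", "doc532": "5"}
store[case_key("cablestore", "ethernet cables")] = {"doc77": "8"}
store[case_key("statelaws", "car tax")] = {"doc9": "3"}

cable_keys = keys_for_case("cablestore")
```

Against a real server, `KEYS` walks the whole keyspace on every call; redis-py's `scan_iter(match=...)` is the incremental alternative.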

Store other info about cases?

New problem: we need to store some information about cases, case name, etc.

{
    "case_cablestore_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "case_cablestore_ethernet cables": {...},
    "case_statelaws_car tax": {...}
}

Where would it go here?

{
    "case_cablestore": {
        "name": "cablestore",
        "created": "20140101"
    },
    "case_cablestore_query_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "case_cablestore_query_ethernet cables": {...},
    "case_statelaws_query_car tax": {...}
}

Oh but let’s add users

Extrapolating on past patterns

{
    "user_doug": {
        "name": "Doug",
        "created_date": "20140101"
    },
    "user_doug_case_cablestore": {
        "name": "cablestore",
        "created_date": "20140101"
    },
    "user_doug_case_cablestore_query_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_doug_case_cablestore_query_ethernet cables": {...},
    "user_tom_case_statelaws_query_car tax": {...}
}

You right now!

image: Rage Wallpaper from Flickr user Thoth God of Knowledge

Step Back

We ask ourselves: Is this tool a product? Is it useful outside of this customer?

What level of software engineering helps us move forward?

● Migrate to RDBMS?

● “NoSQL” options?

● Clean up use of Redis somehow?

SubRedis

Operationalizes hierarchy inside of redis

https://github.com/softwaredoug/subredis

from redis import Redis
from subredis import SubRedis
redis = Redis()

# Create a redis sandbox for this case
sr = SubRedis("case_%s" % caseId, redis)

ratings = {"doc1234": "10", "doc532": "5"}
searchQuery = "hdmi cables"

# Interact with this case’s queries with redis sandbox specific to that case
sr.hsetall(searchQuery, ratings)

Behind the scenes, subredis prepends the case prefix (e.g. case_1) to every key it reads or writes

SubRedis == composable

SubRedis takes any Redis-like thing, and works safely in that sandbox

userSr = SubRedis("user_%s" % userId, redis)

# Now working on a sandbox, within a sandbox
caseSr = SubRedis("case_%s" % caseId, userSr)

# Sandbox redis for queries about user
ratings = {"doc1234": "10", "doc532": "5"}
searchQuery = "hdmi cables"

caseSr.hsetall(searchQuery, ratings)

Does something reasonable under the hood

{
    "user_1_name": "Doug",
    "user_1_created_date": "20140101",
    "user_1_case_1_name": "cablestore",
    "user_1_case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_2_name": "Rena",
    ...
}

Diagram: all of Redis contains the user_1 subredis, which in turn contains the case_1 subredis.
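The nesting can be sketched as a small prefixing wrapper. This is an illustration of the idea only, not subredis's actual implementation: `DictStore` and `PrefixedStore` are hypothetical names, the in-memory store supports just `set`, `get` and `keys`, and a dict value stands in for a Redis hash.

```python
class DictStore(object):
    """In-memory stand-in for a Redis client (assumption: only set/get/keys)."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def keys(self):
        return sorted(self.data)


class PrefixedStore(object):
    """Hypothetical sandbox: prepends a prefix, then delegates to its parent."""
    def __init__(self, prefix, parent):
        self.prefix = prefix + "_"
        self.parent = parent  # any object with set/get/keys, including another sandbox

    def set(self, key, value):
        self.parent.set(self.prefix + key, value)

    def get(self, key):
        return self.parent.get(self.prefix + key)

    def keys(self):
        # Only keys inside this sandbox, with the prefix stripped off again.
        n = len(self.prefix)
        return [k[n:] for k in self.parent.keys() if k.startswith(self.prefix)]


root = DictStore()
user_store = PrefixedStore("user_1", root)
case_store = PrefixedStore("case_1", user_store)  # a sandbox within a sandbox

user_store.set("name", "Doug")
case_store.set("hdmi cables", {"doc1234": "10", "doc532": "5"})
```

Because a sandbox only needs `set`, `get` and `keys` from its parent, it can wrap the root store or another sandbox, which is what makes the approach composable.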

We reflect again

● Ok we tried this out as a product. Launched.

● Paid off *some* tech debt, but wtf are we doing

● Works well enough, we’ve got a bunch of new features, forge ahead

We reflect again

● We have real customers

● Our backend is evolving away from simple key-value storage

○ user accounts? users that share cases? stored search snapshots? etc etc

Attack of the relational

Given our current set of tools, how would we solve the problem “case X can be shared between multiple users”?

Duplicate the data?

{
    "user_1_name": "Doug",
    "user_1_created_date": "20140101",
    "user_1_case_1_name": "cablestore",
    "user_1_case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_2_name": "Rena",
    "user_2_case_1_name": "cablestore",
    "user_2_case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    }
}

Could duplicate the data? This stinks!

● Updates require visiting many (every?) user, looking for this case

● Bloated database

Attack of the relational

Break out cases to a top-level record?

{
    "user_1_name": "Doug",
    "user_1_created_date": "20140101",
    "user_1_cases": [1, ...],
    "case_1_name": "cablestore",
    "case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_2_name": "Rena",
    "user_2_cases": [1, ...],
    ...
}

Store list of owned cases

Diagram: User 1 and User 2 both point at Case 1.

SubRedisRelational?

{
    "user_1_name": "Doug",
    "user_1_created_date": "20140101",
    "user_1_cases": [1, ...],
    "case_1_name": "cablestore",
    "case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_2_name": "Rena",
    "user_2_cases": [1, ...],
    ...
}

We’ve actually just normalized our data.

Why was this good?

● We want to update case 1 in isolation without anomalies

● We don’t want to visit every user to update case 1!

● We want to avoid duplication

We just made our “NoSQL” database a bit relational
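The difference between the two layouts shows up as soon as a shared rating changes. The following is a minimal sketch with plain dicts standing in for the key-value store; the names `duplicated`, `cases`, `users` and `rating_seen_by` are illustrative assumptions, not Quepid code.

```python
import copy

ratings = {"hdmi cables": {"doc1234": "10", "doc532": "5"}}

# Denormalized: every user carries a private copy of the shared case.
duplicated = {
    "user_1": {"case_1": copy.deepcopy(ratings)},
    "user_2": {"case_1": copy.deepcopy(ratings)},
}

# An update has to visit every user that might hold a copy.
copies_touched = 0
for user_cases in duplicated.values():
    if "case_1" in user_cases:
        user_cases["case_1"]["hdmi cables"]["doc532"] = "7"
        copies_touched += 1

# Normalized: the case is a top-level record, users only hold its id.
cases = {1: copy.deepcopy(ratings)}
users = {"user_1": {"cases": [1]}, "user_2": {"cases": [1]}}
cases[1]["hdmi cables"]["doc532"] = "7"  # one write, seen by every owner

def rating_seen_by(user_id, case_id, query, doc_id):
    """Resolve a rating through the user's list of case ids (hypothetical helper)."""
    if case_id not in users[user_id]["cases"]:
        raise KeyError("user does not own this case")
    return cases[case_id][query][doc_id]
```

In the normalized layout the number of writes no longer grows with the number of users sharing the case.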

Other Problems

● Simple CRUD tasks like “delete a case” need to be coded up

● We’re managing our own record ids

● Is any of this atomic? does it occur in isolation?

What’s our next DB?

● These problems are hard, we need a new DB

● We also need better tooling!

Irony

● This is the exact situation we warn clients about in our (No)SQL Architect Roles.

○ Relational == General Purpose

○ Many-many, many-one, one-many, etc

○ Relational == consistent tooling

○ NoSQL == solve specific problems well

So we went relational!

● Took advantage of great tooling: MySQL, SQLAlchemy (ORM), Alembic (migrations)

● Modeled our data relationships exactly like we needed them to be modeled

Map db to Python classes

Can model my domain in coder-friendly classes

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import relationship

# Base is the declarative base class (e.g. created with declarative_base())
class SearchQuery(Base):
    __tablename__ = 'query'
    id = Column(Integer, primary_key=True)
    search_string = Column(String)
    ratings = relationship("QueryRating")

class QueryRating(Base):
    __tablename__ = 'rating'
    id = Column(Integer, primary_key=True)
    # foreign key added so relationship("QueryRating") can find its join condition
    query_id = Column(Integer, ForeignKey('query.id'))
    doc_id = Column(String)
    rating = Column(Integer)
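For readers who want to see the SQL that sits underneath such a mapping, here is a minimal sketch using the standard-library `sqlite3` module. SQLite stands in for MySQL, and the table names `search_query` and `query_rating` are assumptions chosen to stay clear of reserved words; they are not the names used in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE search_query (
        id INTEGER PRIMARY KEY,
        search_string TEXT NOT NULL
    );
    CREATE TABLE query_rating (
        id INTEGER PRIMARY KEY,
        query_id INTEGER NOT NULL REFERENCES search_query(id),
        doc_id TEXT NOT NULL,
        rating INTEGER NOT NULL
    );
""")

# One query row, then one rating row per rated document.
cur = conn.execute(
    "INSERT INTO search_query (search_string) VALUES (?)", ("hdmi cables",)
)
query_id = cur.lastrowid
conn.executemany(
    "INSERT INTO query_rating (query_id, doc_id, rating) VALUES (?, ?, ?)",
    [(query_id, "doc1234", 10), (query_id, "doc532", 5)],
)
conn.commit()

# The join an ORM relationship would issue on our behalf.
rows = conn.execute(
    "SELECT q.search_string, r.doc_id, r.rating "
    "FROM search_query q JOIN query_rating r ON r.query_id = q.id "
    "ORDER BY r.rating DESC"
).fetchall()
```

The `query_id` column plays the role of the foreign key that lets `SearchQuery.ratings` find its rows.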

Easy CRUD

Create!

q = SearchQuery(search_string="hdmi cable")
db.session.add(q)
db.session.commit()

Delete!

del q.ratings[0]
db.session.add(q)
db.session.commit()

Update!

# filter() takes column expressions; keyword lookups go through filter_by()
q = SearchQuery.query.filter_by(id=1).one()
q.search_string = "foo"
db.session.add(q)
db.session.commit()

Migrations are good

How do you upgrade your database to add/move/reorganize data?

● With Redis this was always done manually/scripted

● Migrations with an RDBMS are a very robust/well-understood way to handle this

SQLAlchemy has “alembic” to help:

alembic revision --autogenerate -m "name for tries"
alembic upgrade head
alembic downgrade 0ab51c25c
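To make the idea of a migration concrete, here is a hand-rolled sketch of the bookkeeping that a tool like Alembic automates. It is not how Alembic is implemented: it uses the standard-library `sqlite3` module, tracks the schema version in SQLite's `PRAGMA user_version` (Alembic keeps its revision in an `alembic_version` table), only upgrades, and the `MIGRATIONS` list and `cases` table are assumptions for illustration.

```python
import sqlite3

# Ordered schema changes; list position + 1 is the version a step produces.
MIGRATIONS = [
    "CREATE TABLE cases (id INTEGER PRIMARY KEY)",
    "ALTER TABLE cases ADD COLUMN case_name TEXT",
]

def schema_version(conn):
    """Read the version number stored in the database file itself."""
    return conn.execute("PRAGMA user_version").fetchone()[0]

def upgrade(conn):
    """Apply every migration the database has not seen yet, in order."""
    for version in range(schema_version(conn), len(MIGRATIONS)):
        conn.execute(MIGRATIONS[version])
        # PRAGMA does not accept bound parameters, so format the integer in.
        conn.execute("PRAGMA user_version = %d" % (version + 1))
        conn.commit()

conn = sqlite3.connect(":memory:")
upgrade(conn)
upgrade(conn)  # second call is a no-op: nothing left to apply
columns = [row[1] for row in conn.execute("PRAGMA table_info(cases)")]
```

Running `upgrade` against an older copy of the database applies only the steps it is missing, which is the property that made the manual Redis scripts hard to get right.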

Modeling Users ←→ Cases

Can model many-many relationships

# the table name must be a string
association_table = Table('case2users', Base.metadata,
    Column('case_id', Integer, ForeignKey('case.id')),
    Column('user_id', Integer, ForeignKey('user.id')))

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    cases = relationship("Case", secondary=association_table)

class Case(Base):
    __tablename__ = 'case'
    id = Column(Integer, primary_key=True)
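The same many-to-many shape can be written directly in SQL. This is a minimal sketch with the standard-library `sqlite3` module standing in for MySQL; the tables are named `users` and `cases` (an assumption, since `case` is a reserved word in SQL), and `cases_for` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE cases (id INTEGER PRIMARY KEY, case_name TEXT NOT NULL);
    CREATE TABLE case2users (
        case_id INTEGER NOT NULL REFERENCES cases(id),
        user_id INTEGER NOT NULL REFERENCES users(id),
        PRIMARY KEY (case_id, user_id)
    );
    INSERT INTO users VALUES (1, 'Doug'), (2, 'Rena');
    INSERT INTO cases VALUES (1, 'cablestore'), (2, 'statelaws');
    -- case 1 is shared by both users, case 2 belongs to Doug only
    INSERT INTO case2users VALUES (1, 1), (1, 2), (2, 1);
""")

def cases_for(user_name):
    """Return the names of all cases a user can see, via the association table."""
    rows = conn.execute(
        "SELECT c.case_name FROM cases c "
        "JOIN case2users cu ON cu.case_id = c.id "
        "JOIN users u ON u.id = cu.user_id "
        "WHERE u.name = ? ORDER BY c.case_name",
        (user_name,),
    )
    return [row[0] for row in rows]

# Renaming the shared case is a single UPDATE; both owners see the change.
conn.execute("UPDATE cases SET case_name = 'cable shop' WHERE id = 1")
conn.commit()
```

Sharing a case with another user is one more row in `case2users`, with no copy of the case data.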

Ultimate Query Flexibility

Print all cases:

for user in User.query.all():
    for case in user.cases:
        print(case.caseName)

Cases from paying members:

for user in User.query.filter(User.isPaying == True):
    for case in user.cases:
        print(case.caseName)

Lots of things easier

● backups

● robust hosting services (RDS)

● industrial strength ACID with flexible querying

● 3rd-party tooling (i.e. VividCortex for MySQL)
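The atomicity question raised earlier ("delete a case") is a good place to see what ACID buys. The sketch below uses the standard-library `sqlite3` module rather than MySQL, and the schema and the `fail_midway` switch are assumptions made purely to simulate a crash between two statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (id INTEGER PRIMARY KEY, case_name TEXT NOT NULL);
    CREATE TABLE search_query (
        id INTEGER PRIMARY KEY,
        case_id INTEGER NOT NULL REFERENCES cases(id),
        search_string TEXT NOT NULL
    );
    INSERT INTO cases VALUES (1, 'cablestore');
    INSERT INTO search_query VALUES (1, 1, 'hdmi cables'), (2, 1, 'ethernet cables');
""")

def delete_case(conn, case_id, fail_midway=False):
    """Delete a case and its queries as one unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM search_query WHERE case_id = ?", (case_id,))
            if fail_midway:
                raise RuntimeError("simulated crash between the two deletes")
            conn.execute("DELETE FROM cases WHERE id = ?", (case_id,))
    except RuntimeError:
        return False
    return True

def count(table):
    """Row count for one of the two fixed table names used in this sketch."""
    return conn.execute("SELECT COUNT(*) FROM %s" % table).fetchone()[0]

failed = delete_case(conn, 1, fail_midway=True)
after_failure = (count("cases"), count("search_query"))
succeeded = delete_case(conn, 1)
after_success = (count("cases"), count("search_query"))
```

After the simulated crash both tables are untouched; with hand-managed prefixed keys, the same failure would leave a case record with its queries already gone.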

When NoSQL?

● Solve specific problems well

○ Optimize for specific query patterns

○ Full-Text Search (Elasticsearch, Solr)

○ Caching, shared data structure (Redis)

● Optimize for specific scaling problems

○ Provide a denormalized “view” of your data for specific task
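A denormalized “view” can be derived from the relational source of truth whenever it is needed. This minimal sketch assumes the same illustrative `search_query` and `query_rating` tables as before (SQLite via the standard library, not MySQL) and a hypothetical `build_ratings_view` helper.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE search_query (id INTEGER PRIMARY KEY, search_string TEXT);
    CREATE TABLE query_rating (
        id INTEGER PRIMARY KEY,
        query_id INTEGER REFERENCES search_query(id),
        doc_id TEXT,
        rating INTEGER
    );
    INSERT INTO search_query VALUES (1, 'hdmi cables'), (2, 'ethernet cables');
    INSERT INTO query_rating (query_id, doc_id, rating)
        VALUES (1, 'doc1234', 10), (1, 'doc532', 5), (2, 'doc77', 8);
""")

def build_ratings_view(conn):
    """Flatten the normalized rows into {search_string: {doc_id: rating}}."""
    view = defaultdict(dict)
    rows = conn.execute(
        "SELECT q.search_string, r.doc_id, r.rating "
        "FROM search_query q JOIN query_rating r ON r.query_id = q.id"
    )
    for search_string, doc_id, rating in rows:
        view[search_string][doc_id] = rating
    return dict(view)

ratings_view = build_ratings_view(conn)
```

Each inner dict has the same shape as the Redis hash from the first prototype, so it could be cached in Redis while the relational tables remain the source of truth.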

Final Thoughts

Sometimes RDBMSs have a harder initial hurdle: setup, figuring out migrations, data modeling, etc.

Why isn’t the easy path the wise path?

In conclusion