Stop Worrying & Love the SQL - A Case Study

transcript

Stop WorryingLove the SQL! (the Quepid story)

OpenSource Connections

Me/UsDoug Turnbull@softwaredoug

Likes: Solr, Elasticsearch, Cassandra, Postgres

OpenSource Connections@o19s

Search, Discovery and Analytics

Let us introduce you to freelancing!

Most Importantly we do...Make my search results more relevant!

“Search Relevancy”

What database works best for problem X?“(No)SQL Architect/Trusted Advisor”

How products actually get built

Rena: Doug, John can you come by this afternoon?

One of our Solr-based products needs some urgent relevancy work

Its Friday, it needs to get done today!

Us: Sure!

The Client(Rena!)smart cookie!

A few hours laterUs: we’ve made a bit of progress!

image frustration-1081 by jseliger2

Rena: but everytime we fix something, we break an existing search!

Us: yeah! we’re stuck in a whack-a-mole-game

other image: whack a mole by jencu

Whack-a-MoleWhat search relevancy work actually looks like

I HAVE AN IDEA● Middle of the afternoon, I stop doing search

work and start throwing together some python

from flask import Flaskapp = Flask(__name__)

Everyone: Doug, stop that, you have important search work to do!

Me: We’re not making any progress!WE NEED A WAY TO REGRESSION TEST OUR RELEVANCY AS WE TUNE!

Everyone: You’re nuts!

What did I make?Focus on gathering stakeholder (ie Rena) feedback on search, coupled w/ workbench tuning against that feedback

Today we have customers...

… forget that, tell me about your failures!

Our war storyMy mistakes:

● Building a product● Selling a product● As a user experience engineer● As an Angular developer● At choosing databases

Quepid 0.0.0.0.0.0.1Track multiple user searches

for this query (hdmi cables) Rena rates this document as a good/bad search result

need to store:<search> -> <id for search result> -> <rating 1-10>“hdmi cables” -> “doc1234” -> “10”

*Actual UI may have been much uglier

Data structure selection under duress

● What’s simple, easy, and will persist our data?

● What plays well with python?

● What can I get working now in Rena’s office?

Redis● In memory “Data Structure Server”

○ hashes, lists, simple key-> value storage

● Persistent -- write to disk every X minutes

from redis import Redisredis = Redis()redis.set("foo", "bar")redis.get("foo") # gets ‘bar’

$ pip install redis

Easy to install and go! Specific to our problem:

from redis import Redisredis = Redis()

ratings = {“doc1234”: “10”, “doc532”: “5”}searchQuery = “hdmi cables”

redis.hsetall(searchQuery, ratings)

Store a hash table at “hdmi cables” with:

“doc1234” -> “10”“doc532” -> “5”

Success!● My insanity paid off that afternoon

● Now we’re left with a pile of hacked together (terrible) code -- now what?

Adding some features● Would like to add multiple “cases”

(different search projects that solve different problems)

● Would like to add user accounts

● Still a one-off for Silverchair

CasesTuning a cable shopping site... … vs state laws

Cases in Redis?

from redis import Redisredis = Redis()

redis.hset(searchQuery, ratings)

Recall our existing implementation“data model”

Out of the box, redis can deal with 2 levels deep:{

“hdmi cables”: {“doc1234”: “10”,“doc532”: “5”

},“ethernet cables”...

Can’t add extra layer (redis hash only one layer)

{“cable site”: {“hdmi cables”: {...}“ethernet cables”: {...}

}“laws site: {...}}

Time to give up Redis?“All problems in computer science can be solved by another level of indirection” -- David Wheeler

Crazy Idea: Add dynamic prefix to query keys to indicate case, ie:{

“case_cablestore_hdmi cables”: {“doc1234”: “10”,“doc532”: “5”

},“case_cablestore_ethernet cables”: {… },“case_statelaws_car tax”: { …}

Queries for “Cable Store” case

Query for “State Laws” case

redis.keys(“case_cablestore*”)

To Fetch:

Store other info about cases?New problem: we need to store some information about cases, case name, et

{“case_cablestore_hdmi cables”: {

“doc1234”: “10”,“doc532”: “5”

},“case_cablestore_ethernet cables”: {… },“case_statelaws_car tax”: { …}

Where would it go here?{

“case_cablestore” {“name”: “cablestore”,“created” “20140101”

},“case_cablestore_query_hdmi cables”: {

“doc1234”: “10”,“doc532”: “5”

},“case_cablestore_query_ethernet cables”:

{… },“case_statelaws_query_car tax”: { …}

Oh but let’s add usersExtrapolating on past patterns {

“user_doug” {“name”: “Doug”,“created_date”: “20140101”

},“user_doug_case_cablestore” {

“name”: “cablestore”,“created_date” “20140101”

},“user_doug_case_cablestore_query_hdmi cables”: {

“doc1234”: “10”,“doc532”: “5”

},“user_doug_case_cablestore_query_ethernet cables”:

{… },“user_tom_case_statelaws_query_car tax”: { …}

}image: Rage Wallpaper from Flickr user Thoth God of Knowledge

You right now!

Step BackWe ask ourselves: Is this tool a product? Is it useful outside of this customer?

What level of software engineering helps us move forward?

● Migrate to RDMS?● “NoSQL” options?● Clean up use of Redis somehow?

SubRedis

Operationalizes hierarchy inside of redis

https://github.com/softwaredoug/subredis

from redis import Redisfrom subredis import SubRedisredis = Redis()

sr = SubRedis(“case_%s” % caseId , redis)

sr.hsetall(searchQuery, ratings)

Create a redis sandbox for this case

Interact with this case’s queries with redis sandbox specific to that case

Behind the scenes, subredis queries/appends the case_1 prefix to everything

SubRedis == composable

userSr = SubRedis(“user_%s” % userId , redis)

caseSr = SubRedis(“case_%s” % caseId , userSr)

# Sandbox redis for queries about userratings = {“doc1234”: “10”, “doc532”: “5”}searchQuery = “hdmi cables”

caseSr.hsetall(searchQuery, ratings)

SubRedis takes any Redis like thing, and works safely in that sandbox

Now working on sandbox, within a sandbox

Does something reasonable under the hood

“user_1_name”: “Doug”,“user_1_created_date”: “Doug”,“user_1_case_1_name”: “name”: “cablestore”“user_1_case_1_hdmi cables”: {

“doc1234”: “10”,“doc532”: “5”

},“user_2_name”, “Rena”,...

AllRedis

user_1 subred.

case_1subred.

We reflect again● Ok we tried this out as a product. Launched.

● Paid off *some* tech debt, but wtf are we doing

● Works well enough, we’ve got a bunch of new features, forge ahead

We reflect again● We have real customers

● Our backend is evolving away from simple key-value storage○ user accounts? users that share cases? stored

search snapshots? etc etc

Attack of the relationalGiven our current set of tools, how would we solve the problem“case X can be shared between multiple users”?

“user_1_name”: “Doug”,“user_1_created_date”: “Doug”,“user_1_case_1_name”: “name”: “cablestore”“user_1_case_1_hdmi cables”: {

“doc1234”: “10”,“doc532”: “5”

},“user_2_name”, “Rena”,“user_2_case_1_name”: “name”: “cablestore”“user_2_case_1_hdmi cables”: {

“doc1234”: “10”,“doc532”: “5”

Could duplicate the data? This stinks!

● Updates require visiting many (every?) user, looking for this case

● Bloated database

Duplicate the data?

Attack of the relationalGiven our current set of tools, how would we solve the problem“case X can be shared between multiple users”?

“user_1_name”: “Doug”,“user_1_created_date”: “Doug”,“user_1_cases”: [1, ...]“case_1_name”: “name”: “cablestore”“case_1_hdmi cables”: {

“doc1234”: “10”,“doc532”: “5”

},“user_2_name”, “Rena”,“user_2_cases”: [1, ...]...

User 1

Case 1

User 2

Store list of owned cases

Break out cases to a top-level record?

SudRedisRelational?{

“user_1_name”: “Doug”,“user_1_created_date”: “Doug”,“user_1_cases”: [1, ...]“case_1_name”: “name”: “cablestore”“case_1_hdmi cables”: {

“doc1234”: “10”,“doc532”: “5”

},“user_2_name”, “Rena”,“user_2_cases”: [1, ...]...

We’ve actually just normalized our data.

Why was this good?● We want to update case 1 in isolation

without anomalies● We don’t want to visit every user to

update case 1!● We want to avoid duplication

We just made our “NoSQL” database a bit relational

Other Problems● Simple CRUD tasks like “delete a case”

need to be coded up

● We’re managing our own record ids

● Is any of this atomic? does it occur in isolation?

What’s our next DB?● These problems are hard, we need a new

● We also need better tooling!

Irony● This is the exact situation we warn clients

about in our (No)SQL Architect Roles.○ Relational == General Purpose○ Many-many, many-one, one-many, etc○ Relational == consistent tooling

○ NoSQL == solve specific problems well

So we went relational!● Took advantage of great tooling: MySQL,

Sqlalchemy (ORM), Alembic (migrations)

● Modeled our data relationships exactly like we needed them to be modeled

Map db Python classes

class SearchQuery(Base): __tablename__ = 'query' id = Column(Integer, primary_key=True) search_string = Column(String) ratings = relationship("QueryRating")

class QueryRating(Base): __tablename__ = 'rating' id = Column(Integer, primary_key=True) doc_id = Column(String) rating = Column(Integer)

Can model my domain in coder-friendly classes class SearchQuery(Base):

__tablename__ = 'query' id = Column(Integer, primary_key=True) search_string = Column(String) ratings = relationship("QueryRating")

class QueryRating(Base): __tablename__ = 'rating' id = Column(Integer, primary_key=True) doc_id = Column(String) rating = Column(Integer)

Easy CRUDq = SearchQuery(search_string=”hdmi cable”)db.session.add(q)db.session.commit()

del q.ratings[0]db.session.add(q)db.session.commit()

q = SearchQuery.query.filter(id=1).one()q.search_string=”foo”db.session.add(q)db.session.commit()

Create!

Delete!

Update!

Migrations are good

alembic revision --autogenerate -m "name for tries"alembic upgrade headalembic downgrade 0ab51c25c

How do you upgrade your database to add/move/reorganize data?

● Redis this was always done manually/scripted

● Migrations with RDMS are a very robust/well-understood way to handle this

SQLAlchemy has “alembic” to help:

Modeling Users ←→ Casesassociation_table = Table(case2users, Base.metadata, Column('case_id', Integer, ForeignKey('case.id')), Column('user_id', Integer, ForeignKey('user.id')))

class User(Base): __tablename__ = 'user' id = Column(Integer, primary_key=True) cases = relationship("Case", secondary=association_table)

class Case(Base): __tablename__ = 'case' id = Column(Integer, primary_key=True)

Can model many-many relationships

Ultimate Query Flexibilityfor user in User.query.all(): for case in user.cases: print case.caseName

for user in User.query.filter(User.isPaying==True): for case in user.cases: print case.caseName

Print all cases:

Cases from paying members:

Lots of things easier● backups● robust hosting services (RDS)● industrial strength ACID with flexible

querying● 3rd-party tooling (ie VividCortex for MySQL)

When NoSQL?● Solve specific problems well

○ Optimize for specific query patterns○ Full-Text Search (Elasticsearch, Solr)○ Caching, shared data structure (Redis)

● Optimize for specific scaling problems○ Provide a denormalized “view” of your data for

specific task

Final ThoughtsSometimes RDMS’s have harder initial hurdle for setup, figuring out migrations; data modeling; etc

Why isn’t the easy path the wise path?

In conclusion

Stop Worrying & Love the SQL - A Case Study

Technology