
How (and why!) to build a Django-based project with SQLAlchemy Core for data analysis

Hi! I'm Gleb Pushkov, a software developer with 6+ years of experience (Python & Django).

Kyiv, Ukraine

glib.pushkov@gmail.com

https://github.com/glebtor

Link to slides

Why do we need SQLAlchemy Core in a Django app?

You're building some kind of data-analysis app, e.g.:

- your application mostly works with aggregations
- you have a lot of data
- you need precise and performant queries
- you're building advanced queries dynamically
- you're transforming complex queries from SQL to Python
- your database is not natively supported by Django (e.g. SQL Azure, Sybase, Firebird)

Cool new features:

- Subqueries
- Window functions
- FilteredRelation
- Conditional Expressions
- Date, Math, Text functions
- Custom DB constraints

You have fewer reasons to switch to raw SQL!
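For instance, a window function in the ORM can annotate each row with a per-group aggregate. A minimal sketch against the Property model used throughout this talk (the avg_city_price name is mine):

from django.db.models import Avg, F, Window

# Sketch: attach the average sale price of each property's city to
# every row, via a window function rather than a separate GROUP BY.
Property.objects.annotate(
    avg_city_price=Window(
        expression=Avg('sale_price'),
        partition_by=[F('city')],
    )
)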

But... the Django ORM has its specifics:

Property.objects.filter(city__startswith='K').select_related('owner')[:5]

SELECT"properties"."id","users"."username"...

FROM "properties"LEFT OUTER JOIN "users" ON ("properties"."owner_id" = "users"."id")WHERE "properties"."city" LIKE 'K%'LIMIT 5

SQL getting simpler - Python query getting more complex

(Property.objects
    .filter(city__startswith='K')
    .annotate(owner_name=F('owner__username'))
    .values('id', 'owner_name')[:5])

SELECT "properties"."id", "users"."username" AS "owner_name"
FROM "properties"
LEFT OUTER JOIN "users" ON ("properties"."owner_id" = "users"."id")
WHERE "properties"."city" LIKE 'K%'
LIMIT 5


We JOIN all rows, and only then apply the LIMIT.

Explain

What we want to get

SELECT"properties_by_city"."id","users"."username"FROM (SELECT "properties"."id" AS "id", "properties"."owner_id" AS "owner_id" FROM "properties" WHERE "properties"."city" LIKE 'K%'LIMIT 5) AS "properties_by_city"LEFT OUTER JOIN "users"ON "users"."id" = "properties_by_city"."owner_id"

Looks good!

Explain

How it will look in SQLAlchemy Core

properties_by_city = (
    select([properties.c.uuid, properties.c.owner_id])
    .select_from(properties)
    .where(properties.c.city.like('K%'))
    .limit(5)
    .alias()
)

query = select([properties_by_city.c.uuid, users.c.username]).select_from(
    properties_by_city.outerjoin(users)
)
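Not on the slides, but for context: a Core statement like this is executed through the Engine (created in the "how to start" part below), and each returned row is a RowProxy:

# Sketch: sa_engine is the global Engine described later in the talk.
rows = sa_engine.execute(query).fetchall()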

How it will look in the Django ORM

You can't build such queries!

In the ORM world everything is tied to models (in Python) and tables (in the DB).

We can use `Subquery` in SELECT, WHERE, HAVING, but not in FROM. The "root" of a query is always a model/table.

How a similar query will look in the Django ORM

properties_by_city = (
    Property.objects.filter(city__startswith='K')[:5].values('pk')
)

Property.objects.filter(
    pk__in=Subquery(properties_by_city)
).annotate(owner_name=F('owner__username')).values('pk', 'owner_name')

Looks good!

Explain

With the Django ORM you get everything; with SQLAlchemy Core you get only what you asked for - they operate on different layers.

Django ORM layers:   Django ORM → non-public API → raw SQL
SQLAlchemy layers:   SQLAlchemy ORM → SQLAlchemy Core → raw SQL

Django ORM:
Model.objects.filter(...).values(...).annotate(...).filter(...)

SQLAlchemy Core:
select([...]).select_from(...).where(...).group_by(...).having(...)

SQL:
SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ...

There is a distance between the SQL and ORM layers, so sometimes it's not clear which query will be generated.

There is no freedom at the ORM level!

SQLAlchemy example

Select properties by criteria

level1 = (
    select([
        properties.c.building_id,
        properties.c.sale_price,
        properties.c.owner_id,
    ])
    .select_from(properties)
    .where(properties.c.selling_status == 'for_sale')
    .where(properties.c.sale_price != None)
    .alias()
)

Join usernames

level2 = (
    select([
        level1.c.building_id,
        level1.c.sale_price,
        users.c.username,
    ])
    .select_from(level1.outerjoin(users))
    .alias()
)

Group by

level3 = (
    select([
        level2.c.building_id,
        func.count(level2.c.building_id).label('apartments_count'),
        func.sum(level2.c.sale_price).label('sum_price'),
        func.array_agg(level2.c.username).label('users'),
    ])
    .select_from(level2)
    .group_by(level2.c.building_id)
    .alias()
)

One more join at the top

level4 = (
    select([
        properties.c.total_apartments,
        level3.c.apartments_count,
        level3.c.sum_price,
        level3.c.users,
    ])
    .select_from(
        level3.join(properties, properties.c.uuid == level3.c.building_id)
    )
)
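An aside that helps while stacking levels like this: stringifying a Core statement renders the SQL it compiles to, with named placeholders for bound parameters - which is exactly what the next slide shows.

# Sketch: print the generated SQL of the statement built above.
print(level4)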

-- level4 - join
SELECT properties.total_apartments, anon_1.apartments_count, anon_1.sum_price, anon_1.users
FROM (
    -- level3 - group by
    SELECT anon_2.building_id AS building_id,
           count(anon_2.building_id) AS apartments_count,
           sum(anon_2.sale_price) AS sum_price,
           array_agg(anon_2.username) AS users
    FROM (
        -- level2 - join
        SELECT anon_3.building_id AS building_id,
               anon_3.sale_price AS sale_price,
               users.username AS username
        FROM (
            -- level1
            SELECT properties.building_id AS building_id,
                   properties.sale_price AS sale_price,
                   properties.owner_id AS owner_id
            FROM properties
            WHERE properties.selling_status = :selling_status_1
              AND properties.sale_price IS NOT NULL
        ) AS anon_3
        LEFT OUTER JOIN users ON users.uuid = anon_3.owner_id
    ) AS anon_2
    GROUP BY anon_2.building_id
) AS anon_1
JOIN properties ON properties.uuid = anon_1.building_id

Wow! SQLAlchemy <3

Aggregation in subquery

Situation: we want to find the lowest price above 1,000,000, and then get all properties with exactly that price. So we need a MIN aggregation.

SQL we want to get:

SELECT * FROM "properties" WHERE "properties"."sale_price" = (
    SELECT MIN("properties"."sale_price")
    FROM "properties"
    WHERE "properties"."sale_price" > 1000000
)

Aggregation in subquery - take 1

min_price = Property.objects.filter(
    sale_price__gte=1000000
).aggregate(Min('sale_price'))['sale_price__min']  # evaluated :(

Property.objects.filter(sale_price=min_price)

SELECT * FROM "properties" WHERE "properties"."sale_price" = 1234567

We performed two separate queries

Aggregation in subquery - take 2 - subquery

min_price = Property.objects.filter(
    sale_price__gte=1000000
).values('sale_price').order_by('sale_price')[:1]

Property.objects.filter(sale_price=Subquery(min_price))

... WHERE "properties"."sale_price" = (
    SELECT U0."sale_price" FROM "properties" U0
    WHERE U0."sale_price" >= 1000000
    ORDER BY U0."sale_price" ASC LIMIT 1
)

Now we have ORDER BY and LIMIT... too complicated, both for the code and for the DB.

Aggregation in subquery - take 3 - not recommended

min_price_queryset = Property.objects.filter(sale_price__gte=1000000)
min_price_queryset.query.add_annotation(
    Min('sale_price'), 'min_price', is_summary=True)

Property.objects.filter(
    sale_price=Subquery(min_price_queryset.values('min_price')))

WHERE "properties"."sale_price" = (
    SELECT MIN(U0."sale_price") AS "min_price"
    FROM "properties" U0
    WHERE U0."sale_price" >= 1000000
)

Aggregation in subquery - take 4 - template

class MinSalePrice(Subquery):
    template = "(SELECT MIN(sale_price) FROM (%(subquery)s) _subq)"
    output_field = models.IntegerField()

filtered_properties = Property.objects.filter(sale_price__gte=1000000)

Property.objects.filter(sale_price=MinSalePrice(filtered_properties))

The generated SQL is fine, but such an approach is a kind of hack and...

Aggregation in subquery - SQLAlchemy

min_price = (
    select([func.min(properties.c.sale_price)])
    .select_from(properties)
    .where(properties.c.sale_price >= 1000000)  # the filter from the takes above
)

query = (
    select([properties.c.id])
    .select_from(properties)
    .where(properties.c.sale_price == min_price)
)

Joins:

- You can't join tables of non-related models
- You can't perform a RIGHT OUTER JOIN... yes, it's very rare :)
- Django decides for you which join type to apply (INNER or LEFT OUTER) and always generates
  JOIN "table2" ON ("table1"."table2_id" = "table2"."id")
  which can be customized a bit with FilteredRelation (see the sketch below)
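A FilteredRelation sketch, assuming the talk's Property model with an owner FK to a standard Django user (the condition and the active_owner name are illustrative):

from django.db.models import FilteredRelation, Q

# Sketch: the condition goes into the JOIN's ON clause instead of WHERE.
Property.objects.annotate(
    active_owner=FilteredRelation(
        'owner', condition=Q(owner__is_active=True),
    ),
).filter(active_owner__username__startswith='K')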

Not supported by Django (yet): recursive CTE

Could be done via:

- raw SQL
- django-cte-forest (implemented via `extra`, has limitations)
- SQLAlchemy (see the sketch below)
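For illustration, a recursive CTE in SQLAlchemy Core might look like this (a sketch with a hypothetical self-referencing categories table):

from sqlalchemy import Column, Integer, MetaData, String, Table, select

metadata = MetaData()

# Hypothetical self-referencing table, just for the sketch:
categories = Table(
    'categories', metadata,
    Column('id', Integer, primary_key=True),
    Column('parent_id', Integer),
    Column('name', String(64)),
)

# Anchor member: the roots (rows without a parent).
tree = (
    select([categories.c.id, categories.c.name])
    .where(categories.c.parent_id == None)
    .cte(name='tree', recursive=True)
)

# Recursive member: children of rows already collected in the CTE.
tree = tree.union_all(
    select([categories.c.id, categories.c.name])
    .where(categories.c.parent_id == tree.c.id)
)

query = select([tree])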

Combining multiple aggregations © Django docs <3

>>> book = Book.objects.first()
>>> book.authors.count()
2
>>> book.store_set.count()
3
>>> q = Book.objects.annotate(Count('authors'), Count('store'))
>>> q[0].authors__count
6
>>> q[0].store__count
6

Count('field', distinct=True) will fix this query, but other aggregations will not work as expected!
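Concretely, the fixed query looks like this (the expected counts follow from the example above):

from django.db.models import Count

# Distinct counts avoid the row multiplication caused by the two JOINs.
q = Book.objects.annotate(
    Count('authors', distinct=True),
    Count('store', distinct=True),
)
q[0].authors__count  # 2
q[0].store__count    # 3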

To sum up...

- Hard to read advanced queries
- Hard to understand what's going on at the SQL level
- Takes time & effort to convert SQL to Python
- Can't control / change some parts of the generated SQL
- Queries can be inefficient

Usually all of the above is not a problem in 95%* of cases.

* Just a number from my head

But only when you're building some kind of data-analysis app, e.g.:

- your application mostly works with aggregations
- you have a lot of data
- you need precise and performant queries
- you're transforming complex queries from SQL to Python
- you're building advanced queries dynamically
- your database is not natively supported by Django (e.g. SQL Azure, Sybase, Firebird)

OK, how to start?

1. Create `Engine` as a global variable and describe your connection

Engine → Pool + Dialect → DBAPI connect() → Database

QueuePool is the default; to disable pooling, use NullPool (sketch below).

sa_engine = create_engine(
    settings.DB_CONNECTION_URL,
    pool_recycle=settings.POOL_RECYCLE,
)
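If you do want to disable pooling, a sketch (the connection URL here is illustrative, not from the talk):

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# NullPool opens and closes a real DBAPI connection per checkout
# instead of keeping connections around.
sa_engine = create_engine(
    'postgresql://user:password@localhost:5432/mydb',
    poolclass=NullPool,
)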

Pooling

4 uWSGI workers × 8 threads; each thread holds a Django connection and a SQLAlchemy (NullPool) connection, routed through pgbouncer to Postgres.

1 instance produces up to 64 connections
1 connection to Postgres ~ 10 MB of RAM
124 connections == ~1.2 GB of RAM

2. Define tables

If you have models for the tables:

- re-use Django models (aldjemy)
- table reflection (django-sabridge)

No models (see the reflection sketch below):

- table reflection: messages = Table('messages', meta, autoload=True)
- define explicitly
- define inline with expressions
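A reflection sketch (assuming the sa_engine from step 1; autoload reads the column definitions from the live database):

from sqlalchemy import MetaData, Table

meta = MetaData()
# Columns are introspected from the database, so no explicit schema is needed.
messages = Table('messages', meta, autoload=True, autoload_with=sa_engine)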

Define explicitly

users = Table('users', metadata,
    Column('id', Integer, primary_key=True),
    Column('username', String(150), nullable=False),
    Column('email', String(254)),
    Column('role', String(64), nullable=False),
)

Define explicitly

Or even keep it simple; but to simplify `join`, describe the ForeignKeys:

users = Table('users', metadata,
    Column('id'),
    Column('username'),
    Column('email'),
    Column('role'),
)

properties = Table('properties', metadata,
    Column('owner_id', None, ForeignKey('users.id')),
    ...
)
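With the ForeignKey described, join() can infer the ON clause by itself (a small sketch):

from sqlalchemy import select

# No explicit ON clause needed: SQLAlchemy derives it from the FK above.
query = select([properties.c.owner_id, users.c.username]).select_from(
    properties.join(users)
)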

Usage

Each returned row is a RowProxy:

all_users = engine.execute(
    select([users.c.username, users.c.email]).select_from(users)
).fetchall()

[('johndoe', 'john_doe@example.com'), ('janedoe', 'jane_doe@example.com')]

all_users[0].username / all_users[0]['username'] / all_users[0][0]

Define inline with expressions

from sqlalchemy import table, column

engine.execute(
    select([column('username'), column('email')])
    .select_from(table('users'))
).fetchall()

# Or if you need columns to be associated with tables:
user = table(
    'user',
    column('id'),
    column('username'),
)
# and then queries like: select([user.c.username, ...

That's all, start building your fancy queries!

But what about tests?

Switch connection!

def create_sa_engine(connection_url):
    extra = {...}
    if "pytest" in sys.modules:
        connection_url = _get_test_db_url(connection_url)
    return create_engine(connection_url, **extra)

engine = create_sa_engine(settings.REMOTE_DB_CONNECTION_URL)

As `engine` is a global variable, it is evaluated earlier than any call to django.test.override_settings (or similar approaches). The test DB will be created by Django (if it's listed in settings).
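The `_get_test_db_url` helper is elided on the slide; one possible implementation, assuming Django's default `test_` name prefix:

def _get_test_db_url(connection_url):
    # Hypothetical implementation: Django names the test database
    # "test_<name>", so rewrite the database part of the URL.
    base, _, db_name = connection_url.rpartition('/')
    return '{}/test_{}'.format(base, db_name)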

Pytest + ResultProxy cursor issue

return self.process_rows(result_proxy) # iterates over ResultProxy

If an exception is raised during iteration over a cursor (ResultProxy), pytest will hang forever.

We have to close the cursor explicitly:

from functools import wraps
from itertools import chain

from sqlalchemy.engine import ResultProxy


def close_cursors(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Close any ResultProxy passed in before re-raising,
            # otherwise pytest hangs on the open cursor.
            for arg in chain(args, kwargs.values()):
                if isinstance(arg, ResultProxy):
                    arg.close()
            raise
    return wrapper


@close_cursors
def process_rows(result_proxy: ResultProxy):
    ...

TestCases & connections

TestCase - wraps each test in a transaction and performs a rollback

TransactionTestCase - code is not wrapped in a transaction; all tables are truncated afterwards

The application holds both the Django connection and the SQLAlchemy connections to the database, each in its own transaction (READ COMMITTED isolation).

When to use TestCase

- if you write tests for code which works with only one of the connections
- and the test data is populated via the same connection

Keep in mind:

- if tables are populated via the SQLAlchemy connection, you have to clean them up yourself
- it's possible to share data between connections, but it requires changing the transaction isolation level (READ UNCOMMITTED), which I would not recommend

When to use TransactionTestCase

- to test code which works with both connections (no issues because of the autocommit behavior)

Keep in mind:

- these tests are slower
- model-related tables are flushed automatically
- other tables (if you have any) have to be cleaned up by yourself
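A sketch of such a test (assuming the Property model, the properties table, and the sa_engine from earlier; names are illustrative):

from django.test import TransactionTestCase
from sqlalchemy import func, select


class PropertyQueriesTest(TransactionTestCase):
    def test_sqlalchemy_sees_django_rows(self):
        # With autocommit, rows created via the Django connection are
        # visible to the separate SQLAlchemy connection.
        Property.objects.create(city='Kyiv', sale_price=100)
        count = sa_engine.execute(
            select([func.count()]).select_from(properties)
        ).scalar()
        self.assertEqual(count, 1)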

Drawbacks

- A bit hard to start
- Can't easily get the final SQL query with parameters
- Slower tests
- More connections to the database
- Can't reuse libraries which work with querysets (e.g. django-filters, pagination)

Benefits

- Full control over SQL
- Faster to express SQL in Python code
- Easier to build an application-specific SQL-generation layer
- Readability & maintainability
- Performance

Questions?

Thank you for your attention! Link to slides