Date post: | 20-Jan-2015 |
Category: | Technology |
Upload: | henok80 |
22nd October 2012
Python <3 Content systems - managing millions of tracks for the masses
Tuesday, October 23, 12
> 18 M tracks
> 1 century of listening
> 20 k new tracks added per day
> 500 M playlists
> Available in 15 Countries
> 15 M active users*
* Users active within the previous 30 days
Service overview
...
Storage
User
Search
Metadata
AP
Label A
Label B
Label C
Label D
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
Ingestion
Content pipeline
Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
Ingestion
Ingestion: Delivery formats
~ 10 different incoming XML formats
- Proprietary formats (majors)
- Spotify delivery format (mostly indies)
Thousands of lines of source-specific code
Data model [simplified]
[Diagram: Album, Disc, Track, Artist, Rights, Audio and Transcoding entities, linked by relations with 1 and * cardinalities]
Ingestion
lxml and XSLT with extensions for parsing/transforming XML
Ingestion: XPath extensions
>>> def formerlify(_, name):
...     return 'The artist formerly known as %s' % name
...
>>> # Register the function in a namespace bound to the prefix 'f'
>>> from lxml import etree
>>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
>>> ns['formerlify'] = formerlify
>>> ns.prefix = 'f'
>>> root = etree.XML('<a><b>Prince</b></a>')
>>> print(root.xpath('f:formerlify(string(b))'))
The artist formerly known as Prince
http://lxml.de/extensions.html#xpath-extension-functions
Ingestion
>>> import timeit
>>> timeit.timeit('e.parse("huge.xml")', setup='import lxml.etree as e', number=5) / 5
4.19...
>>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.cElementTree as e', number=5) / 5
4.78...
>>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.ElementTree as e', number=5) / 5
55.39...
Fun (?!) fact: the largest XML file seen so far had 3.3 million rows and took up 350 MB of disk space
The Bible apparently fits in 3 MB of XML
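For files of this size, one option (not shown on the slides) is to stream the document rather than build the whole tree in memory. A minimal sketch using the standard library's xml.etree.ElementTree.iterparse; the element name `track` and the helper `count_tracks` are hypothetical, not part of any delivery format described here:

```python
import io
import xml.etree.ElementTree as ET


def count_tracks(source, tag="track"):
    """Stream through an XML document and count elements named `tag`."""
    count = 0
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the text and children of the finished element to save memory.
            elem.clear()
    return count


# Tiny in-memory stand-in for a large delivery file (hypothetical content).
sample = io.BytesIO(b"<album><track>a</track><track>b</track></album>")
print(count_tracks(sample))
```

The same loop accepts a filename such as "huge.xml" in place of the in-memory buffer.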
Label A
Label B
Label C
Label D
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
Ingestion
Content pipeline
Merge
Centralized vs. aggregated cataloging
Requires merging!
Requires humans!
Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08
Metadata - challenges
Label A
Label B
Label C
Label D
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
Ingestion
Content pipeline
Merge
Curation/enrichment
Ambiguous artists - thesis work
• User input
• Machine learning
• Matching against external sources
• Feature selection (#matches per external source, len(name), country-count, multilingual)
• Matching + preprocessing in Python
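As a rough illustration of the feature list above, the sketch below builds one feature dict per artist name. Everything here is an assumption for illustration: the function name, the input shapes, the source names, and in particular the reading of "multilingual" as "the name mixes more than one Unicode script".

```python
import unicodedata


def artist_features(name, matches_per_source, countries):
    """Build a feature dict for one artist name (all inputs hypothetical)."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            # First word of the Unicode character name, e.g. LATIN or CYRILLIC.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    features = {
        "name_length": len(name),
        "country_count": len(set(countries)),
        # Assumption: "multilingual" means the name uses several scripts.
        "multilingual": len(scripts) > 1,
    }
    # One feature per external source: how many candidates matched there.
    for source, n_matches in matches_per_source.items():
        features["matches_" + source] = n_matches
    return features


print(artist_features("Prince", {"source_a": 3, "source_b": 1}, ["SE", "US", "SE"]))
```

A dict like this can then be fed to whichever classifier is used downstream.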
Content matching
(16 * 10 ** 6) ** 2 = 2.56 * 10 ** 14, a large number
Reduce search space:
>>> from unicodedata import normalize
>>> # Blocking key: strip accents, lowercase, keep the first 5 characters
>>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]
Side note: Levenshtein (edit) distance is a heavy operation
-> sped up about 4x with PyPy (or use a C extension)
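To make the search-space reduction concrete, here is a minimal pure-Python sketch that groups titles by such a key and only computes edit distances inside each group. The helper names, the key length of 5 and the distance threshold are assumptions for illustration, not the production matcher:

```python
from collections import defaultdict
from itertools import combinations
from unicodedata import normalize


def block_key(title, length=5):
    """Accent-stripped, lowercased prefix used to bucket similar titles."""
    return ''.join(normalize('NFD', ch)[0].lower() for ch in title)[:length]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def candidate_pairs(titles, max_distance=2):
    """Yield near-duplicate title pairs, comparing only within a bucket."""
    blocks = defaultdict(list)
    for title in titles:
        blocks[block_key(title)].append(title)
    for group in blocks.values():
        for a, b in combinations(group, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                yield a, b


print(list(candidate_pairs(["Purple Rain", "Purple Rainn", "Kiss"])))
```

The trade-off is that two titles differing within their first five characters never meet, so the key choice decides which duplicates can be found at all.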
Automatic data processing will never be perfect
Patch it!
Label A
Label B
Label C
Label D
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
Ingestion
Content pipeline
Transcoding
Merge
Curation/enrichment
Transcoding
Asynchronous
RabbitMQ + amqplib
Master / workers
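The master/workers pattern can be sketched without a broker. The example below uses an in-process queue.Queue and threads as a stand-in for RabbitMQ, so it shows only the shape of the flow; `transcode`, `run_master` and the track ids are hypothetical, and no amqplib calls are shown:

```python
import queue
import threading


def transcode(track_id):
    # Placeholder for the real transcoding work (hypothetical).
    return "transcoded:%s" % track_id


def worker(jobs, results):
    """Consume jobs until the master sends a None sentinel."""
    while True:
        track_id = jobs.get()
        if track_id is None:
            break
        results.put(transcode(track_id))


def run_master(track_ids, n_workers=3):
    """Publish jobs, start workers, and collect results once all are done."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for track_id in track_ids:
        jobs.put(track_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)


print(run_master(["t1", "t2", "t3"]))
```

With a real broker the queue outlives the processes, so workers can run on other machines and be restarted independently of the master.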
Label A
Label B
Label C
Label D
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
Ingestion
Content pipeline
Indexing
Transcoding
Merge
Curation/enrichment
Index build
• Nightly batch job on DB dumps
• Previously mostly Python, but now moved to Java for performance reasons
• But still lots of Python helper scripts :)
Label A
Label B
Label C
Label D
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
Ingestion
Content pipeline
Indexing
Transcoding
On-site live services, e.g. search, browse
Publishing
Merge
Curation/enrichment
Distribution/publish
Index A
Index B
Index C
Service A
Service B
Service C
Scheduling being migrated to ZooKeeper
image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/
Distribution/publish
Staged rollout
Distribution/publish
Exponential back-off
waiting 5s ...
waiting 10s ...
waiting 30s ...
waiting 60s ...
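A minimal sketch of a retry loop with exponential back-off. The doubling schedule, the 60 s cap and the names `backoff_delays` and `retry` are assumptions; the waits on the slide (5, 10, 30, 60) follow a slightly different schedule:

```python
import time


def backoff_delays(base=5, factor=2, cap=60):
    """Yield wait times that grow geometrically up to a cap (assumed values)."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def retry(operation, max_attempts=5, sleep=time.sleep):
    """Call operation() until it succeeds, backing off between failures."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(next(delays))


# Demo with a fake sleep so nothing actually waits.
attempts = {"n": 0}


def flaky_publish():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("service not ready")
    return "published"


recorded_waits = []
print(retry(flaky_publish, sleep=recorded_waits.append), recorded_waits)
```

Passing `sleep` as a parameter keeps the loop testable; in production the default time.sleep applies.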
Label A
Label B
Label C
Label D
Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/
Ingestion
Content pipeline
Indexing
Transcoding
On-site live services, e.g. search, browse
Publishing
Merge
Curation/enrichment
Store ’da data
Choice of database
Depends on the use case - duh!
• PostgreSQL (e.g. user service)
• Cassandra (e.g. playlist service)
• Tokyo Cabinet (e.g. browse service)
• Lucene (search service)
• HDFS
PostgreSQL
[Pic. of elephant]
Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
PostgreSQL
Redundancy + scaling: master/slave
PostgreSQL
Joins and subqueries - let the query planner roll!
PostgreSQL
Python?
- psycopg2 + SQL queries
- SQLAlchemy migrator for versioning of DB schemas
Tip! Server-side, aka named, cursors:
import psycopg2
conn = psycopg2.connect(database='huge_db', user='postgres', password='secret')
# Giving the cursor a name makes it server-side: rows stay on the server until fetched
sscursor = conn.cursor('my_cursor')
sscursor.execute('SELECT * FROM big_table')
rows = sscursor.fetchmany(1000)
...
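The fetchmany call is typically wrapped in a loop so that only one batch is held in memory at a time. A minimal sketch of such a loop; it relies only on the DB-API fetchmany method, so the demo uses an in-memory sqlite3 database as a stand-in for PostgreSQL. The helper name `iter_rows` and the table contents are hypothetical:

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while fetching them in batches."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break  # result set exhausted
        for row in rows:
            yield row


# Stand-in database; with psycopg2 the cursor would be a named cursor instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?)", [(i,) for i in range(2500)])
cur = conn.cursor()
cur.execute("SELECT id FROM big_table ORDER BY id")
print(sum(1 for _ in iter_rows(cur, batch_size=1000)))
```

With a server-side cursor each fetchmany is a round trip to the server, so the batch size trades memory against latency.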
Scaling the content pipeline
What to scale for?
• Size of catalog
• Number of users
Distribution/publish
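Popen + gevent (although IO-bound)
import gevent
import gevent.monkey
gevent.monkey.patch_all()
import subprocess

def _wait(self):
    # Poll the child process and yield to other greenlets between polls
    while True:
        res = self.poll()
        if res is not None:
            return res
        gevent.sleep(0.1)

# Replace the blocking Popen.wait with the cooperative version
subprocess.Popen.wait = _wait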