Python business intelligence (PyData 2012 talk)

Date post: 21-Nov-2014
Upload: stefan-urbanek
Description:
What is the state of business intelligence tools in Python in 2012? How is Python used for data processing and analysis? Different approaches for business data and scientific data. Video: https://vimeo.com/53063944
Page 1: Python business intelligence (PyData 2012 talk)

Python for Business Intelligence

Štefan Urbánek ■ @Stiivi ■ PyData NYC, October 2012

Page 2: Python business intelligence (PyData 2012 talk)

python business intelligence


Page 3: Python business intelligence (PyData 2012 talk)

Q/A and articles with Java solution references

(not listed here)

Results

Page 4: Python business intelligence (PyData 2012 talk)
Page 5: Python business intelligence (PyData 2012 talk)

Why?

Page 6: Python business intelligence (PyData 2012 talk)

Overview

■ Traditional Data Warehouse

■ Python and Data

■ Is Python Capable?

■ Conclusion

Page 7: Python business intelligence (PyData 2012 talk)

Business Intelligence

Page 8: Python business intelligence (PyData 2012 talk)

people

technology

processes

Page 9: Python business intelligence (PyData 2012 talk)

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 10: Python business intelligence (PyData 2012 talk)

Traditional Data Warehouse

Page 11: Python business intelligence (PyData 2012 talk)

■ Extracting data from the original sources

■ Quality assuring and cleaning data

■ Conforming the labels and measures in the data to achieve consistency across the original sources

■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards.

Source: Ralph Kimball – The Data Warehouse ETL Toolkit
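
The four steps above can be sketched as a small pipeline. This is only an illustration using the Python standard library; the sample data, the `REGION_CODES` map, and the function names are hypothetical and not part of the talk.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; region labels are inconsistent on purpose.
RAW = """region,amount
EU,100
eu ,250
Asia,
ASIA,75
"""

# Hypothetical conforming map: every source label maps to one agreed code.
REGION_CODES = {"eu": "EUR", "asia": "ASI"}


def extract(text):
    """Extract: read rows from the original source."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Quality assurance: drop rows that have no amount."""
    return [row for row in rows if row["amount"].strip()]


def conform(rows):
    """Conform labels and measures so that all sources agree."""
    return [
        (REGION_CODES[row["region"].strip().lower()], int(row["amount"]))
        for row in rows
    ]


def deliver(records, connection):
    """Deliver: store the result where query tools can reach it."""
    connection.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", records)


connection = sqlite3.connect(":memory:")
deliver(conform(clean(extract(RAW))), connection)
totals = connection.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

Each step is a plain function, so a single stage can be tested or replaced without touching the others.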

Page 12: Python business intelligence (PyData 2012 talk)

Source Systems: structured documents, databases, APIs

Temporary Staging Area

L0: Staging Area (staging)
L1: Operational Data Store (relational)
L2: Datamarts (dimensional)

Page 13: Python business intelligence (PyData 2012 talk)

real time = daily

Page 14: Python business intelligence (PyData 2012 talk)

Multi-dimensional Modeling

Page 16: Python business intelligence (PyData 2012 talk)

aggregation browsing, slicing and dicing

Page 17: Python business intelligence (PyData 2012 talk)

business / analyst’s point of view

regardless of physical schema implementation

Page 18: Python business intelligence (PyData 2012 talk)

Facts

fact

most detailed information

measurable

fact data cell

Page 19: Python business intelligence (PyData 2012 talk)

dimensions

location

type

time

Page 20: Python business intelligence (PyData 2012 talk)

Dimension

■ provide context for facts

■ used to filter queries or reports

■ control scope of aggregation of facts
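
A minimal sketch of how facts and dimensions interact, assuming a tiny star schema in an in-memory SQLite database. The table names, columns, and rows are made up for illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    -- Dimension: context for the facts (hypothetical example data).
    CREATE TABLE dim_location (id INTEGER PRIMARY KEY, country TEXT, city TEXT);
    -- Fact: the most detailed, measurable record.
    CREATE TABLE fact_sales (location_id INTEGER, year INTEGER, amount INTEGER);

    INSERT INTO dim_location VALUES (1, 'SK', 'Bratislava'), (2, 'SK', 'Kosice'),
                                    (3, 'US', 'New York');
    INSERT INTO fact_sales VALUES (1, 2011, 10), (2, 2011, 20), (3, 2011, 40),
                                  (1, 2012, 15), (3, 2012, 50);
""")

# The dimension attribute in GROUP BY controls the scope of aggregation,
# the condition in WHERE filters the facts.
by_country_2012 = connection.execute("""
    SELECT d.country, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_location AS d ON d.id = f.location_id
    WHERE f.year = 2012
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()
print(by_country_2012)
```

Grouping by `d.city` instead of `d.country` would aggregate the same facts at a finer level of the location dimension.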

Page 21: Python business intelligence (PyData 2012 talk)

Pentaho

Page 22: Python business intelligence (PyData 2012 talk)

Python and Data: community perception*

*as of Oct 2012

Page 23: Python business intelligence (PyData 2012 talk)

Scientific & Financial

Page 24: Python business intelligence (PyData 2012 talk)

Python

Page 25: Python business intelligence (PyData 2012 talk)

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 26: Python business intelligence (PyData 2012 talk)

T1[s] T2[s] T3[s] T4[s]

P1 112,68 941,67 171,01 660,48

P2 96,15 306,51 725,88 877,82

P3 313,39 189,31 41,81 428,68

P4 760,62 983,48 371,21 281,19

P5 838,56 39,27 389,42 231,12

n-dimensional array of numbers

Scientific Data

Page 27: Python business intelligence (PyData 2012 talk)

Assumptions

■ data is mostly numbers

■ data is neatly organized...

■ … in one multi-dimensional array

Page 28: Python business intelligence (PyData 2012 talk)

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 29: Python business intelligence (PyData 2012 talk)

Business Data

Page 30: Python business intelligence (PyData 2012 talk)

multiple representations of same data

multiple snapshots of one source

categories are changing

Page 31: Python business intelligence (PyData 2012 talk)

Page 32: Python business intelligence (PyData 2012 talk)

Is Python Capable? (very basic examples)

Page 33: Python business intelligence (PyData 2012 talk)

Data Pipes with SQLAlchemy

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 34: Python business intelligence (PyData 2012 talk)

■ connection: create_engine

■ schema reflection: MetaData, Table

■ expressions: select(), insert()

Page 35: Python business intelligence (PyData 2012 talk)

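# SQLAlchemy API as of 2012; bound MetaData and autoload=True
# were removed in SQLAlchemy 2.0
from sqlalchemy import create_engine, MetaData, Table

src_engine = create_engine("sqlite:///data.sqlite")
src_metadata = MetaData(bind=src_engine)
src_table = Table('data', src_metadata, autoload=True)

target_engine = create_engine("postgres://localhost/sandbox")
target_metadata = MetaData(bind=target_engine)
target_table = Table('data', target_metadata)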

Page 36: Python business intelligence (PyData 2012 talk)

clone schema:

for column in src_table.columns:
    target_table.append_column(column.copy())

target_table.create()

copy data:

insert = target_table.insert()

for row in src_table.select().execute():
    insert.execute(row)

Page 37: Python business intelligence (PyData 2012 talk)

magic used:

metadata reflection
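
Metadata reflection means the table structure is read from the database catalog instead of being restated in code. The sketch below shows the same clone-and-copy idea with only the standard library's `sqlite3` module rather than SQLAlchemy, an assumption made to keep the example self-contained. The table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical source database with one table, standing in for data.sqlite.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE data (id INTEGER, name TEXT, amount REAL)")
source.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(1, "first", 10.5), (2, "second", 20.0)],
)
target = sqlite3.connect(":memory:")

# Reflection: ask the database for the table's columns instead of
# repeating them in code. PRAGMA table_info rows are
# (cid, name, type, notnull, default, pk).
columns = [(row[1], row[2]) for row in source.execute("PRAGMA table_info(data)")]

# Clone schema: rebuild the CREATE TABLE statement from reflected metadata.
column_sql = ", ".join('"%s" %s' % (name, ctype) for name, ctype in columns)
target.execute("CREATE TABLE data (%s)" % column_sql)

# Copy data: one parameter placeholder per reflected column.
placeholders = ", ".join("?" for _ in columns)
target.executemany(
    "INSERT INTO data VALUES (%s)" % placeholders,
    source.execute("SELECT * FROM data"),
)
target.commit()

copied = target.execute("SELECT * FROM data ORDER BY id").fetchall()
print(columns)
print(copied)
```

SQLAlchemy adds one thing on top of this sketch: reflected column types are translated between database dialects, which matters when source and target are different engines.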

Page 38: Python business intelligence (PyData 2012 talk)

text file (CSV) to table:

reader = csv.reader(file_stream)

# first row holds the column names; next() works on Python 2.6+ and 3
columns = next(reader)

for column in columns:
    table.append_column(Column(column, String))

table.create()

for row in reader:
    insert.execute(row)
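
A self-contained variant of the CSV load, sketched with `csv` and `sqlite3` from the standard library instead of SQLAlchemy. The CSV content and the `imported` table name are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; a real pipeline would open a file instead.
file_stream = io.StringIO("name,city\nAlice,Bratislava\nBob,New York\n")

reader = csv.reader(file_stream)
columns = next(reader)  # the header row supplies the column names

connection = sqlite3.connect(":memory:")

# Every column is created as text, mirroring Column(column, String).
column_sql = ", ".join('"%s" TEXT' % name for name in columns)
connection.execute("CREATE TABLE imported (%s)" % column_sql)

placeholders = ", ".join("?" for _ in columns)
# The reader is an iterator over the remaining rows, so it can be
# handed to executemany directly.
connection.executemany("INSERT INTO imported VALUES (%s)" % placeholders, reader)
connection.commit()

rows = connection.execute("SELECT * FROM imported ORDER BY name").fetchall()
print(rows)
```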

Page 39: Python business intelligence (PyData 2012 talk)

Simple T from ETL

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 40: Python business intelligence (PyData 2012 talk)

target fields and source transformations:

transformation = [
    ('fiscal_year', {"function": int, "field": "fiscal_year"}),
    ('region_code', {"mapping": region_map, "field": "region"}),
    ('borrower_country', None),
    ('project_name', None),
    ('procurement_type', None),
    ('major_sector_code', {"mapping": sector_code_map, "field": "major_sector"}),
    ('major_sector', None),
    ('supplier', None),
    ('contract_amount', {"function": currency_to_number, "field": 'total_contract_amount'}),
]

Page 41: Python business intelligence (PyData 2012 talk)

Transformation

for row in source:
    result = transform(row, transformation)
    table.insert(result).execute()
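
The slides do not show the body of `transform`. One possible implementation is sketched below, under the assumption that `None` means "copy the field of the same name", that `"mapping"` is a dictionary lookup, and that `"function"` is applied last. The helper `currency_to_number`, the contents of `region_map`, and the sample row are hypothetical.

```python
def transform(row, transformation):
    """Build a target record from a source row (hypothetical helper)."""
    result = {}
    for target_field, spec in transformation:
        spec = spec or {}
        # Without an explicit "field" the source field has the same name.
        value = row[spec.get("field", target_field)]
        if "mapping" in spec:
            value = spec["mapping"][value]
        if "function" in spec:
            value = spec["function"](value)
        result[target_field] = value
    return result


def currency_to_number(text):
    """Hypothetical converter: '$1,250.50' becomes 1250.5."""
    return float(text.replace("$", "").replace(",", ""))


# Hypothetical mapping of source labels to codes.
region_map = {"Africa": "AFR", "Europe and Central Asia": "ECA"}

transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("supplier", None),
    ("contract_amount", {"function": currency_to_number,
                         "field": "total_contract_amount"}),
]

row = {
    "fiscal_year": "2012",
    "region": "Africa",
    "supplier": "ACME",
    "total_contract_amount": "$1,250.50",
}
result = transform(row, transformation)
print(result)
```

Keeping the transformation as data rather than code means the same list can also serve as documentation of where each target field comes from.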

Page 42: Python business intelligence (PyData 2012 talk)

OLAP with Cubes

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

Page 43: Python business intelligence (PyData 2012 talk)

cubes: measures
dimensions: levels, attributes, hierarchy

Model

{
    "name": "My Model",
    "description": ....
    "cubes": [...],
    "dimensions": [...]
}

Page 44: Python business intelligence (PyData 2012 talk)

logical

physical

Page 45: Python business intelligence (PyData 2012 talk)

Application

cubes

Aggregation Browser backend

1 load_model("model.json")
2 create_workspace("sql", model, url="sqlite:///data.sqlite")
3 model.cube("sales")
4 workspace.browser(cube)

Page 46: Python business intelligence (PyData 2012 talk)

drill-down

browser.aggregate(cell, drilldown=["sector"])

Page 47: Python business intelligence (PyData 2012 talk)

for row in result.table_rows("sector"):
    row.label
    row.key
    row.record["amount_sum"]

Page 48: Python business intelligence (PyData 2012 talk)

whole cube

cell = Cell(cube)
browser.aggregate(cell)

Total

browser.aggregate(cell, drilldown=["date"])

2006 2007 2008 2009 2010

cut = PointCut("date", [2010])
cell = cell.slice(cut)

browser.aggregate(cell, drilldown=["date"])

Jan Feb Mar Apr May ...
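
To make the cell, cut, and drill-down ideas concrete without the Cubes library, here is a minimal pure-Python sketch. It is not the Cubes API: the `aggregate` function, its `cuts` and `drilldown` parameters, and the fact rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: one dict per fact, "amount" is the measure.
FACTS = [
    {"year": 2009, "month": 1, "sector": "energy", "amount": 10},
    {"year": 2010, "month": 1, "sector": "energy", "amount": 20},
    {"year": 2010, "month": 2, "sector": "health", "amount": 30},
    {"year": 2010, "month": 2, "sector": "energy", "amount": 5},
]


def aggregate(facts, cuts=None, drilldown=None):
    """Sum "amount" over the facts that match all cuts.

    cuts:      dict of dimension -> required value (the "cell")
    drilldown: dimension to group by, or None for a single total
    """
    cuts = cuts or {}
    totals = defaultdict(int)
    for fact in facts:
        if all(fact[dim] == value for dim, value in cuts.items()):
            key = fact[drilldown] if drilldown else "total"
            totals[key] += fact["amount"]
    return dict(totals)


whole_cube = aggregate(FACTS)
by_year = aggregate(FACTS, drilldown="year")
months_of_2010 = aggregate(FACTS, cuts={"year": 2010}, drilldown="month")
print(whole_cube, by_year, months_of_2010)
```

The three calls mirror the slide: the whole cube gives one total, drilling down by year splits it, and cutting at 2010 before drilling down descends to months.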

Page 49: Python business intelligence (PyData 2012 talk)

How can Python be Useful

Page 50: Python business intelligence (PyData 2012 talk)

■ saves maintenance resources

■ shortens development time

■ saves you from going insane

just the Language

Page 51: Python business intelligence (PyData 2012 talk)

Source Systems: structured documents, databases, APIs

Temporary Staging Area

L0: Staging Area (staging)
L1: Operational Data Store (relational)
L2: Datamarts (dimensional)

faster

Page 52: Python business intelligence (PyData 2012 talk)

Data Governance

Analysis and Presentation

Extraction, Transformation, Loading

Data Sources

Technologies and Utilities

faster advanced

understandable, maintainable

Page 53: Python business intelligence (PyData 2012 talk)

Conclusion

Page 54: Python business intelligence (PyData 2012 talk)

BI is about…

people

technology

processes

Page 55: Python business intelligence (PyData 2012 talk)

don’t forget metadata

Page 56: Python business intelligence (PyData 2012 talk)

who is going to fix your COBOL Java tool if you have only Python guys around?

Future

Page 57: Python business intelligence (PyData 2012 talk)

Python is capable, let’s start
