www.knowerce.sk

Datacamp ETL Documentation
November 2009

knowerce|consulting
Document information
Creator Knowerce, s.r.o.
Vavilovova 16
851 01 Bratislava
www.knowerce.sk
Author tefan Urbnek, [email protected]
Date of creation 12.11.2009
Document revision 1
Document Restrictions
Copyright (C) 2009 Knowerce, s.r.o., Stefan Urbanek
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU
Free Documentation License, Version 1.3 or any later version published by the Free Software
Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license is included in the section entitled "GNU Free Documentation License".
Contents

Introduction
Overview
    System Context
    Objects and classes
Installation
    Software Requirements
    Preparation
    Database initialisation
    Configuration
Running ETL Jobs
    Launching
        Manual Launching
        Scheduled using cron
        Running Programmatically
    What jobs will be run
    Job Status
Job Management
    Scheduling
    Forced run
Creating a Job Bundle
    Example: Public Procurement Extraction ETL job
    Job Utility Methods
    Errors and Failing a Job
Defaults
    ETL System Defaults
    Using defaults in jobs
Appendix: ETL Tables
    etl_jobs
    etl_job_status
    etl_defaults
    etl_batch
Cron Example
Introduction

This document describes the architecture, structures and processes of the Datacamp Extraction, Transformation and Loading (ETL) framework. The purpose of the framework is to perform automated, scheduled data processing, usually in the background. Main features:

- scheduled or manual launching of ETL jobs
- job management and configuration through a database
- logging
- ETL job plug-in API

ETL tools provided:

- parallel URL downloader
- record transformation functions
- table comparisons
- table mappings
Project Page and Sources
The project page with sources can be found at:
http://github.com/Stiivi/Datacamp-ETL
Wiki Documentation:
http://wiki.github.com/Stiivi/Datacamp-ETL/
Related project Datacamp:
http://github.com/Stiivi/datacamp
Support
General Discussion Mailing List
http://groups.google.com/group/datacamp
Development Mailing List (recommended for Datacamp-ETL project):
http://groups.google.com/group/datacamp-dev
Overview

System Context
The Datacamp ETL framework has a plug-in based architecture and runs on top of a database server.

Objects and classes
The core of the ETL framework consists of the Job Manager and Job objects. There are two categories of classes: job management classes, and utility classes that are not necessary for data processing.
[Figure: System context. The ETL framework runs on top of a DB server with an ETL staging database, uses a directory for extracted and temporary files, and loads job module bundles from one or more ETL module directories.]

[Figure: Class overview. Job management classes: Job Manager, Job Info, Job Status and Job, with subclasses Extraction, Transformation and Loading. Utility classes: Download Manager, Download Batch, ETL Defaults and Batch.]

Class: Description and provided functionality

Batch: information about data processed by the ETL.
Download Batch: list of files and additional information for automated parallel downloading and processing.
Download Manager: performs parallel download of a huge number of URLs.
ETL Defaults: stores configuration variables in a key-value dictionary.
Job: abstract class for ETL jobs; provides utilities for running, logging and error handling.
Job Info: information about a job: name, type, scheduling, …
Job Manager: configures and launches jobs, handles errors.
Job Status: information about a job run: when it was run, what the result was and the reason for failure.
Installation
Software Requirements
- database server [1]
- ruby
- rails
- gems: sequel
Preparation
I. Create a directory where working files, such as dumps and ETL files, will be stored, for example:
/var/lib/datacamp
II. Create a database. For use with the Datacamp web application, create two schemas:
- data schema, for example: datacamp_data
- staging schema (for ETL), for example: datacamp_staging
III. Create a database user that has full access (SELECT, INSERT, UPDATE, CREATE TABLE, …) to the Datacamp ETL schemas.
Check: at this point you should have:
- the sources
- a working directory
- one or two database schemas
- a database user with appropriate permissions
Database initialisation
To initialize the ETL database schema, run the appropriate SQL script from the install directory, for example:
mysql -u root -p datacamp_staging < install/etl_tables.mysql.sql
[1] Currently the framework works only with a MySQL server, as there are a couple of MySQL-specific code residues. This will change in the future.
Running ETL Jobs
Launching
Manual Launching
Jobs are run by simply launching the etl.rb script:
ruby etl.rb
The script looks for config.yml in the current directory. You can pass another configuration file:
ruby etl.rb --config another_config.yml
Scheduled using cron
You will most likely want to run the ETL automatically and periodically. To do so, configure a cron job for the Datacamp ETL by creating a cron script. There is an example in install/etl_cron_job, where you have to change the ETL_PATH, CONFIG and probably RUBY variables. See the appendix, where an example file is listed.
Running Programmatically
Alternatively, configure a JobManager manually and run all scheduled jobs:
job_manager = JobManager.new
# configure job_manager here
job_manager.run_scheduled_jobs
The log is written to a preconfigured file or to the standard error output. See the installation instructions for how to configure the log file.
What jobs will be run
By default, only jobs that are enabled, are scheduled for the current day and have not already been run successfully. If all jobs succeed, any subsequent launch of the ETL should not run any jobs. All unsuccessful jobs are retried. Disabled jobs are not run on any occasion. For more information see Job Management.
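The selection rule above can be sketched in plain Ruby. This is an illustrative sketch, not the framework's actual code: the Job struct is hypothetical, standing in for a row of the etl_jobs table described under Job Management.

```ruby
require 'date'

# Hypothetical record mirroring the relevant columns of the etl_jobs table.
Job = Struct.new(:name, :enabled, :schedule, :last_run_status, :force_run)

# A job runs when it is enabled and either its force_run flag is set, or
# it is scheduled for today (daily, or a matching weekday name) and its
# last run did not finish with status "ok". Disabled jobs never run.
def should_run?(job, today = Date.today)
  return false unless job.enabled
  return true if job.force_run
  weekday = today.strftime('%A').downcase          # e.g. "monday"
  scheduled = job.schedule == 'daily' || job.schedule == weekday
  scheduled && job.last_run_status != 'ok'
end
```

With this rule, a second launch on the same day skips every job that already finished with status "ok", while failed jobs are picked up again.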
Job Status
Each job leaves a footprint of its run in the etl_job_status table. The table contains the following information:

Column: Description
job_name: task which was run
job_id: identifier of the job
status: current status of the job: ok, running, failed
phase: if a job has multiple phases, this column identifies which phase the job is in
message: error message when the job fails
start_date: when the job started
end_date: when the job finished, or NULL if the job is still running

Possible job statuses are:
running: the job is still running (or the ETL crashed and did not reset the job status)
ok: the job finished correctly
failed: the job did not finish correctly; see phase and message for more information
[Screenshot: example of successful runs; this is what you want to achieve.]

[Screenshot: example of mixed statuses, including failed ones.]
Job Management
Jobs are managed through the etl_jobs table, where you specify:

Column: Description
job_name: name of a job (see below)
job_type: type of a job: extraction, transformation, loading, …
is_enabled: set to 1 when the task is enabled
run_order: number which specifies the order in which jobs are run. Jobs are run from the lowest number to the highest. If the number is the same for several jobs, the behaviour is undefined.
schedule: when the job is run
force_run: run despite the scheduling rule

[Screenshot: example contents of the etl_jobs table.]

To add a new job, insert a row into the table and set the job information. To remove a job, just delete the row.
Scheduling
Jobs can currently be scheduled on a daily basis:
- daily: run each day
- monday, tuesday, wednesday, thursday, friday, saturday, sunday: run on the particular week day

Once a job has been successfully run by the scheduler, the job manager does not run it again unless this is explicitly requested with the force_run flag.
Forced run
There is a way to run jobs out of schedule, by setting the force_run flag. This allows data managers to re-run an ETL job remotely, without requiring access to the system where the ETL processes are hosted. The job will be run the next time the scheduler runs. For example: if the ETL is scheduled in cron to run hourly, the job is re-run within the next hour; if it is scheduled to run daily, it will be run the next day.

The flag is reset to 0 after each run to prevent the job from running again. The reason for this behaviour is to prevent unintentionally running lengthy, time- and CPU-consuming jobs, and to protect already processed data from possible inconsistencies introduced by running jobs at unexpected times.
This behaviour can be modified using the ETL system defaults:
force_run_all: run all enabled jobs, regardless of their scheduled time
reset_force_run_flag: when set to 0, the force_run flag is kept, so jobs are re-run each time the ETL script is launched. Set this to 0 for development and testing.
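The interplay of a job's force_run flag with the force_run_all and reset_force_run_flag defaults can be sketched as follows. This is a hypothetical helper illustrating the rules above, not the framework's implementation; the parameter names mirror the table columns and defaults keys.

```ruby
# Given a job's force_run flag and the two ETL system defaults, decide
# whether the job is forced to run now, and what its force_run flag
# should be after the run.
def forced_run_plan(force_run, force_run_all, reset_force_run_flag)
  run = force_run || force_run_all
  # By default the flag is cleared after each run, so a lengthy job is
  # not re-run unintentionally; with reset_force_run_flag disabled the
  # flag survives and the job runs again on every launch.
  flag_after = reset_force_run_flag ? false : force_run
  { run: run, force_run_after: flag_after }
end
```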
Creating a Job Bundle
Jobs are implemented as bundles, in other words directories containing all the necessary code and information for the job. The only requirement for a bundle is that it follows certain naming conventions and contains a Ruby script with the job class:

- the bundle directory should be named: job_name.job_type
- the bundle should contain the Ruby file: job_name_job_type.rb
- the Ruby file should contain a class with the camelized job name and job type, JobNameJobType, which should be a subclass of the appropriate job subclass (Extraction, Transformation, Loading)
- the class should implement a run method with the main job code
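The naming convention can be sketched in plain Ruby. The helper below is hypothetical, not part of the framework; it only illustrates how the directory, file and class names relate to each other.

```ruby
# Derive the bundle directory, Ruby file and class names from a job
# name and a job type, following the naming convention above.
def bundle_names(job_name, job_type)
  # Camelize "public_procurement" + "extraction" into
  # "PublicProcurementExtraction".
  camelized = "#{job_name}_#{job_type}".split('_').map(&:capitalize).join
  {
    directory:  "#{job_name}.#{job_type}",     # e.g. public_procurement.extraction
    file:       "#{job_name}_#{job_type}.rb",  # e.g. public_procurement_extraction.rb
    class_name: camelized
  }
end
```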
Example: Public Procurement Extraction ETL job
I. Create a job bundle directory: mkdir public_procurement.extraction
II. Create a Ruby file: public_procurement.extraction/public_procurement_extraction.rb
III. Implement a class named PublicProcurementExtraction:

class PublicProcurementExtraction < Extraction
  def run
    # job code goes here
  end
end
Job Utility Methods
There are several utility methods for job writers:
- files_directory: directory where working, extracted, downloaded and temporary files are stored. This directory is job-specific; each job has its own directory by default.
- logger: object for writing into the ETL manager log
- message, phase: set job status information
Each job also has access to a defaults dictionary. See the chapter about Defaults for more information.
Errors and Failing a Job
It is recommended to raise an exception on error. The exception will be handled by the job manager and the job will be closed properly, with the appropriate status and message set.

raise "unable to connect to data source"

will result in a failed job with the same message as the exception.
Defaults
Defaults is a configurable key-value dictionary used by ETL jobs as well as by the ETL system itself. The key-value pairs are stored by domain. A domain usually corresponds to a job name; for example, the invoices loading job and the invoices transformation job share the common domain invoices. The domain etl is reserved for the ETL system configuration. The purpose of defaults is to make it possible to configure ETL jobs remotely and in a more convenient way.

Defaults are stored in the etl_defaults table, which contains: domain, default_key and value.
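The behaviour of the defaults dictionary can be sketched with a hash-backed stand-in for the etl_defaults table. The class below is illustrative only: the [] and value accessors mirror the usage shown under "Using defaults in jobs", but the implementation is hypothetical.

```ruby
# Minimal in-memory stand-in for one domain of the defaults dictionary.
class Defaults
  def initialize(domain, store = {})
    @domain = domain
    @store  = store    # would be rows of etl_defaults in the real system
  end

  # Retrieve a stored value (nil when the key does not exist).
  def [](key)
    @store[key]
  end

  # Store a value; in the real system values are committed to the
  # table when the job finishes.
  def []=(key, value)
    @store[key] = value
  end

  # Return the stored value; if the key does not exist, create it with
  # the given default value and return that.
  def value(key, default)
    @store[key] = default unless @store.key?(key)
    @store[key]
  end
end
```

For example, defaults.value(:batch_size, 200) returns the stored batch size, creating the key with value 200 on first use.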
ETL System Defaults

Key: Description (default value if the key does not exist)
force_run_all: on the next ETL run, all enabled jobs are launched, regardless of their scheduling. See Running ETL Jobs. (Default: FALSE)
reset_force_run_flag: after running a forced job (see Running ETL Jobs), clear its flag so that it will not be run again. (Default: TRUE)
Using defaults in jobs
A job has access to the defaults domain based on the job name. To retrieve a value from defaults:

url = defaults[:download_url]
count = defaults[:count].to_i

Retrieve a value, or set it to a default value if it is not found:

batch_size = defaults.value(:batch_size, 200).to_i

This will look for the batch_size key; if it does not exist, the key will be created and assigned the value 200.

To store a default value:

defaults[:count] = count

Values are committed when the job finishes.
Example:

@batch_size = defaults.value(:batch_size, 200).to_i
@download_threads = defaults.value(:download_threads, 10).to_i
@download_fail_threshold = defaults.value(:download_fail_threshold, 10).to_i
Appendix: ETL Tables

etl_jobs

id (int): object identifier
name (varchar): job name
job_type (varchar): job type
is_enabled (int): flag determining whether the job is run or not
run_order (int): order in which the jobs are run. If several jobs have the same order number, the behaviour is undefined.
last_run_date (datetime): date and time when the job was last run
last_run_status (varchar): status of the last run
schedule (varchar): how the job is scheduled
force_run (int): force the job to be run the next time the ETL runs
etl_job_status

id (int): object identifier
job_name (varchar): job name
job_id (int): job identifier
status (varchar): current or last run status
phase (varchar): phase the job is currently in while running, or was in when it finished
message (varchar): status message provided by the job object, or an exception message
start_date (datetime): when the job was run
end_date (datetime): when the job finished
etl_defaults

id (int): association identifier
domain (varchar): domain name (usually corresponds to a job name)
default_key (varchar): key
value (varchar): value for the key
etl_batch

id (int)
batch_type (varchar)
batch_source (varchar)
data_source_name (varchar)
data_source_url (varchar)
valid_due_date (date)
batch_date (date)
username (varchar)
created_at (datetime)
updated_at (datetime)
Cron Example
#!/bin/bash
#
# ETL cron job script
#
# Ubuntu/Debian: put this script in /etc/cron.daily
# Other unices: schedule appropriately in /etc/crontab

#####################################################################
# ETL Configuration

# Path to your ETL installation
ETL_PATH=/usr/lib/datacamp-etl

# Configuration file (database connection and other paths)
CONFIG=$ETL_PATH/config.yml

# Ruby interpreter path
RUBY=/usr/bin/ruby

#####################################################################

ETL_TOOL=etl.rb
$RUBY -I $ETL_PATH $ETL_PATH/$ETL_TOOL --config $CONFIG