DIRTY DATA AND HOW TO FIX IT
ShipServ Smart Procurement 2017, Hamburg
Georgina Gavin, Chief Commercial Officer 29.03.2017
AGENDA
1. What do we want from Big Data?
2. What do we mean by ‘Dirty Data’?
3. The importance of cleaning
4. Data discipline
5. Investing in people and technology
6. Non-technical overview of VV coding and database
7. Why go to all this effort?
INTRODUCTION TO VESSELSVALUE
SERVICES
VALUES
• Daily updated values for Vessels, Companies, Portfolios
• Tankers, Bulkers, Containers, LNG, LPG, PSVs, AHTSs, AHTs, MODUs
• Accuracy tested and reported
• Full supporting information
SERVICES
SEARCH
• Powerful, highly accurate, interactive database
• Search and compare by any combination of criteria
• Fleet search: vessels, companies, specifications, incidents, locations, laden/ballast
• Deals search: S&P, Newbuilds, Demolitions, Charters
SERVICES
MAP
• AIS Satellite and Terrestrial mapping/tracking
• GIS Maritime energy infrastructure (oil fields, platforms, pipelines, windfarms)
• Automated alerts (e.g. pre-defined sanction zones, OSV activity around rigs). We currently provide these to banks and regulators
BIG DATA AND WHAT WE WANT
“GURUS AMONG US HAVE PROCLAIMED 2017 WILL BE THE YEAR BIG DATA GOES MAINSTREAM”
FORBES, JAN 31 2017
BIG DATA AND WHAT WE WANT
• Where you have too much data to comprehend and you’re receiving it so fast you struggle to process it
• AIS, for example
• New ways to process and store this data
• However, big data sets aren’t the challenge; understanding what to do with them is!
• The important shift we’re all looking for: rather than simply reflecting performance, big data needs to help drive business operations
WHAT IS IT?
HOW BIG IS BIG?
• Byte = one grain of rice
• Kilobyte = cup of rice
• Megabyte = 8 Sacks
• Gigabyte = 3 Trucks
• Terabyte = 2 Container Ships
• Petabyte = Area size of London
• Exabyte = Area size of UK
• Zettabyte = Fills the Pacific Ocean
• Yottabyte = Rice ball the size of Earth!
VV’S BIG DATA EXPLAINED
64,222 Ships
47,639 Valuable
14.4M Rows AIS Position data
Daily on average
5M Captains’ Reports
75k changes
16.2 Billion Rows
Archived
386M Valuations
+1.5M by user request
Obviously depends on what your business is
Data into INFORMATION
• KPIs
• BI analytics
VV: AIS linked with economic data
• Analyse different types of risk: commercial risk, and voyage risk related to environment and navigational safety
• Define for yourself how risk should be quantified; set your own parameters
• Identify opportunities to optimise your business today
• Identify NEW opportunities
• Solid information (data you can trust to be accurate and commercially sound) will make you better informed and give you confidence to make braver decisions
WHAT DO WE WANT TO ACHIEVE
• APIs provide access to specific datasets
• Datasets are dynamically updated, giving up-to-the-minute ‘live’ information
• You, the receivers, can run complex queries, using parameters (quality indicators) to specify your request
• Query the data at any time
• Currently available in JSON, CSV and XML formats
ADVANTAGES
➢Easy to implement
➢Reliable and proven technology
➢Receiver can instantaneously send feedback
➢Cloud storage now available to support
DATA DELIVERY VIA API (APPLICATION PROGRAMMING INTERFACE)
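As a rough illustration of the API delivery described above, the sketch below builds a parameterised query URL and filters a JSON response by a quality indicator. The endpoint, the parameter names (`type`, `min_quality`, `format`) and the response fields are all hypothetical; a real feed would define its own.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.com/v1/vessels"


def build_query_url(vessel_type: str, min_quality: float, fmt: str = "json") -> str:
    """Build a query URL with filter parameters and a response format."""
    if fmt not in {"json", "csv", "xml"}:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"type": vessel_type, "min_quality": min_quality, "format": fmt}
    return f"{BASE_URL}?{urlencode(params)}"


def parse_vessels(payload: str, min_quality: float) -> list:
    """Parse a JSON payload and keep records meeting the quality threshold."""
    records = json.loads(payload)["vessels"]
    return [r for r in records if r["quality"] >= min_quality]


# Made-up sample response; no network call is made in this sketch.
SAMPLE = '{"vessels": [{"imo": 1, "quality": 0.95}, {"imo": 2, "quality": 0.4}]}'
url = build_query_url("tanker", 0.9)
good = parse_vessels(SAMPLE, 0.9)
```

Filtering on the receiver side as well as in the query is a deliberate choice here: it keeps the quality threshold explicit even if the provider's defaults change.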
‘DIRTY DATA’
• Data is inevitably "dirty" thanks to obsolete, inaccurate, and missing information.
• Cleaning it up is an increasingly important and overlooked job that can help prevent costly mistakes.
• Although techniques are improving all the time, scrubbing data can only accomplish so much. Even when dealing with a relatively tidy set of information, getting useful results can be arduous and time-consuming.
• Every single person in your organization must buy in to the value analytics brings, from data gathering to management.
Reducing risk of dirty data
THERE ARE NO CLEAN DATA SETS!
Untangling ‘jumping ships’ and multiple ships reporting on the same unique identifier
BEFORE
AFTER
MANUAL FIXING
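Before a case reaches manual fixing, candidate ‘jumping ships’ can be surfaced automatically. The sketch below is one possible approach, not VV's actual method: it greedily splits reports that share one identifier into separate tracks whenever the implied speed between consecutive positions is impossible. The 40-knot ceiling and the `(hours, lat, lon)` report layout are assumptions.

```python
import math

MAX_SPEED_KNOTS = 40.0  # assumed ceiling for a merchant ship; tune per fleet


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles."""
    r_nm = 3440.065  # mean Earth radius in nautical miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r_nm * math.asin(math.sqrt(a))


def split_tracks(reports, max_speed=MAX_SPEED_KNOTS):
    """Greedily split (hours, lat, lon) reports into physically plausible tracks."""
    tracks = []
    for rep in sorted(reports):  # process in time order
        t, lat, lon = rep
        for track in tracks:
            t0, lat0, lon0 = track[-1]
            dt = t - t0
            # Join the first track this report could plausibly continue.
            if dt > 0 and haversine_nm(lat0, lon0, lat, lon) / dt <= max_speed:
                track.append(rep)
                break
        else:
            tracks.append([rep])  # no plausible track: treat as another ship
    return tracks


# Two ships sharing one identifier: one near Hamburg, one near Singapore.
reports = [(0, 53.5, 9.9), (0.5, 1.3, 103.8), (1, 53.6, 9.8),
           (1.5, 1.3, 103.9), (2, 53.7, 9.7)]
tracks = split_tracks(reports)
```

A greedy split like this only separates the obvious cases; ambiguous ones (ships close together, long gaps) would still need the manual review the slide refers to.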
Write algorithms to spot outliers and determine whether they are within an acceptable tolerance
• Because the volume of data is so huge, software can automatically sift through numbers and text to look for anything unusual that needs further review
• Over time, computers can improve their accuracy in spotting what belongs and what doesn't. They can also better understand what words and phrases mean by clustering similar examples together and then grading their interpretations for accuracy. (AI)
• Remember: models take time to improve
OUTLIERS
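One common way to "spot outliers within an acceptable tolerance" is the modified z-score based on the median and the median absolute deviation (MAD). The sketch below uses that technique as an example; it is not necessarily the algorithm VV runs, and the tolerance of 3.5 is a conventional default rather than a figure from this deck.

```python
import statistics


def flag_outliers(values, tolerance=3.5):
    """Return indices of values whose modified z-score exceeds the tolerance."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > tolerance]


# Made-up reported speeds in knots; one entry is clearly implausible.
speeds = [12.1, 11.8, 12.4, 12.0, 95.0, 11.9]
suspect = flag_outliers(speeds)
```

The median is used instead of the mean because a single extreme value would otherwise drag the baseline towards itself and hide the very outlier being looked for.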
“Senior shipping executives need to start looking more closely into how data analytics can augment human decisions, while bringing the workforce up to speed…
…With technology changing rapidly today, the industry will develop slower than others if it does not harness and use big data successfully.” Oh Bee Lock, PSA
Many organisations are purchasing data but may not currently have the technical capabilities or the economic data that can be linked to produce useful analytics
Data processors vs data providers!
VV has a large, dedicated team of mathematicians and developers with the freedom to use the best data available
HANDLING BIG DATA
CLEANING
• Understand what we’re receiving
• Consider potential problems now and in the future, for example unrealistic distance travelled, loss of signal
• Solution: flag or alert when one of those problems occurs
• Algorithms to automatically identify and fix
• Sometimes requires manual fixing, for example a captain’s incorrect entry of the ship’s MMSI number on AIS
• All of this happens in real time, 24/7
• A team of 20 continuously monitor and analyse our data to turn it into useful information
HOW DO WE HANDLE BIG DATA?
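The "flag or alert" step above could look something like the following minimal sketch. It assumes each interval between two consecutive position reports has already been reduced to elapsed hours and distance covered; the `Leg` type, the flag names and both thresholds are hypothetical.

```python
from dataclasses import dataclass

MAX_SPEED_KNOTS = 40.0  # assumed plausibility ceiling
MAX_GAP_HOURS = 6.0     # assumed threshold for "loss of signal"


@dataclass
class Leg:
    """Interval between two consecutive position reports of one ship."""
    hours: float        # elapsed time between the reports
    distance_nm: float  # distance covered, in nautical miles


def flag_leg(leg: Leg) -> list:
    """Return the list of problem flags raised by one leg (empty if clean)."""
    flags = []
    if leg.hours > MAX_GAP_HOURS:
        flags.append("SIGNAL_LOSS")
    if leg.hours > 0 and leg.distance_nm / leg.hours > MAX_SPEED_KNOTS:
        flags.append("UNREALISTIC_DISTANCE")
    return flags
```

For example, `flag_leg(Leg(hours=1.0, distance_nm=500.0))` raises the distance flag, since 500 knots is not a plausible ship speed. Flagged legs can then go either to an automatic fix or to a manual review queue.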
DATA DISCIPLINE
✓ Structured databases
Free form electronic notes
Data inputters need training to input data correctly
HELPFUL TECHNIQUES
Input validation
Standardised fields will help
Suggested drop downs
Outlier analysis – predefined correct ranges
Sister analysis
ORGANISING DATA
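The techniques listed above can be sketched in a few lines. This example assumes that "sister analysis" means comparing a ship against sister vessels of the same design; the field names, the allowed ranges and the 5% tolerance are illustrative assumptions, not VV's actual rules.

```python
import statistics

# Assumed plausible ranges; real limits would come from domain experts.
VALID_RANGES = {"dwt": (1_000, 450_000), "built_year": (1950, 2017)}
# Values a drop-down would offer instead of free-form text.
VALID_TYPES = {"Tanker", "Bulker", "Container", "LNG", "LPG"}


def validate(record: dict) -> list:
    """Return a list of input errors for one record (empty if valid)."""
    errors = []
    if record.get("type") not in VALID_TYPES:
        errors.append("type: not in drop-down list")
    for field, (lo, hi) in VALID_RANGES.items():
        value = record.get(field)
        if value is None or not lo <= value <= hi:
            errors.append(f"{field}: outside {lo}-{hi}")
    return errors


def differs_from_sisters(value, sister_values, rel_tol=0.05):
    """True if value deviates from the sisters' median by more than rel_tol."""
    typical = statistics.median(sister_values)
    return abs(value - typical) > rel_tol * typical
```

A record such as `{"type": "tanker", "dwt": 10500000, "built_year": 2010}` fails twice: the type is not one of the standardised values, and the deadweight has an extra digit that the predefined range catches.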
INVESTING IN PEOPLE AND TECHNOLOGY
87 STAFF
STOKE: 21 highly trained, skilled programmers. Most have a mathematics background. Product development for internal and external systems.
IOW: 45 dedicated researchers, data inputters
LONDON HQ: Commercial, analysts, economists, quants
SINGAPORE: Representative Office
VV OFFICES
59,164 Development Hours
(between 2009 and 2016)
£4.1M Cost @ £70/hour
18 Experienced Developers
required to recode in a single year, given a complete spec
15.3 Billion Rows of AIS Position Data
> World population (7.4B)
5,400 Columns of Data
in 450+ Tables
10+ TB of Storage
To store all VV Data
40 Servers
22 Database, 18 Compute
2.8M Lines of Code
29 Development testing sites
THE FIGURES
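The headline figures are consistent with each other, as a quick check shows. The inputs are taken from the slide; only the variable names are introduced here.

```python
hours = 59_164                 # development hours, 2009 to 2016
rate_gbp = 70                  # cost per hour in pounds
cost_gbp = hours * rate_gbp    # total development cost in pounds

ais_rows = 15.3e9              # rows of AIS position data
world_population = 7.4e9
rows_per_person = ais_rows / world_population

servers = 22 + 18              # database servers plus compute servers
```

The cost comes to £4,141,480, which rounds to the quoted £4.1M, and the AIS table holds roughly two rows for every person on Earth.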
2.8M LINES OF CODE
WHY ARE WE DOING THIS?
SEARCH THE DATABASE
TRADE ANALYTICS START AT VESSEL LEVEL
VESSEL LEVEL STOPPAGES
VESSEL LEVEL JOURNEYS
VESSEL LEVEL PROBABLE EVENTS
AGGREGATED UP TO COMPANY OR SECTOR LEVEL
AVERAGE SPEEDS & TON MILE
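To make the roll-up from vessel level to company level concrete, here is a minimal sketch. It uses the standard definition of ton-miles (cargo carried multiplied by distance sailed) and a distance-over-time average speed; the journey fields and company names are made up for illustration.

```python
from collections import defaultdict


def aggregate_by_company(journeys):
    """Sum ton-miles and compute average speed (distance / time) per company."""
    totals = defaultdict(lambda: {"ton_miles": 0.0, "distance_nm": 0.0, "hours": 0.0})
    for j in journeys:
        t = totals[j["company"]]
        t["ton_miles"] += j["cargo_tonnes"] * j["distance_nm"]
        t["distance_nm"] += j["distance_nm"]
        t["hours"] += j["hours"]
    return {
        company: {
            "ton_miles": t["ton_miles"],
            "avg_speed_knots": t["distance_nm"] / t["hours"],
        }
        for company, t in totals.items()
    }


# Made-up vessel-level journeys; a ballast leg carries zero cargo.
journeys = [
    {"company": "A", "cargo_tonnes": 50_000, "distance_nm": 1_200, "hours": 100},
    {"company": "A", "cargo_tonnes": 0, "distance_nm": 600, "hours": 50},
    {"company": "B", "cargo_tonnes": 30_000, "distance_nm": 2_400, "hours": 200},
]
summary = aggregate_by_company(journeys)
```

Dividing total distance by total time, instead of averaging per-journey speeds, stops short journeys from being over-weighted. The same grouping key could be a sector instead of a company.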
This is the flow of the requests for a single valuation. The darker colour highlights where the largest % of the time is spent.
It took 213ms (one fifth of a second) to calculate, log and return the values.
Knowledge like this allows us to continuously optimise and eliminate bottlenecks.
Once DCF was complete it took over 3 hours to value every ship. After a few days of optimisation it was down to 13 minutes.
PERFORMANCE AND SPEED
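The kind of per-stage timing described above can be collected with very little code. This is a generic sketch using Python's standard library, not VV's tooling; the stage names in the usage note are hypothetical.

```python
import time
from contextlib import contextmanager

durations = {}  # stage name -> accumulated seconds


@contextmanager
def timed(stage):
    """Record the wall-clock time spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        durations[stage] = durations.get(stage, 0.0) + elapsed


def time_shares(durs):
    """Convert per-stage seconds into fractions of the total time."""
    total = sum(durs.values())
    return {stage: secs / total for stage, secs in durs.items()}


def slowest_stage(durs):
    """Name of the stage where the most time is spent."""
    return max(durs, key=durs.get)
```

Wrapping each step as `with timed("calculate"): ...` fills `durations`, and `slowest_stage(durations)` then points at the bottleneck to optimise first. For scale, going from over 3 hours to 13 minutes is a speed-up of at least 180 / 13, or roughly 14 times.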
• Explore other modern ways to support your business; embrace change!
• Understand your own capabilities and be realistic: this will dictate the plan you need to follow
• Establish clear and simple goals
• Remain informed
• Question your suppliers and processors
• Demand transparency
• It’s not what you know. It’s what you do with what you know.
TO SUMMARISE
THANK YOU