Internet Measurement and Data Analysis (1)
Kenjiro Cho
2012-09-26
introduction
what does the entire Internet look like?
lumeta internet mapping http://www.lumeta.com
http://www.cheswick.com/ches/map/
introduction (cont’d)
what does the entire Internet look like?
- no one knows
- but everyone is interested
the theme of the class
- looking at the Internet from different views
  - how to measure what is difficult to measure
  - how to extract useful information from huge data sets
this kind of approach will be increasingly important in the future information society
Internet measurement and data analysis
- Faculty: Kenjiro Cho 〈[email protected]〉
- TA: Yohei Kuga 〈[email protected]〉
- SA: Yukito Ueno 〈[email protected]〉
- URL: http://web.sfc.keio.ac.jp/~kjc/classes/sfc2012f-measurement/
- support email (faculty, TA, SA): 〈[email protected]〉
- textbooks, references: the lecture slide materials will be provided online.
- programming: data processing exercises in Ruby
- evaluation: 2 assignments and a final report
what you will learn in the class
- how to understand statistical aspects of data, and how to process and visualize data
  - which should be useful for writing theses and other reports
- programming skills to process a large amount of data
  - beyond what existing software packages provide
- ability to question statistical results
  - the world is full of dubious statistical results and information manipulation
  - (improving literacy on online privacy)
Big Data everywhere
big data by cloud computing
- “big data” has become a trendy word, especially for marketing
  - most technologies are not new
  - they have been used in search ranking, online recommender systems, etc.
- big data processing used to be limited to big organizations that could collect, manage, and analyze data in-house
- now, anyone can easily use big data with cloud services
  - package tools are available for collecting and analyzing online customer behavior
  - customer information can be easily used for marketing with minimal initial investment
the age of data
- big data is not just for marketing
- technological innovations known as the data revolution are occurring in every field
  - previously difficult applications become possible
  - access to huge amounts of data, analysis of constantly updated data, and applications to non-linear models
- big data analysis is becoming an indispensable research method in all areas of science and technology
example: impact on science
e-science: paradigm shift?
- theory
- experiment
- simulations (enabled by computer)
- data-driven discovery (enabled by big data)
example: Internet vehicle experiments
- by WIDE Project in Nagoya in 2001
  - location, speed and wiper usage data from 1,570 taxis
  - blue areas indicate a high ratio of wiper usage, showing rainfall in detail
Japan Earthquake
- the system is now part of ITS
- information on usable roads was released 3 days after the quake
  - data provided by HONDA (TOYOTA, NISSAN)
Google’s Chief Economist Hal Varian on Statistics
The McKinsey Quarterly, January 2009
“I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it.”
self-introduction
Kenjiro Cho
- positions
  - Research Director, IIJ Research Lab
  - Guest Professor, Keio SFC
  - Adjunct Professor, JAIST
  - Board member, WIDE Project
- bio
  - BE in electronics from Kobe University in 1984
    - started as a hardware engineer at Canon, Inc, then became interested in operating systems
  - M.Eng in computer science from Cornell University in 1993
    - studied computer science and distributed systems
  - Researcher at Sony Computer Science Labs from 1996
    - research on the Internet
  - Ph.D. (Media and Governance) from Keio University in 2001
  - Researcher at IIJ from 2004
- research topics
  - Internet measurement and management
  - networking support in operating systems
class overview
It has become possible to access a huge amount of diverse data through the Internet. This allows us to obtain new knowledge and create new services, leading to an innovation called "Big Data" or "Collective Intelligence". In order to understand such data and use it as a tool, one needs to have a good understanding of the technical background in statistics, machine learning, and computer network systems.
In this class, you will get an overview of large-scale data analysis on the Internet, and learn basic skills to obtain new knowledge from massive information for the forthcoming information society.
class overview (cont’d)
Theme, Goals, Methods
In this class, you will learn about data collection and data analysis methods on the Internet, to obtain knowledge and understanding of networking technologies and large-scale data analysis.
Each class will cover specific topics, where you will learn the technologies and the theories behind them. In addition to the lectures, each class includes programming exercises to build data analysis skills.
Prerequisites
The prerequisites for the class are basic programming skills and basic knowledge of statistics.
In the exercises and assignments, you will need to write programs to process large data sets, using the Ruby scripting language and the Gnuplot plotting tool. To understand the theoretical aspects, you will need basic knowledge of algebra and statistics. However, the focus of the class is to understand how mathematics is used for engineering applications.
class schedule (1/5)
- Class 1 Introduction (9/26)
  - Big Data and Collective Intelligence
  - Internet measurement
  - Large-scale data analysis
  - exercise: introduction of Ruby scripting language
- Class 2 Data and variability (10/3)
  - Summary statistics
  - Sampling
  - How to make good graphs
  - exercise: graph plotting by Gnuplot
- NO CLASS on 10/10
- Class 3 Data recording and log analysis (10/17)
  - Network management tools
  - Data format
  - Log analysis methods
  - exercise: log data and regular expression
class schedule (2/5)
- Class 4 Distribution and confidence intervals (10/24)
  - Normal distribution
  - Confidence intervals and statistical tests
  - Distribution generation
  - exercise: confidence intervals
  - assignment 1
- Class 5 Diversity and complexity (10/31)
  - Long tail
  - Web access and content distribution
  - Power-law and complex systems
  - exercise: power-law analysis
- Class 6 Correlation (11/7)
  - Online recommendation systems
  - Distance
  - Correlation coefficient
  - exercise: correlation analysis
class schedule (3/5)
- Class 7 Multivariate analysis (11/14)
  - Data sensing
  - Linear regression
  - Principal Component Analysis
  - exercise: linear regression
- Class 8 Time-series analysis (11/22?) ***makeup class
  - Internet and time
  - Network Time Protocol
  - Time series analysis
  - exercise: time-series analysis
  - assignment 2
- Class 9 Topology and graph (11/28)
  - Routing protocols
  - Graph theory
  - exercise: shortest-path algorithm
- Class 10 Anomaly detection and machine learning (12/5)
  - Anomaly detection
  - Machine Learning
  - SPAM filtering and Bayes theorem
  - exercise: naive Bayesian filter
class schedule (4/5)
- Class 11 Data Mining (12/12)
  - Pattern extraction
  - Classification
  - Clustering
  - exercise: clustering
- Class 12 Search and Ranking (12/19)
  - Search systems
  - PageRank
  - exercise: PageRank algorithm
- Class 13 Scalable measurement and analysis (12/26)
  - Distributed parallel processing
  - Cloud computing technology
  - MapReduce
  - exercise: MapReduce algorithm
- Class 14 Privacy Issues (1/9)
  - Internet data analysis and privacy issues
  - Summary of the class
network measurement and Internet measurement
- network measurement
  - measurement in a limited environment
  - snapshot at a time
- Internet measurement
  - measurement of the Internet as a large-scale open system
    - large-scale distributed system
    - open system (continuously changing)
Internet measurement – measuring unmeasurable Internet
- need for generic measurement data for the Internet
  - example: typical packet size distribution
- the Internet is an open system that is continuously changing, evolving, and expanding
  - no central point and no representative locations: different behaviors are observed depending on the observing location and time
- seeking generality of the Internet: measuring unmeasurables
  - for operation of the Internet, and for development of protocols, equipment and services
  - seeking the best estimates, predicting the future, and revisiting existing knowledge
- need to consider not only technical aspects but also social, political and economic aspects
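To make the packet size distribution example concrete, here is a minimal Ruby sketch. The packet sizes, the two observation points (`SITE_A`, `SITE_B`) and the 500-byte bin width are made-up assumptions for illustration, not measured data. It only shows how the same statistic can look quite different depending on where it is observed.

```ruby
BIN_WIDTH = 500  # bin width in bytes (an arbitrary choice for this sketch)

# Fraction of packets falling into each size bin, keyed by the bin's lower bound.
def size_distribution(sizes, bin_width = BIN_WIDTH)
  counts = Hash.new(0)
  sizes.each { |s| counts[(s / bin_width) * bin_width] += 1 }
  counts.sort.map { |bin, n| [bin, n.to_f / sizes.size] }.to_h
end

# Hypothetical packet sizes (bytes) seen at two hypothetical observation points.
SITE_A = [40, 40, 52, 1500, 1500, 1500, 1500, 576]
SITE_B = [40, 40, 40, 40, 64, 128, 1500, 200]

{ "site A" => SITE_A, "site B" => SITE_B }.each do |name, sizes|
  puts name
  size_distribution(sizes).each do |bin, frac|
    puts "  %4d-%4d bytes: %.2f" % [bin, bin + BIN_WIDTH - 1, frac]
  end
end
```

In this toy data, half of site A's packets are full-size while site B is dominated by small packets, so neither could be called "typical".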
importance of measurement
measurement is a basis of all technologies
- for networking, it is an attempt to observe invisible networks
- needed for operation, design, implementation, and research
- however, it has become difficult with the commercialization and widespread use of the Internet
  - traffic data is confidential for providers and will not be disclosed
  - risks of leaking private information
goals of measurement and data analysis
- operational goals
  - trouble-shooting
  - tuning for performance and reliability
  - monitoring the usage, usage reports
  - long-term planning, cost evaluation of network capacity and equipment
- engineering goals (software, hardware, protocol design and implementations)
  - design trade-offs (e.g., buffer size and its cost)
  - testing and evaluation
  - observing unexpected behaviors (in complex systems)
- research goals (theory, modeling, new findings)
  - characteristics of network behaviors
  - modeling (e.g., behavior of web services)
  - behaviors of complex systems
  - abundant data and tools
- inputs for policy or investment plans
characteristics of network data and behavior
- skewed distributions with large variance
  - inherent mechanism to make burst transfer
  - skewed utilization: e.g., a handful of users generate most traffic
- anomalies everywhere
  - bugs, mis-configurations, spec mismatches, accidents, maintenance
- interferences among various mechanisms
  - e.g., congestion control: Ethernet’s collision avoidance, packet queueing, TCP’s congestion control, capacity provisioning
- traffic aggregation
  - complex behavior as a whole (more than the sum of the individual components)
- limitations of network measurement
  - many practical issues and limitations exist
  - measurement affects the observed behavior
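The "skewed utilization" point can be illustrated with a small Ruby sketch. The per-user byte counts in `USAGE` are invented for illustration only, and `top_share` is a hypothetical helper name. The sketch computes what share of the total volume comes from the heaviest users, and compares mean and median to show how a skewed distribution pulls them apart.

```ruby
# Share of the total volume generated by the top `fraction` of users.
# `bytes_by_user` maps a user identifier to a byte count.
def top_share(bytes_by_user, fraction)
  volumes = bytes_by_user.values.sort.reverse
  top_n = [(volumes.size * fraction).ceil, 1].max
  volumes.first(top_n).sum.to_f / volumes.sum
end

# Made-up per-user byte counts (not measured data).
USAGE = {
  "u01" => 9_000, "u02" => 400, "u03" => 200, "u04" => 100, "u05" => 100,
  "u06" => 50,    "u07" => 50,  "u08" => 40,  "u09" => 30,  "u10" => 30,
}

mean = USAGE.values.sum.to_f / USAGE.size
upper_median = USAGE.values.sort[USAGE.size / 2]

puts "top 10%% of users: %.0f%% of traffic" % (top_share(USAGE, 0.1) * 100)
puts "mean: #{mean} bytes, median: #{upper_median} bytes"
```

With data like this the mean says little about a typical user, which is one reason summary statistics need care with network data.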
measurement needs combined skills
- goals could be operational, engineering, scientific
  - all inseparable, all skills required
    - knowledge of operational environment
    - engineering of measurement tools
- output can be facts, findings, new ideas
  - new ideas are not always necessary
  - facts, especially long-term measurement, are valuable
- but you should have clear goals
  - better to start with real problems to solve
  - there are many issues and problems but some are more important than others
why is traffic measurement of the Internet so hard?
- massive, diverse and changing traffic
  - mechanisms at different layers in different time scales
    - interact with each other
- dynamics
  - Internet mechanisms are adaptive and resilient
  - traditional measurement techniques are often not applicable
- pathological traffic is not unusual
  - caused by bugs, misconfigurations, errors, mismatches, accidents
- we still don’t have a good understanding
massive volume of traffic
- unprecedented scale with unprecedented growth
  - e.g., traffic volume: 1Gbps traffic
    - 120MB/sec, 7GB/minute, 420GB/hour, 9.8TB/day (approximate, with binary prefixes: 1MB = 2^20 bytes)
- far more data than we can analyze
  - techniques needed to reduce data size
    - filtering: e.g., record only TCP SYN packets
    - aggregation: e.g., flow-based accounting
    - sampling: e.g., record 1 in n packets
  - also, techniques needed to reduce dimensionality
- still, details matter
  - a big impact often comes
    - from a small fraction
    - from minor differences
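The volume figures and the three data reduction techniques on this slide can be sketched in a few lines of Ruby. This is an illustrative sketch only: the `Packet` record and its fields are assumptions, not the format of any capture tool, and the sample packets are made up. The first helper reproduces the per-day figure for a fully used 1 Gbps link, which comes to about 9.8 TB when counted with binary prefixes.

```ruby
# Hypothetical packet record; the field names are assumptions for this sketch.
Packet = Struct.new(:src, :dst, :proto, :syn, :bytes)

# Bytes carried by a fully used link of `bps` bits per second over `seconds`.
def link_volume_bytes(bps, seconds)
  bps / 8 * seconds
end

# Filtering: keep only TCP SYN packets.
def filter_syn(packets)
  packets.select { |p| p.proto == :tcp && p.syn }
end

# Aggregation: flow-based accounting, here total bytes per (src, dst, proto).
def aggregate_flows(packets)
  flows = Hash.new(0)
  packets.each { |p| flows[[p.src, p.dst, p.proto]] += p.bytes }
  flows
end

# Sampling: deterministic 1-in-n sampling (keep the first packet of every n).
def sample_one_in(packets, n)
  packets.each_slice(n).map(&:first)
end

# Made-up packets for illustration.
PACKETS = [
  Packet.new("10.0.0.1", "10.0.0.9", :tcp, true,  60),
  Packet.new("10.0.0.1", "10.0.0.9", :tcp, false, 1500),
  Packet.new("10.0.0.1", "10.0.0.9", :tcp, false, 1500),
  Packet.new("10.0.0.2", "10.0.0.9", :udp, false, 200),
  Packet.new("10.0.0.3", "10.0.0.9", :tcp, true,  60),
  Packet.new("10.0.0.2", "10.0.0.9", :udp, false, 300),
]

day = link_volume_bytes(1_000_000_000, 24 * 60 * 60)
puts "1 Gbps for a day: %.1f TB (2^40 bytes each)" % (day / 2.0**40)
puts "SYN packets kept by filtering: #{filter_syn(PACKETS).size}"
puts "flows after aggregation: #{aggregate_flows(PACKETS).size}"
puts "packets kept by 1-in-3 sampling: #{sample_one_in(PACKETS, 3).size}"
```

Each technique trades detail for size in a different way: filtering keeps full detail for a subset, aggregation keeps totals but drops per-packet detail, and sampling keeps per-packet detail but only statistically.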
diverse traffic
- large variation in traffic mix between sites
  - backbone vs. access links
  - access line types: fiber, ADSL, modem, wireless, satellite
  - differences in bandwidth, delay, loss
typical traffic doesn’t exist!
constant change of traffic pattern
- daily, weekly traffic patterns
- trend changes over time
  - the web in the 90s and p2p in the 2000s completely changed traffic patterns
- hard to predict the future!
[figure: traffic (Mbps, 0 to 30) over time from 04/12 00:00 to 04/13 00:00, broken down by dst address (total and individual destination prefixes)]
limitations of Internet measurement
- problems often occur at boundaries of different networks
  - cooperation needed but not easy
- measurement affects the behavior of the observed network
  - need understanding and help from operators
  - need to understand operational requirements and find suitable methods for measurement
- cost: measurement doesn’t come free
  - limitations in measuring high-end routers with a PC
- privacy and confidential information in data
  - barriers for researchers to access commercial data
possible topics to be studied in the class
- search ranking (PageRank), online recommender systems (collaborative filtering)
- connections among SNS users, popular keyword extraction, shortest path search, online privacy
- SPAM filtering, MapReduce, Geolocation services, Web server log analysis
- Internet traffic, Internet topology
summary
Internet measurement and data analysis
- measurement is a basis for all technologies
- for networking, it is an attempt to observe invisible networks
- need to consider not only technical aspects but also social, political and economic aspects
theme of the class
- Internet measurement and data analysis as case studies
- learn how to measure what is difficult to measure
- learn how to extract useful information from huge data sets
Introduction to Ruby
Ruby
- a scripting language for object-oriented programming
- supports a wide range of functions for text processing and system management
- free software, started in 1993
- original author: Yukihiro Matsumoto
- became popular through Ruby on Rails (a web application framework)
Ruby information
Ruby official site: http://www.ruby-lang.org/
Ruby reference manual: http://www.ruby-lang.org/en/documentation/
Ruby no Arukikata (a first-steps guide to Ruby, in Japanese): http://jp.rubyist.net/magazine/?FirstStepRuby
Ruby characteristics
- interpreted language: no need to compile for execution
- highly portable: runs on most platforms
- simple syntax
  - no predefined data type for variables: variables can store any data and are dynamically typed
  - no need to declare variables: variable kinds (local variables, global variables, instance variables) can be inferred from variable names
  - garbage collection: users do not need to manage memory
- object-oriented
  - everything is an object
  - class, inheritance, methods
  - iterator and closure
  - control structures and procedures can be written in an object-oriented manner
- powerful string operations/regular expressions
- built-in support for large integers
- Ruby’s shortcomings: a bit slower than its competitors
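Several of the characteristics listed above can be tried out directly. The following is a small illustrative sketch, not part of the original exercise; the names `Counter`, `sum_with_block` and `first_number` are made up for this example.

```ruby
# Dynamic typing: a variable has no declared type and can be rebound freely.
value = 42
value = "forty-two"
puts value.class

# The kind of a variable is visible from its name: local, @instance, $global.
$created = 0                 # global variable

class Counter
  def initialize
    @count = 0               # instance variable
    $created += 1
  end

  # Adds n to the counter and returns the new count.
  def add(n = 1)
    @count += n
  end
end

# Everything is an object: even integer literals respond to methods.
puts 3.even?

# Built-in large integers: no overflow and no special library needed.
puts 2**100

# Iterator and closure: the block updates `total` from the enclosing scope.
def sum_with_block(values)
  total = 0
  values.each { |v| total += v }
  total
end

# String operations and regular expressions: first run of digits, or nil.
def first_number(text)
  m = text.match(/\d+/)
  m && m[0].to_i
end

counter = Counter.new
counter.add
puts counter.add(4)
puts sum_with_block([1, 2, 3])
puts first_number("received 1500 bytes").inspect
```

Pasting pieces of this into irb is a quick way to see each feature in isolation.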
Ruby commands
- irb: Ruby’s interactive interface
    $ irb --simple-prompt
    >> puts "Hello"
    Hello
- ruby: Ruby main program
    $ ruby test.rb
  or,
    $ ruby -e 'puts "Hello".reverse'
    olleH
exercise: a program to count text lines
count the number of text lines in a file given by the argument
    # count the lines of the file named by the first argument
    filename = ARGV[0]
    count = 0
    file = File.open(filename)
    while text = file.gets   # gets returns nil at end of file
      count += 1
    end
    file.close
    puts count
write to “count.rb” and then run it
    $ ruby count.rb foo.txt
rewrite it in a more rubyish way
    #!/usr/bin/env ruby
    # ARGF reads the files named on the command line (or standard input)
    count = 0
    ARGF.each_line do |line|
      count += 1
    end
    puts count
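As one more possible variant (a sketch, not the exercise's reference answer), the counting can be delegated to an enumerator. `File.foreach` without a block returns an enumerator over the lines of the file, and `count` consumes it line by line, so the whole file is never held in memory. The method name `count_lines` is made up for this example.

```ruby
#!/usr/bin/env ruby
# Count the lines of the file at `path`, reading it one line at a time.
def count_lines(path)
  File.foreach(path).count
end

# Run as a script: print the line count of the file given as the argument.
if __FILE__ == $0 && ARGV[0]
  puts count_lines(ARGV[0])
end
```

Unlike the ARGF version, this one takes exactly one file name and does not fall back to standard input.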
next class
Class 2 Data and variability (10/3)
- Summary statistics
- Sampling
- How to make good graphs
- exercise: graph plotting by Gnuplot
NO CLASS on 10/10
references
[1] Ruby official site. http://www.ruby-lang.org/
[2] gnuplot official site. http://gnuplot.info/
[3] Mark Crovella and Balachander Krishnamurthy. Internet measurement: infrastructure, traffic, and applications. Wiley, 2006.
[4] Pang-Ning Tan, Michael Steinbach and Vipin Kumar. Introduction to Data Mining. Addison Wesley, 2006.
[5] Raj Jain. The art of computer systems performance analysis. Wiley, 1991.
[6] Toby Segaran. Programming Collective Intelligence. O’Reilly Media, 2007.
[7] Allen B. Downey. Think Stats: Probability and Statistics for Programmers. O’Reilly Media, 2011. http://thinkstats.com/
[8] あきみち, 空閑洋平. インターネットのカタチ (The Shape of the Internet, in Japanese). オーム社, 2011.
[9] 井上洋, 野澤昌弘. 例題で学ぶ統計的方法 (Statistical Methods Learned through Examples, in Japanese). 創成社, 2010.
[10] 平岡和幸, 掘玄. プログラミングのための確率統計 (Probability and Statistics for Programming, in Japanese). オーム社, 2009.