Visualizing Mirrors
mirrors.ustc.edu.cn Server Log Analysis
©USTC LUG
August 14, 2012
李博杰 [email protected] Visualizing Mirrors
Outline
1 Requests & Traffic
    By Time
    By IP
    By Other Measures
2 Files
    File Characteristics
    How Files Are Requested
3 Sessions
4 Distributions Insight
    CentOS
    Fedora
    Ubuntu
    Eclipse
5 Technical Details
6 Query Optimization
Notes
The data is the access log of mirrors.ustc.edu.cn over 51 days. See the 'Technical Details' section for more information about the dataset.
Some graphs are in log-scale for clarity. Please note whether the x axis, the y axis, or both are in log-scale. The graph title sometimes lies.
Because a graph may contain many points, the data is sampled to reduce file size (the graphs are vector graphics), so some 'straight lines' may appear. I have checked the data to make sure the graphs illustrate real trends.
Title length is limited, so a title alone may not explain the graph well; please keep an eye on the axes and keys of the graph.
The graphs are shown in the hope of conveying information without words. For any questions or suggestions, please email me.
Requests & Traffic in a day
[Figure: "Requests & Traffic within a day". x axis: time of the day (00:00 to 24:00); left y axis: requests (0 to 400000); right y axis: traffic in bytes (0 to 5e+10). Series: request count and traffic, both Bezier smoothed.]
Requests & Traffic in a week
[Figure: "Requests & Traffic in different weekdays". x axis: day of the week (Monday to Sunday); left y axis: requests (0 to 7e+07); right y axis: traffic in bytes (0 to 8e+12). Series: request count and traffic.]
Requests & Traffic across 50 days
[Figure: "Requests & Traffic in 50 days". x axis: date (ticks from 05-20 to 07-15); left y axis: requests (0 to 1.2e+07); right y axis: traffic in bytes (0 to 1.2e+12). Series: request count and traffic.]
Statistics
                  Requests      Traffic
Total             328976877     36892 GB
Avg. per Day      6450527       723.4 GB
Max. per Day      8632963       1049.5 GB
Min. per Day      4868022       421.5 GB
Avg. per Hour     268771        30.14 GB
Max. per Hour     561925        79.75 GB
Min. per Hour     99506         2.97 GB
Avg. per Minute   4480          514.4 MB
Max. per Minute   14714         N/A
Min. per Minute   441           N/A
Avg. per Second   74.66         8779 KB
Max. per Second   2117          N/A
Min. per Second   1             N/A
Because the log records only the completion time of each request, and large requests can span hours, the max./min. traffic per minute and per second is not applicable.
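The averages in the table can be rederived from the two totals. The sketch below redoes that arithmetic under two assumptions that are not stated on the slide: the log spans exactly 51 days, and 1 GB = 1024 MB = 1024 * 1024 KB. Under these assumptions the results agree with the table up to rounding (for example, about 268772 versus 268771 requests per hour).

```python
# Totals and the 51-day span come from the slides.
TOTAL_REQUESTS = 328_976_877
TOTAL_TRAFFIC_GB = 36_892
DAYS = 51  # assumption: the log covers exactly 51 full days


def averages(total, days=DAYS):
    """Return (per_day, per_hour, per_minute, per_second) averages."""
    per_day = total / days
    per_hour = per_day / 24
    per_minute = per_hour / 60
    per_second = per_minute / 60
    return per_day, per_hour, per_minute, per_second


req_day, req_hour, req_min, req_sec = averages(TOTAL_REQUESTS)
gb_day, gb_hour, gb_min, gb_sec = averages(TOTAL_TRAFFIC_GB)

# Assumption: binary unit steps (1 GB = 1024 MB, 1 MB = 1024 KB).
mb_per_minute = gb_min * 1024
kb_per_second = gb_sec * 1024 * 1024

print(f"requests: {req_day:.0f}/day {req_hour:.0f}/hour "
      f"{req_min:.0f}/minute {req_sec:.2f}/second")
print(f"traffic:  {gb_day:.1f} GB/day {gb_hour:.2f} GB/hour "
      f"{mb_per_minute:.1f} MB/minute {kb_per_second:.0f} KB/second")
```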
Cumulative Requests per Hour
[Figure: "Sorted Requests per Hour". x axis: percentage of hours (0 to 100), sorted by request count; y axis: requests per hour (50000 to 600000).]
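A curve such as "Sorted Requests per Hour" can be produced by counting requests per clock hour, sorting the counts, and plotting each count against its rank as a percentage. The sketch below is one possible way to do that; the function names, the ascending sort order, and the sample timestamps are assumptions, not taken from the original analysis. Hours with no requests at all do not appear in the result.

```python
from collections import Counter
from datetime import datetime


def sorted_hourly_counts(timestamps):
    """Count requests per clock hour and return the counts sorted ascending."""
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted(per_hour.values())


def percentile_points(sorted_counts):
    """Pair each count with the percentage of hours at or below it."""
    n = len(sorted_counts)
    return [(100.0 * (i + 1) / n, count) for i, count in enumerate(sorted_counts)]


# Hypothetical sample: three requests in the 06:00 hour, one in the 07:00 hour.
sample = [
    datetime(2012, 5, 22, 6, 25, 37),
    datetime(2012, 5, 22, 6, 40, 0),
    datetime(2012, 5, 22, 6, 59, 59),
    datetime(2012, 5, 22, 7, 0, 0),
]
points = percentile_points(sorted_hourly_counts(sample))
print(points)
```

The same two functions work per minute or per second if the truncation in `replace(...)` is changed accordingly.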
Cumulative Requests per Minute
[Figure: "Sorted Requests per Minute". x axis: percentage of minutes (0 to 100), sorted by request count; y axis: requests per minute (0 to 16000).]
Cumulative Requests per Second
[Figure: "Sorted Requests per Second". x axis: percentage of seconds (0 to 100), sorted by request count; y axis: requests per second (0 to 450).]
Cumulative Traffic over IPs: 20%-80% law
[Figure: "Cumulative Traffic over unique IPs". x axis: unique IPs sorted by traffic, descending, labeled "Percentage of unique IP" (log-scale, ticks from 1 to 1e+07); y axis: cumulative traffic in bytes (0 to 4e+13).]
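The cumulative curve over unique IPs can be built by summing traffic per IP, sorting the sums in descending order, and taking a running total. The sketch below shows the idea on made-up data; the record layout `(ip, bytes_sent)` and both function names are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import accumulate


def cumulative_traffic_by_ip(records):
    """records: iterable of (ip, bytes_sent).

    Returns the running total of traffic, heaviest IP first.
    """
    per_ip = defaultdict(int)
    for ip, nbytes in records:
        per_ip[ip] += nbytes
    ranked = sorted(per_ip.values(), reverse=True)
    return list(accumulate(ranked))


def share_of_top(cumulative, fraction):
    """Traffic share carried by the top `fraction` of IPs (at least one IP)."""
    k = max(1, int(len(cumulative) * fraction))
    return cumulative[k - 1] / cumulative[-1]


# Hypothetical sample: one heavy client and four light ones.
records = [("a", 50), ("a", 30), ("b", 10), ("c", 5), ("d", 3), ("e", 2)]
cumulative = cumulative_traffic_by_ip(records)
print(cumulative, share_of_top(cumulative, 0.2))
```

Replacing `bytes_sent` with the constant 1 gives the corresponding curve for request counts.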
Cumulative Requests over IPs: 20%-80% law
[Figure: "Cumulative Requests over unique IPs". x axis: unique IPs sorted by request count, descending, labeled "Percentage of unique IP" (log-scale, ticks from 1 to 1e+07); y axis: cumulative requests (0 to 3.5e+08).]
IPv4 vs. IPv6
        Requests               Traffic
IPv4    318575688 (96.84%)     34180 GB (92.65%)
IPv6    10401189 (3.16%)       2712 GB (7.35%)
IPv6 still has a long way to go.
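A split like the one in this table can be computed by classifying each client address with Python's standard `ipaddress` module. The sketch below is illustrative only: the record layout `(address, bytes_sent)`, the function name, and the sample values are assumptions (`2001:db8::1` is a documentation address, not one from the log).

```python
import ipaddress
from collections import Counter


def family_shares(records):
    """records: iterable of (address, bytes_sent).

    Returns {ip_version: (request_share, traffic_share)}.
    """
    requests = Counter()
    traffic = Counter()
    for addr, nbytes in records:
        version = ipaddress.ip_address(addr).version  # 4 or 6
        requests[version] += 1
        traffic[version] += nbytes
    total_requests = sum(requests.values())
    total_bytes = sum(traffic.values())
    return {
        v: (requests[v] / total_requests, traffic[v] / total_bytes)
        for v in requests
    }


# Hypothetical sample: two IPv4 requests and one larger IPv6 request.
shares = family_shares([
    ("202.38.95.60", 100),
    ("2001:db8::1", 300),
    ("63.245.214.78", 100),
])
print(shares)
```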
Requests & Traffic TOP 40: xxx.0.0.0/8
[Figure: "Request & Traffic among IPv4 first fields". x axis: the top 40 first fields of the client address, with IPv6 as one group, beginning with IPv6, 222, 202, 113, 218; y axis: percentage, shown as a fraction (0 to 0.08). Series: requests and traffic.]
Traffic TOP 40: IPv4 addrs
[Figure: "Request & Traffic among popular IPv4 addrs". x axis: the 40 IPv4 addresses with the most traffic, beginning with 219.133.0.1, 218.242.250.212, 203.114.244.88; y axis: percentage, shown as a fraction (0 to 0.005). Series: requests and traffic.]
Request count TOP 40: IPv4 addrs
[Figure: "Request & Traffic among popular IPv4 addrs". x axis: the 40 IPv4 addresses with the most requests, beginning with 63.245.214.78, 209.132.181.102, 113.111.38.40; y axis: percentage, shown as a fraction (0 to 0.03). Series: requests and traffic.]
USTC Mirrors Usage (IPv4 only)
IP range Requests Traffic Note
202.38.64.0-202.38.95.255 994661 (0.30%) 149.3 GB (0.40%) CERNET
210.45.64.0-210.45.79.255 191141 (0.06%) 68.31 GB (0.19%) CERNET
210.45.112.0-210.45.127.255 243976 (0.07%) 37.59 GB (0.10%) CERNET
211.86.144.0-211.86.159.255 81035 (0.02%) 24.96 GB (0.07%) CERNET
222.195.64.0-222.195.95.255 319435 (0.10%) 86.88 GB (0.24%) CERNET
114.214.160.0-114.214.255.255 0 0 CERNET
210.72.22.0-210.72.22.255 3622 (0.00%) 11.86 MB (0.00%) TechNet (?)
218.22.21.0-218.22.21.31 1 (0.00%) 0.01 MB (0.00%) China Telecom
218.104.71.160-218.104.71.175 0 0 China Unicom
202.141.160.0-202.141.175.255 123455 (0.04%) 12.60 GB (0.03%) China Telecom
202.141.176.0-202.141.191.255 187 (0.00%) 120.4 MB (0.00%) China Mobile
Total 1957513 (0.60%) 379.77 GB (1.03%) USTC IPv4
Data source of USTC IP range: http://lib.ustc.edu.cn/ustcip.html
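Attributing requests to the campus ranges amounts to testing whether a client address falls inside one of the listed intervals. The sketch below does this with the standard `ipaddress` module, converting each interval to CIDR networks first. Only three of the ranges from the table are included, and the names `USTC_RANGES`, `to_networks`, and `is_ustc` are hypothetical.

```python
import ipaddress

# Three of the ranges from the table; the others are omitted for brevity.
USTC_RANGES = [
    ("202.38.64.0", "202.38.95.255"),
    ("210.45.64.0", "210.45.79.255"),
    ("218.22.21.0", "218.22.21.31"),
]


def to_networks(ranges):
    """Convert (first, last) address pairs into a flat list of CIDR networks."""
    networks = []
    for first, last in ranges:
        networks.extend(
            ipaddress.summarize_address_range(
                ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)
            )
        )
    return networks


USTC_NETWORKS = to_networks(USTC_RANGES)


def is_ustc(addr):
    """True if addr lies in one of the listed ranges (IPv6 is never matched)."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in USTC_NETWORKS)


print(USTC_NETWORKS)
print(is_ustc("202.38.95.60"), is_ustc("63.245.214.78"))
```

For a log with hundreds of millions of rows a linear scan over the networks is slow; sorting the integer range bounds and using binary search would be a natural refinement.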
Requests & Traffic of distributions
[Figure: "Request & Traffic of Distributions". x axis: distributions, beginning with eclipse, fedora, ubuntu, centos, debian; y axis: percentage, shown as a fraction (0 to 0.35). Series: requests and traffic.]
Requests & Traffic of distributions
[Figure: "Request & Traffic of Distributions". x axis: distributions, beginning with centos, NULL, eclipse, fedora, ubuntu; y axis: percentage, shown as a fraction (0 to 0.35). Series: requests and traffic.]
Requests & Traffic among HTTP Status Codes
[Figure: "Request & Traffic among HTTP status codes". x axis: HTTP status code (200, 206, 301, 304, 400, 403, 404, 405, 408, 416, 499, 500, 502); y axis: percentage, shown as a fraction (0 to 1). Series: requests and traffic.]
Requests & Traffic among User Agents
[Figure: "Request & Traffic of User Agents". x axis: user agent (only the largest 30 are shown), beginning with Mozilla, urlgrabber, Debian, Wget, Jakarta; y axis: percentage, shown as a fraction (0 to 0.35). Series: requests and traffic.]
Traffic per Request order by Length
[Figure: "Request Num sorted by Request Length". x axis: request num (ticks from 5e+07 to 3.5e+08); y axis: request length (log-scale, 1 to 1e+10).]
Cumulative Traffic sorted by Request Length
[Figure: "Cumulative Request Traffic sorted by Request Length". x axis: request length (log-scale, 100 to 1e+10); y axis: cumulative traffic (0 to 4e+13).]
Request Num sorted by Request Length
[Figure: "Request Num sorted by Request Length". x axis: request length (log-scale, 1 to 1e+09); y axis: request num (0 to 1.2e+08).]
File Number at different FileSizes (log scale)
[Figure: "Sorted Filenum by Filesize". x axis: file size (log-scale, 1 to 1e+10); y axis: file num (0 to 25000).]
Total Size at different Filesizes (log scale)
[Figure: "Total Size at different Filesizes (normal scale)". x axis: file size (ticks from 100 to 1e+10); y axis: file size * file num (0 to 2e+10).]
Cumulative Filesize per File Num (normal scale)
[Figure: "Accumulated FileSize by FileNum sorted by Size (normal scale)". x axis: file num (0 to 1.1e+07); y axis: file size (0 to 1.4e+13).]
Cumulative Filesize per File Num (log scale)
[Figure: "Accumulated FileSize by FileNum sorted by Size (log scale y)". x axis: file num (0 to 1.1e+07); y axis: file size (log-scale, 1e+07 to 1e+14).]
Cumulative Filesize per FileSize (normal scale)
[Figure: "Accumulated Filesize by Filesize". x axis: file size (0 to 7e+09); y axis: accumulated file size (0 to 1.4e+13).]
Cumulative Filesize per FileSize (log scale)
[Figure: "Accumulated Filesize by Filesize". x axis: file size (log-scale, 1 to 1e+10); y axis: accumulated file size (0 to 1.4e+13).]
File Extensions TOP 40 (order by Total Size)
[Figure: "FileNum and Filesize of popular file extensions". x axis: file extension, beginning with tbz, rpm, iso, deb, gz; y axis: percentage, shown as a fraction (0 to 0.4). Series: total num and total size.]
File Extensions TOP 40 (order by File Num)
[Figure: "FileNum and Filesize of popular file extensions". x axis: file extension, beginning with tbz, rpm, deb, gz, png; y axis: percentage, shown as a fraction (0 to 0.4). Series: total num and total size.]
How many files share the same filename (extension excluded)
[Figure: "FileNum and FileSize of Filenames with Different Number of Shares". x axis: number of files sharing the same filename (log-scale, 1 to 1e+06); y axis: percentage, shown as a fraction (log-scale, 0.0001 to 1). Series: total num and total size.]
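The grouping behind this kind of plot can be sketched as follows: take the base name of every path, optionally strip the extension, count how many files share each name, and then tally how many files fall into each group size. This is only an illustration; how the original analysis defined an extension (for example for names such as `c,v`) is not stated, so `posixpath.splitext` is an assumption, as are the function name and the sample paths.

```python
import posixpath
from collections import Counter


def share_histogram(paths, keep_extension=False):
    """Map group size -> number of files whose name is shared by that many files."""
    def key(path):
        name = posixpath.basename(path)
        return name if keep_extension else posixpath.splitext(name)[0]

    per_name = Counter(key(p) for p in paths)
    histogram = Counter()
    for count in per_name.values():
        histogram[count] += count  # every file in the group is counted
    return histogram


# Hypothetical sample paths.
paths = [
    "debian/README.txt",
    "ubuntu/README.txt",
    "gnu/README.md",
    "centos/os.iso",
]
without_ext = share_histogram(paths)
with_ext = share_histogram(paths, keep_extension=True)
print(dict(without_ext), dict(with_ext))
```

Dividing each value by the total number of files gives the fractions shown on the y axis; summing file sizes instead of counts gives the "total size" series.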
How many files share the same filename and extension
[Figure: "FileNum and FileSize of Filenames with Different Number of Shares". x axis: number of files sharing the same filename and extension (log-scale, 1 to 1e+06); y axis: percentage, shown as a fraction (log-scale, 0.0001 to 1). Series: total num and total size.]
Num & Size of Files in each Distribution (sorted by Size)
[Figure: "Num & Size of Files in each Distribution". x axis: distributions, beginning with freebsd, fedora, debian-cd, scientificlinux, debian; y axis: percentage, shown as a fraction (0 to 0.45). Series: number and size.]
Num & Size of Files in each Distribution (sorted by Num)
[Figure: "Num & Size of Files in each Distribution". x axis: distributions, beginning with freebsd, fedora, kde-applicationdata, debian, ubuntu; y axis: percentage, shown as a fraction (0 to 0.45). Series: number and size.]
Cumulative Requests over FileSize order by Size
[Figure: "Cumulative Requests Num per File order by Filesize". x axis: labeled "FileSize" (0 to 1.2e+07); y axis: accumulated requests count (0 to 2e+08).]
Cumulative Traffic over non-cumu. FileSize
[Figure: "Cumulative Traffic over non-cumulated Filesize". x axis: labeled "FileSize" (0 to 1.1e+07); y axis: accumulated traffic (0 to 2.5e+13).]
Cumulative Traffic over non-cumu. Filesize (log-scale)
[Figure: "Cumulative Traffic over non-cumulated FileSize". x axis: labeled "FileSize" (0 to 1.1e+07); y axis: cumulated traffic (log-scale, 1e+09 to 1e+14).]
Cache 1: Cumu. Traffic over FileSize order by Size
[Figure: "Cumulative Traffic over Cumulative FileSize". x axis: cumulative file size, files ordered by file size (0 to 1.4e+13); y axis: cumulative traffic (0 to 2.5e+13).]
Cache 2: Cumu. Traffic over FileSize order by Size DESC
[Figure: "Cumulative Traffic over Cumulative FileSize". x axis: cumulative file size, files ordered by file size, descending (0 to 1.4e+13); y axis: cumulative traffic (0 to 2.5e+13).]
Cache 3: Cumu. Traffic over FileSize order by Req Num
[Figure: "Cumulative Traffic over Cumulative FileSize". x axis: cumulative file size, files ordered by request num, descending (0 to 1.4e+13); y axis: cumulative traffic (0 to 2.5e+13).]
Cache 4: Cumu. Traffic over FileSize order by Traffic
[Figure: "Cumulative Traffic over Cumulative FileSize". x axis: cumulative file size, files ordered by the traffic of each file, descending (0 to 1.4e+13); y axis: cumulative traffic (6e+12 to 2.6e+13).]
Cache 5: Cumu. Traffic order by Traffic/FileSize
[Figure: "Cumulative Traffic over Cumulative FileSize". x axis: cumulative file size, files ordered by traffic/filesize, descending (0 to 1.4e+13); y axis: cumulative traffic (0 to 2.5e+13).]
Comparison of the Previous five ‘Caching Policies’
[Figure: "Cumulative Traffic over Cumulative FileSize". x axis: cumulative file size, ordered by different metrics (0 to 1.4e+13); y axis: cumulative traffic (0 to 2.5e+13). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC, Traffic/FileSize DESC.]
Comparison of ‘Caching Policies’ (log-scale)
[Figure: "Cumulative Traffic over Cumulative FileSize". x axis: cumulative file size, ordered by different metrics (log-scale, 100 to 1e+14); y axis: cumulative traffic (0 to 2.5e+13). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC, Traffic/FileSize DESC.]
Comparison of ‘Caching Policies’
Among these static caching policies, caching the files with the most traffic or the most requests is acceptable.
Caching the files that historically carried the most traffic performs well: a 10 GB cache of 85 files can cover 40% of the total traffic. As the cache size continues to grow, cache efficiency deteriorates, since the x axis of the graph is in log-scale.
Caching the files with the largest traffic/filesize ratio performs best (by definition): a 10 GB cache of 7162 files can cover 58% of the total traffic.
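The static policies compared here all follow the same recipe: rank the files by some metric, then admit them in that order until the cache budget is used up, and measure how much historical traffic the admitted files carried. The sketch below is a minimal illustration of that recipe, assuming every file has a positive size; the function names and the toy numbers are hypothetical and do not come from the log.

```python
def cache_coverage(files, budget_bytes, metric):
    """Fill a static cache with the top-ranked files under `metric`.

    files: list of (size_bytes, traffic_bytes), one entry per file.
    Files are admitted in descending metric order until the next one
    no longer fits, mirroring a cumulative-size curve.
    Returns (files_cached, fraction_of_total_traffic_covered).
    """
    total_traffic = sum(traffic for _, traffic in files)
    used = covered = count = 0
    for size, traffic in sorted(files, key=metric, reverse=True):
        if used + size > budget_bytes:
            break
        used += size
        covered += traffic
        count += 1
    return count, covered / total_traffic


def by_traffic(entry):
    """Policy: rank by the traffic the file carried."""
    return entry[1]


def by_ratio(entry):
    """Policy: rank by traffic per byte of cache space (size must be > 0)."""
    return entry[1] / entry[0]


# Toy data: one large popular file and several small ones.
files = [(100, 1000), (10, 500), (10, 400), (50, 100)]
small_traffic = cache_coverage(files, 20, by_traffic)
small_ratio = cache_coverage(files, 20, by_ratio)
large_traffic = cache_coverage(files, 120, by_traffic)
print(small_traffic, small_ratio, large_traffic)
```

With the small budget the traffic-ranked policy admits nothing because its top file does not fit, while the ratio-ranked policy covers 45% of the toy traffic with two small files. This is the same effect that lets the ratio policy cover more traffic at 10 GB with many more, smaller files.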
Details of Most Traffic Caching (log-scale)
[Figure: "Cumulative Ratio over Cumulative FileSize". x axis: cumulative file size, files ordered by the traffic of each file, descending (log-scale, 1e+08 to 1e+14); y axis: cumulative ratio (0 to 1). Series: cumulative traffic (%), cumulative requests (%), cumulative file num (%).]
Details of Traffic/FileSize Caching (log-scale)
[Figure: "Cumulative Ratio over Cumulative FileSize". x axis: cumulative file size, files ordered by traffic/filesize, descending (log-scale, 1000 to 1e+14); y axis: cumulative ratio (0 to 1). Series: cumulative traffic (%), cumulative requests (%), cumulative file num (%).]
An Alternate Metric: Request Hits
[Figure: "Cumulative Requests over Cumulative FileSize". x axis: cumulative file size, ordered by different metrics (0 to 1.4e+13); y axis: cumulative requests (0 to 2e+08). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC.]
An Alternate Metric: Request Hits (log-scale)
[Figure: "Cumulative Requests over Cumulative FileSize". x axis: cumulative file size, ordered by different metrics (log-scale, 100 to 1e+14); y axis: cumulative requests (0 to 2e+08). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC.]
Num & Size of Never-accessed Files (% of total)
[Figure: "Num & Size of Never-Accessed Files in each Distribution". x axis: distributions, beginning with freebsd, fedora, kde-applicationdata, debian, ubuntu; y axis: percentage of total, shown as a fraction (0 to 0.45). Series: number and size.]
Num & Size of Never-accessed Files (% of total)
[Figure: "Num & Size of Never-Accessed Files in each Distribution". x axis: distributions, beginning with freebsd, fedora, debian-cd, scientificlinux, debian; y axis: percentage of total, shown as a fraction (0 to 0.45). Series: number and size.]
Num & Size of Never-accessed Files (% of distribution)
[Figure: "Num & Size of Never-Accessed Files in each Distribution". x axis: distributions and top-level directories; y axis: percentage of distribution, shown as a fraction (0.55 to 1). Series: number and size.]
Num & Size of Never-accessed Files (% of distribution)
[Figure: "Num & Size of Never-Accessed Files in each Distribution". x axis: distributions and top-level directories, beginning with knoppix-dvd, mozilla-current, html, contrib, bin; y axis: percentage of distribution, shown as a fraction (0.75 to 1). Series: number and size.]
Num & Size of Ever-accessed Files (% of distribution)
[Figure: "Num & Size of Ever-Accessed Files in each Distribution". x axis: distributions, beginning with fink, CRAN, CPAN, help, cygwin; y axis: percentage of distribution, shown as a fraction (0 to 1). Series: number and size.]
Num & Size of Ever-accessed Files (% of distribution)
[Figure: "Num & Size of Ever-Accessed Files in each Distribution". x axis: distributions, beginning with centos, cygwin, deepin-cd, puppy, CTAN; y axis: percentage of distribution, shown as a fraction (0 to 0.7). Series: number and size.]
Discovering Sessions
Definition. Two requests are within the same Interval Session iff:
    they have the same IP address;
    their time difference does not exceed some limit.
Definition. A Gap Session is a longest sequence of requests where:
    all requests are from the same IP address;
    the time difference between every two adjacent requests does not exceed some limit.
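The Gap Session definition translates directly into a single pass over each IP's requests in time order: a request joins the current session if it is close enough to the previous request, otherwise it opens a new session. The sketch below is one way to write this; the record layout `(ip, time_in_seconds)` and the function name are assumptions, not the original implementation.

```python
from collections import defaultdict


def gap_sessions(requests, limit_seconds):
    """Group requests into Gap Sessions.

    requests: iterable of (ip, time_in_seconds).
    Returns {ip: [[t0, t1, ...], ...]} with one inner list per session.
    """
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)

    sessions = {}
    for ip, times in by_ip.items():
        times.sort()
        current = [times[0]]
        result = [current]
        for ts in times[1:]:
            if ts - current[-1] <= limit_seconds:
                current.append(ts)  # close enough to the previous request
            else:
                current = [ts]  # gap too large: start a new session
                result.append(current)
        sessions[ip] = result
    return sessions


# Hypothetical sample with a 30-minute (1800 s) limit.
sample = [("a", 0), ("a", 100), ("a", 5000), ("b", 50)]
found = gap_sessions(sample, 1800)
print(found)
```

Because a session keeps growing as long as each gap stays under the limit, a client that polls more often than the limit produces one session that never ends.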
Discovering Sessions (continued)
Since Mirrors is a resource-downloading site, many sessions download many files over several hours. The time limit of an Interval Session is 60 minutes.
The time limit of a Gap Session is 30 minutes. Because only the completion time of each request is logged, a long download that extends over 30 minutes breaks the session into two.
Both algorithms suffer from false positives and false negatives.
Some IPs access the site so frequently that their gap session never ends (see the next slide), making the Gap Session data highly biased.
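The slides define Interval Sessions only pairwise, so how sessions are anchored is not specified. The sketch below assumes one plausible reading: a session is a window that opens at its first request, and a request belongs to it if it arrives within the limit of that first request. The function name is hypothetical, and the 60-minute default is the limit quoted above.

```python
def interval_sessions(times, limit_seconds=60 * 60):
    """Split ONE IP's request times (in seconds) into Interval Sessions.

    Assumed reading: a new session starts whenever a request arrives more
    than limit_seconds after the FIRST request of the current session.
    """
    sessions = []
    for ts in sorted(times):
        if sessions and ts - sessions[-1][0] <= limit_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical sample: one request every 25 minutes.
steady = interval_sessions([0, 1500, 3000, 4500])
print(steady)
```

In this sample every gap is 25 minutes, so with a 30-minute gap limit all four requests would form a single Gap Session, while the 60-minute window splits them after the third request. Under this reading the window bounds session length even for very frequent clients.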
Average Session Duration over IP (log-scale)
[Figure: "Average Duration over Unique IPs". x axis: unique IPs sorted by duration, descending, labeled "Percentage of unique IP" (log-scale, 1 to 1e+07); y axis: average duration (log-scale, 0.001 to 1e+07). Series: Gap Session and Interval Session.]
.. Average Session Duration over IP (normal-scale)
[Figure: Average Duration over Unique IPs. x axis: Percentage of unique IP (sorted by Duration DESC); y axis: Average Duration (log-scale); series: Gap Session, Interval Session.]
.. Gap Session Statistics in a day
[Figure: Session statistics at different Time in a day (Gap Sessions). x axis: Time in a day; y axis: Ratio of total; series, all Bezier smoothed: Session count, Request count, Duration, Traffic.]
.. Explanation
How strange the three graphs look!
- Durations are always integers (seconds), so there are straight lines in the duration graphs.
- Many sessions are actually single requests, so their duration is zero; their number can be seen in the normal-scale graph.
- About 40 sessions extend through the whole 51 days, and more sessions extend over fewer days. Their number is the length of the red horizontal line in the log-scale graph.
- Because the log starts at May 22 06:25:37, the requests in these long-lived Gap Sessions accumulate into a horrible peak at 06:27 in the Gap Session statistics (much higher than shown, since the original data is noisy and the graph is Bezier-smoothed).
.. Cumulative Over Long-live Gap Sessions
[Figure: Cumulative Ratio of Requests, Duration and Traffic over Sessions. x axis: Sessions order by Duration DESC (first 10000); y axis: Cumulative Ratio of Total; series: Request count, Duration, Traffic.]
.. Ratio to Average for Long-live Gap Sessions (log-scale)
[Figure: Requests, Duration and Traffic over Sessions. x axis: Sessions order by Duration DESC (log-scale); y axis: Ratio to Average (log-scale); series: Request count, Duration, Traffic.]
.. Statistics of Interval Sessions
          Total            Average    Max       Min   Std. Dev
Requests  328976877        7.5020     186043    1     173.1824
Traffic   36277 GB         867.46 KB  Overflow  0     15.227 MB
Duration  4.026 × 10^10 s  918.13 s   3599 s    -6 s  1502.9 s
43852053 sessions in total.
Sessions with negative duration occur because log entries are occasionally not in non-decreasing time order. I am not going to fix this, since these 230 wrong sessions have little influence.
.. Statistics of Gap Sessions
          Total           Average    Max        Min    Std. Dev
Requests  328976877       6.6200     8432330    1      1255.9488
Traffic   35291 GB        744.67 KB  Overflow   0      15.184 MB
Duration  7.666 × 10^9 s  154.27 s   4406403 s  -27 s  7060.6 s
49694582 sessions in total.
These figures are given for comparison with Interval Sessions.
The statistics that follow are based on Interval Sessions unless noted otherwise.
.. Cumulatives over Sessions order by Traffic DESC
[Figure: Cumulative stat over Sessions order by Traffic DESC. x axis: Number of Sessions (order by traffic DESC); y axis: Cumulative stat; series: Requests count, Duration, Traffic.]
.. Cumulatives over Sessions order by Traffic DESC
[Figure: Cumulative stat over Sessions order by Traffic DESC. x axis: Number of Sessions (log-scale, order by traffic DESC); y axis: Cumulative stat; series: Requests count, Duration, Traffic.]
.. Cumulatives over Sessions order by Requests DESC
[Figure: Cumulative stat over Sessions order by Requests DESC. x axis: Number of Sessions (log-scale, order by Requests DESC); y axis: Cumulative stat; series: Requests count, Duration, Traffic.]
.. Cumulatives over Sessions order by Duration DESC
[Figure: Cumulative stat over Sessions order by Duration DESC. x axis: Number of Sessions (log-scale, order by Duration DESC); y axis: Cumulative stat; series: Requests count, Duration, Traffic.]
.. Cumulative Session Duration over Unique IPs
[Figure: Cumulative Interval Session Duration over Unique IPs. x axis: Percentage of unique IP (log-scale, sorted by Duration DESC); y axis: Cumulative Duration.]
.. Sessions in a day
[Figure: Session statistics at different Time in a day. x axis: Time in a day; y axis: Ratio of total; series, all Bezier smoothed: Session count, Request count, Duration, Traffic.]
.. Sessions across days
[Figure: Session statistics across days. x axis: Date (05-20 to 07-15); y axis: Ratio of total; series: Session count, Request count, Duration, Traffic.]
.. Sessions in Distributions order by Traffic
[Figure: Sessions in each Distribution (order by Traffic). x axis: distribution (eclipse, fedora, centos, ubuntu, debian, tdf, cygwin, CTAN, archlinux, mozilla-current, opensuse, gentoo, kde-applicationdata, Ubuntu, kde, backtrack, ubuntu-releases, gnu, CRAN, NULL); y axis: Percentage; series: Number, Requests, Duration, Traffic.]
.. Sessions in Distributions order by Requests
[Figure: Sessions in each Distribution (order by Requests). x axis: distribution (centos, NULL, eclipse, fedora, ubuntu, ubuntu-releases, mozilla-current, tdf, debian, CTAN, backtrack, opensuse, Ubuntu, deepin-cd, kde-applicationdata, linuxmint, debian-cd, cygwin, archlinux, kde); y axis: Percentage; series: Number, Requests, Duration, Traffic.]
.. Sessions in Distributions order by Session Num
[Figure: Sessions in each Distribution (order by Session Num). x axis: distribution (NULL, centos, fedora, eclipse, ubuntu, mozilla-current, debian, ubuntu-releases, kde-applicationdata, opensuse, epel, tdf, archlinux, CTAN, cygwin, CRAN, backtrack, deepin-cd, gnu, kde); y axis: Percentage; series: Number, Requests, Duration, Traffic.]
.. Sessions in Distributions order by Duration
[Figure: Sessions in each Distribution (order by Duration). x axis: distribution (centos, NULL, fedora, eclipse, ubuntu, mozilla-current, debian, ubuntu-releases, epel, tdf, archlinux, opensuse, CTAN, cygwin, kde-applicationdata, backtrack, gnu, deepin-cd, kde, gentoo); y axis: Percentage; series: Number, Requests, Duration, Traffic.]
.. Sessions and User Agents
[Figure: Sessions in each User Agent (order by Session Num DESC). x axis: user agent (NULL, urlgrabber, Mozilla, Jakarta, Debian, Ubuntu, ZYpp, Java, pacman, Wget, Python, Eclipse, curl, libwww, Homebrew, Opera, Fedora, Preupgrade, MirrorBrain, Sosospider); y axis: Percentage; series: Number, Requests, Duration, Traffic.]
.. CentOS Basic Stat
          Value      % of total  Rank
Requests  100986986  30.7%       1
Traffic   5252.5 GB  14.2%       4
Files     70043      0.7%        19
FileSize  172.8 GB   1.3%        12
Sessions  19452260   44.4%       1
.. CentOS Basic Stat - Average
                  Average   Ratio   Std.Dev     Ratio
Request Length    55846     0.464   882920      0.289
File Size         2665431   1.953   54320911    2.294
File Requests     1402.46   76.994  92327       11.029
File Traffic      74454404  32.12   1231957014  1.784
Session Duration  1051.19   1.145   1586.69     1.056
Session Requests  5.796     0.773   137.5       0.794
Session Traffic   336791    0.379   8931320     0.559
- Std.Dev stands for Standard Deviation.
- 'Ratio' in the third column is the ratio of the distribution average to the global average; 'Ratio' in the fifth column is the ratio of the distribution Std.Dev to the global Std.Dev.
- Size and Traffic are in bytes; Duration is in seconds.
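As a worked check of the 'Ratio' columns, the two session rows can be recomputed from the global Interval Session figures given earlier (average duration 918.13 s, Std. Dev 1502.9 s; average requests 7.5020, Std. Dev 173.1824). The helper name `ratio` is hypothetical, and rounding to three decimals is an assumption that happens to reproduce the table.

```python
def ratio(distribution_value, global_value):
    """Ratio of a per-distribution figure to the global figure, 3 decimals."""
    return round(distribution_value / global_value, 3)


# CentOS values (this slide) over global Interval Session values.
duration_avg_ratio = ratio(1051.19, 918.13)    # Session Duration, Average
duration_std_ratio = ratio(1586.69, 1502.9)    # Session Duration, Std.Dev
requests_avg_ratio = ratio(5.796, 7.5020)      # Session Requests, Average
requests_std_ratio = ratio(137.5, 173.1824)    # Session Requests, Std.Dev
```

A ratio above 1 means CentOS sessions exceed the global figure (they last longer on average), while a ratio below 1 means they fall short of it (they contain fewer requests).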
.. Requests & Traffic among CentOS Versions
[Figure: Request & Traffic of Subdirectories in centos. x axis: first-level subdirectory (CentOS versions such as 5.8, 6.2 and 6.3, plus some malformed, URL-encoded request paths); y axis: Percentage; series: Requests, Traffic.]
.. Requests & Traffic among CentOS 2-level Subdirs
[Figure: Request & Traffic of Subdirectories in centos. x axis: second-level subdirectory (first entries: 5.8/updates, 6.2/updates, 5.8/os, 6.2/os, 5/updates, 6.3/os, 5/os, 6/os, 6/updates, 5.8/extras); y axis: Percentage; series: Requests, Traffic.]
.. Fedora Basic Stat
          Value      % of total  Rank
Requests  36329528   11.0%       4
Traffic   7509.2 GB  20.3%       2
Files     924620     8.8%        2
FileSize  1207.8 GB  9.4%        2
Sessions  3596727    8.2%        2
.. Fedora Basic Stat - Average
                  Average  Ratio  Std.Dev    Ratio
Request Length    221945   1.843  2277229    0.744
File Size         1407150  1.031  21376814   0.903
File Requests     28.3134  1.554  6563       0.784
File Traffic      3601714  1.554  190851795  0.276
Session Duration  1226     1.336  1598       1.063
Session Requests  10.8047  1.440  251.43     1.452
Session Traffic   2238126  2.520  21424912   1.342
.. Requests & Traffic among Fedora Subdirs
[Figure: Request & Traffic of Subdirectories in fedora. x axis: first-level subdirectory (linux, epel, rpmfusion, followed by many paths that apparently come from malformed or malicious requests, such as PHP injection attempts); y axis: Percentage; series: Requests, Traffic.]
.. Requests & Traffic among Fedora 2-level Subdirs
[Figure: Request & Traffic of Subdirectories in fedora. x axis: second-level subdirectory (first entries: linux/updates, epel/5, linux/development, linux/releases, epel/6, rpmfusion/free, rpmfusion/nonfree, epel/5Server, epel/testing); y axis: Percentage; series: Requests, Traffic.]
.. Ubuntu Basic Stat
          Value      % of total  Rank
Requests  19193451   5.8%        5
Traffic   5653.1 GB  15.3%       3
Files     608361     5.8%        5
FileSize  584.6 GB   4.6%        6
Sessions  160800     0.4%        4
.. Ubuntu Basic Stat - Average
                  Average   Ratio   Std.Dev    Ratio
Request Length    316251    2.626   2950601    0.964
File Size         1100038   0.806   8926254    0.377
File Requests     22.2714   1.222   860.6907   0.103
File Traffic      8152155   3.517   551019922  0.798
Session Duration  636.37    0.693   1053.94    0.701
Session Requests  110.93    14.788  294.68     1.702
Session Traffic   34260641  38.570  101247464  6.341
.. Requests & Traffic among Ubuntu Subdirs
[Figure: Request & Traffic of Subdirectories in ubuntu. x axis: first-level subdirectory (pool, dists, followed by release names and malformed request paths); y axis: Percentage; series: Requests, Traffic.]
.. Requests & Traffic among Ubuntu 2-level Subdirs
[Figure: Request & Traffic of Subdirectories in ubuntu. x axis: second-level subdirectory (first entries: pool/main, pool/universe, dists/precise, dists/lucid, dists/oneiric, dists/natty, pool/restricted, pool/multiverse); y axis: Percentage; series: Requests, Traffic.]
.. Eclipse Basic Stat
          Value       % of total  Rank
Requests  51279473    15.6%       3
Traffic   11498.6 GB  31.2%       1
Files     263355      2.5%        8
FileSize  161.1 GB    1.3%        14
Sessions  524586      1.2%        3
.. Eclipse Basic Stat - Average
                  Average   Ratio   Std.Dev     Ratio
Request Length    240769    2.000   5813562     1.901
File Size         700002    0.513   8457323     0.357
File Requests     113.5340  6.233   17894.2     2.137
File Traffic      28156070  12.148  4197033259  6.079
Session Duration  673.17    0.733   1059.08     0.704
Session Requests  97.1089   12.944  782.80      4.520
Session Traffic   22130605  24.914  62746610    3.930
.. Requests & Traffic among Eclipse Subdirs
[Figure: Request & Traffic of Subdirectories in eclipse. x axis: first-level subdirectory (first entries: technology, eclipse, releases, tools, mylyn, webtools); y axis: Percentage; series: Requests, Traffic.]
.. Requests & Traffic among Eclipse 2-level Subdirs
[Figure: Request & Traffic of Subdirectories in eclipse. x axis: second-level subdirectory (first entries: technology/epp, eclipse/downloads, releases/indigo, eclipse/updates, releases/juno, tools/cdt, mylyn/drops, webtools/downloads); y axis: Percentage; series: Requests, Traffic.]
.. Data Source
- Nginx access log of mirrors.ustc.edu.cn
  - From 2012-05-22 to 2012-07-12, 51 days
  - 4041 MB compressed, 62383 MB decompressed
  - Thanks to Guo JiaHua for providing the data
- File list of mirrors.ustc.edu.cn, crawled by a spider
  - FTP for CPAN and CRAN
  - HTTP for other directories
  - Need to detect symlinks to the parent directory (e.g. /ubuntu/ubuntu/…)
- Scripts are written in bash and PHP. All scripts are available on GitHub: https://github.com/bojieli/mirrors-log
.. Saving Data in BRIGHTHOUSE
- Scanning through such an amount of data is time-consuming, and the data is not too large to fit in a relational database.
- I tried InfoBright and InfiniDB with 6.4 × 10^6 rows of artificial data. InfiniDB is faster in queries, while InfoBright takes much less disk space.
- InfoBright's compression is comparable to gzip: 4316 MB table size versus 4041 MB gzipped, even though I added some additional columns for faster statistics.
- I do not have much disk space, so I chose InfoBright (a backend of MySQL).
- Most of the queries I used (mostly GROUP BY, WHERE) take less than 2 minutes.
- Create two FIFOs for each log: zcat logfile > php (preprocessing) > mysql-ib LOAD DATA INFILE
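The pipeline idea, which decompresses, preprocesses and loads in one stream so that the 62 GB of raw text never has to be stored, can be imitated in Python. This is a sketch of the streaming pattern only; the author's pipeline uses zcat, PHP and FIFOs, and the function name `stream_rows` and the tab-separated output are assumptions.

```python
import gzip
import io


def stream_rows(gz_file, preprocess):
    """Yield one tab-separated row per log line, decompressing on the fly.

    gz_file may be a path or a binary file object; preprocess maps a line
    to a list of fields, or to None when the line should be skipped.
    """
    with gzip.open(gz_file, "rt", encoding="utf-8", errors="replace") as lines:
        for line in lines:
            fields = preprocess(line.rstrip("\n"))
            if fields is not None:
                yield "\t".join(str(field) for field in fields)


# Tiny in-memory stand-in for a compressed log file.
compressed = io.BytesIO(gzip.compress(b"1.2.3.4 200 512\n5.6.7.8 404 0\n"))
rows = list(stream_rows(compressed, str.split))
```

Because `stream_rows` is a generator, only one line is held in memory at a time; the consumer (in the slides, LOAD DATA INFILE reading from a FIFO) pulls rows as fast as it can load them.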
.. Preprocessing
- Make queries faster (I'm afraid of full-table scans)
- ip => integers: ipv4 ipv4_0 ipv4_1 ipv4_2 ipv4_3 ipv6_0 ipv6_1 ipv6_2 ipv6_3
- time => integers: time year yearday weekday daymin daytime hour
- status (200, 403, 404…)
- length (filesize)
- url => substrings: url_0 … url_9 filename extension
- referer (I do not want to analyze it, since Mirrors is not a user-interaction site)
- ua => ua_0 (Mozilla, Ubuntu…), ua (full)
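A minimal Python sketch of this kind of preprocessing is shown below. It assumes nginx's default "combined" log format, which the slides do not confirm; the regular expression, the function name `preprocess`, the sample log line and the exact rules for `url_N`, `filename`, `extension` and `ua_0` are illustrative guesses, and only a subset of the listed columns is produced.

```python
import ipaddress
import re
from datetime import datetime

# Assumed layout: nginx "combined" format.
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<length>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def preprocess(line):
    """Split one log line into integer and substring columns, or return None."""
    m = LINE.match(line)
    if m is None:
        return None
    try:
        addr = ipaddress.ip_address(m["ip"])
    except ValueError:
        return None
    row = {"status": int(m["status"]), "length": int(m["length"]), "ua": m["ua"]}
    row["ipv4"] = int(addr) if addr.version == 4 else None
    if addr.version == 4:
        for i, octet in enumerate(addr.packed):   # ipv4_0 .. ipv4_3
            row[f"ipv4_{i}"] = octet
    t = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    row.update(time=int(t.timestamp()), year=t.year,
               yearday=t.timetuple().tm_yday, hour=t.hour,
               daymin=t.hour * 60 + t.minute)
    segments = [s for s in m["url"].split("?")[0].split("/") if s]
    for i, segment in enumerate(segments[:10]):   # url_0 .. url_9
        row[f"url_{i}"] = segment
    filename = segments[-1] if segments else ""
    row["filename"] = filename
    row["extension"] = filename.rsplit(".", 1)[1] if "." in filename else ""
    row["ua_0"] = re.split(r"[/ ]", m["ua"], maxsplit=1)[0]
    return row


sample = ('202.38.64.1 - - [22/May/2012:06:25:37 +0800] '
          '"GET /centos/6.2/os/x86_64/repodata/repomd.xml HTTP/1.1" '
          '200 3700 "-" "urlgrabber/3.9.1 yum/3.2.29"')
row = preprocess(sample)
```

Storing the pieces as separate integer columns lets later GROUP BY queries work on small numeric columns instead of re-parsing strings for every row.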
.. Preprocessing
- PHP is slow: only 3 MB/s.
- At first preg_match (regular expressions) took 85% of the time; after I optimized the regexp it takes only 25%.
- Xdebug shows that stream_get_line (fgets) and fputs take about 50% of the total time.
- InfoBright's data loading speed is 15 MB/s (crawl_http.log). I don't think PHP's preprocessing work is harder than the database's…
- Maybe PHP's interpreted nature makes it much slower than C. Can anyone give a benchmark for Python etc.?
.. Files Table
- Many files on mirrors are never accessed, so we have to make a full list of the files on mirrors.
- The preprocessed url, filename, extension and filesize are recorded for each file.
- When processing logs and files, take care of escape characters and the VARCHAR max length, and do not restrict filenames to a simple regular expression, since there are always exceptions: UTF-8 strings in malicious requests, ",v" files in CVS …
.. GNUPLOT
- GNUPLOT is pretty flexible in plotting; I consulted the documentation and online demos many times.
- However, GNUPLOT is not flexible in processing data. It is strongly typed: integers and strings need to be converted explicitly. And the integer type is limited to the range -2^31 to 2^31 - 1.
- When a query result needs to be postprocessed, I write a simple sed s/regexp/replace/ for simple replacement, or awk NR%n==0 for sampling. When it comes to accumulation or more complex stuff, a PHP script goes in between.
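The two postprocessing steps mentioned above, sampling every n-th row and accumulating values for the cumulative curves, might look like this in Python. The author used awk and PHP; the function names here are hypothetical, and `cumulative_ratio` assumes a non-empty list with a non-zero total.

```python
from itertools import accumulate


def sample_every(rows, n):
    """Keep every n-th row, like awk 'NR % n == 0' (NR counts from 1)."""
    return [row for i, row in enumerate(rows, start=1) if i % n == 0]


def cumulative_ratio(values):
    """Running share of the total, as plotted in the 'cumulative' graphs."""
    total = sum(values)
    return [running / total for running in accumulate(values)]


sampled = sample_every(list(range(1, 11)), 3)
shares = cumulative_ratio([5, 3, 2])
```

Sampling after accumulation (not before) keeps the plotted curve on the true cumulative values, which is why the two steps are separate.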
.. Discovering Sessions
- Sequentially scan the log. If a request fits into an existing session, update that session; otherwise create a new one and flush the timed-out session, if it exists.
- Garbage collection of timed-out sessions: if (count(array) > gc_limit) { unset timeout sessions; gc_limit = count(array) * 1.5; }
  - Inspired by the JavaScript GC algorithm in IE7: if recycled memory is less than 15% of the total, the limit is doubled; if recycled memory is more than 85%, the limit is reset to its initial value.
- Maintain a query buffer of 10000 rows to reduce the number of queries.
- The PHP script takes 4 hours, 15 times slower than C (see the following section).
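The garbage-collection rule can be sketched as follows. This is a Python illustration of the pseudocode above, not the author's PHP; the class name `SessionTable` and its fields are hypothetical, and keeping the limit from dropping below its initial value is an added safeguard that the slide does not mention.

```python
class SessionTable:
    """Open Gap Sessions per IP, with threshold-triggered garbage collection."""

    def __init__(self, timeout, gc_limit=1000):
        self.timeout = timeout
        self.initial_limit = gc_limit
        self.gc_limit = gc_limit
        self.sessions = {}   # ip -> [start, last, request_count]
        self.flushed = []    # finished sessions, ready to be written out

    def add(self, ip, t):
        s = self.sessions.get(ip)
        if s is not None and t - s[1] <= self.timeout:
            s[1] = t
            s[2] += 1
        else:
            if s is not None:
                self.flushed.append((ip, *s))
            self.sessions[ip] = [t, t, 1]
        if len(self.sessions) > self.gc_limit:
            self.collect(t)

    def collect(self, now):
        """Flush every timed-out session, then raise the limit to 1.5x the rest."""
        expired = [ip for ip, s in self.sessions.items()
                   if now - s[1] > self.timeout]
        for ip in expired:
            self.flushed.append((ip, *self.sessions.pop(ip)))
        # Safeguard (assumption): never go below the initial limit.
        self.gc_limit = max(self.initial_limit, int(len(self.sessions) * 1.5))


table = SessionTable(timeout=10, gc_limit=2)
for ip, t in [("a", 0), ("b", 1), ("c", 100)]:
    table.add(ip, t)
```

Scaling the limit with the number of live sessions means a full scan of the table happens only after the table has grown by half, so the scan cost is amortized over many requests.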
.. Some Bugs Due to DIE Time Functions (PHP)
- PHP: I found some sessions with negative session length. The only explanation is that the logs are not in increasing time order.
- Diving into the log table, I found that one timestamp matches records on both April 31 and May 1.
- In fact, PHP's mktime() takes broken-down time fields much like mktime(struct tm *) in C, but its month runs from 1 to 12 instead of 0 to 11. strptime(), however, is a thin wrapper around its C counterpart.
- I dare not use DATETIME in MySQL, because arithmetic on such timestamps is tricky, and I would rather implement it in SQL or PHP.
- Thank goodness, no timezone problem this time.
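The off-by-one can be reproduced in Python with `calendar.timegm`, which, like mktime(), silently normalizes an out-of-range day. This only illustrates the effect; the function names are hypothetical and the author's actual code was PHP.

```python
import calendar


def buggy_timestamp(year, tm_mon, day):
    """Pass a 0-based month (C's tm_mon) where a 1-based month is expected."""
    return calendar.timegm((year, tm_mon, day, 0, 0, 0))


def correct_timestamp(year, tm_mon, day):
    """Convert the 0-based month to 1-based before building the timestamp."""
    return calendar.timegm((year, tm_mon + 1, day, 0, 0, 0))


# May 31 has tm_mon == 4 and June 1 has tm_mon == 5.
may_31_buggy = buggy_timestamp(2012, 4, 31)    # read as "April 31" -> May 1
june_1_buggy = buggy_timestamp(2012, 5, 1)     # read as May 1
```

Both buggy calls return the same timestamp, so records from two different days collapse onto one, which is exactly how requests can appear out of time order.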
.. Some Bugs Due to DIE Time Functions (C)
- C: malloc() fell into a deadlock. So strange: I spent an evening in GDB and found no answer. After changing some code, I accidentally got a segmentation fault in time().
- In fact, time() takes a parameter of type time_t * and it is dynamically linked. I called it without any parameter, so time() treated the garbage on the stack as its parameter. If that happens to be NULL, nothing happens; if not, the address is treated as a time_t to write to, and unpredictable things happen.
.. Query Optimization
select count(*), ipv4, count(*) as c, sum(length) as s,
       concat(ipv4_0,'.',ipv4_1,'.',ipv4_2,'.',ipv4_3)
from log where ipv4 is not null
group by ipv4 order by s limit 40;
Query_time: 798.179622
2012-07-30 16:32:12 Cnd(0): VC:0(t0a0) IS NOT NULL (0)
2012-07-30 16:33:02 Aggregating: 318575688 tuples left.
2012-07-30 16:45:29 Aggregated (1415554 gr). Omitted packrows: 0 + 0 partially, out of 5020 total.
2012-07-30 16:45:29 Heap Sort initialized for 1415554 rows, 8+61 bytes each.
2012-07-30 16:45:30 Total data packs actually loaded (approx.): 35139
Infobright is column-based, so cross-column queries are slow.
Try to generate IP string from ‘ipv4’ field.
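Generating the dotted string from the single integer column is plain bit arithmetic. The Python sketch below mirrors that idea with shifts and masks; the function name `ipv4_to_string` is hypothetical.

```python
import ipaddress


def ipv4_to_string(ipv4):
    """Format a 32-bit integer as a dotted quad using shifts and masks."""
    return ".".join(str((ipv4 >> shift) & 255) for shift in (24, 16, 8, 0))


value = (202 << 24) | (38 << 16) | (64 << 8) | 1
dotted = ipv4_to_string(value)
```

Each octet is obtained by shifting the wanted byte down to the lowest position and masking with 255, so only one column has to be read per row.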
.. Query Optimization (continued)
SELECT ipv4, COUNT(*)/$COUNT, SUM(length)/$LENGTH AS c,
       CONCAT((ipv4 & 255<<24)>>24, '.', (ipv4 & 255<<16)>>16, '.',
              (ipv4 & 255<<8)>>8, '.', (ipv4 & 255))
FROM log WHERE ipv4 IS NOT NULL
GROUP BY ipv4 ORDER BY c DESC LIMIT 40;
Query_time: 585.069205
InfoBright packs column data into packages and pre-computes MAX, MIN, AVG and GROUP data for each package. A WHERE clause requires re-computation of these data:
2012-07-30 14:33:06 Cnd(0): VC:0(t0a0) IS NOT NULL (0)
2012-07-30 14:33:55 Aggregating: 318575688 tuples left.
2012-07-30 14:42:47 Generating output.
.. Query Optimization (continued)
SELECT ipv4, COUNT(*)/$COUNT, SUM(length)/$LENGTH AS c,
       CONCAT((ipv4 & 255<<24)>>24, '.', (ipv4 & 255<<16)>>16, '.',
              (ipv4 & 255<<8)>>8, '.', (ipv4 & 255))
FROM log
GROUP BY ipv4 ORDER BY c DESC LIMIT 40;
Query_time: 328.580979
The WHERE clause is removed, but the result is wrong (it includes a large NULL group, which stands for IPv6).
2012-07-30 14:45:41 Unoptimized expression near '/'
2012-07-30 14:45:41 Unoptimized expression near '/'
2012-07-30 14:45:41 Unoptimized expression near 'concat'
2012-07-30 14:45:41 Aggregating: 328976877 tuples left.
InfoBright calculates these columns for each tuple, so no wonder it is slow. Keep the core subquery small; moving these calculations outside may help.
.. Query Optimization (continued)
SELECT ipv4, c/$COUNT, s/$LENGTH,
       CONCAT((ipv4 & 255<<24)>>24, '.', (ipv4 & 255<<16)>>16, '.',
              (ipv4 & 255<<8)>>8, '.', (ipv4 & 255))
FROM (SELECT ipv4, COUNT(*) AS c, SUM(length) AS s
      FROM log GROUP BY ipv4
      HAVING (ipv4 IS NOT NULL)
      ORDER BY s DESC LIMIT 40) AS t;
Query_time: 138.297943
Total data packs actually loaded (approx.): 10040
- Use a HAVING clause to filter NULL. The internal execution order is: FROM TABLE, JOIN, OUTER JOIN, WHERE, SELECT clause, GROUP BY, HAVING, ORDER BY, LIMIT, projection & output.
- Turn on MySQL's query_cache: identical queries should not be re-executed when I change some GNUPLOT command and run the script again.
- Is there any way to make the query faster? I don't know.
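The shape of the final query (aggregate in a small inner query, filter NULL with HAVING, format only the surviving rows outside) can be tried on a toy table with Python's built-in sqlite3 module. SQLite is only a stand-in for InfoBright here: it shows that the rewritten query returns the intended rows, and says nothing about InfoBright's timings. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ipv4 INTEGER, length INTEGER)")
conn.executemany("INSERT INTO log VALUES (?, ?)", [
    (0x01020304, 100),   # 1.2.3.4
    (0x01020304, 300),
    (0x0A000001, 50),    # 10.0.0.1
    (None, 999),         # an IPv6 request: ipv4 is NULL
])

# Inner query: plain aggregation. Outer query: formatting for 40 rows at most.
QUERY = """
SELECT ((ipv4 >> 24) & 255) || '.' || ((ipv4 >> 16) & 255) || '.' ||
       ((ipv4 >> 8) & 255) || '.' || (ipv4 & 255) AS ip,
       c, s
FROM (SELECT ipv4, COUNT(*) AS c, SUM(length) AS s
      FROM log
      GROUP BY ipv4
      HAVING ipv4 IS NOT NULL
      ORDER BY s DESC
      LIMIT 40) AS t
ORDER BY s DESC
"""
rows = conn.execute(QUERY).fetchall()
```

The NULL group is dropped after grouping, and the string concatenation runs once per output row instead of once per log row.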
.. Query Optimization (continued)
Another example of moving expressions 'outside' the inner query:
SELECT ipv4_0, c/$COUNT, s/$LENGTH
FROM (SELECT IFNULL(ipv4_0, 'IPv6') AS ipv4_0, COUNT(*) AS c, SUM(length) AS s
      FROM log GROUP BY ipv4_0 ORDER BY s DESC LIMIT 40) AS t;
Query_time: 263.048662
SELECT IFNULL(ipv4_0, 'IPv6'), c/$COUNT, s/$LENGTH
FROM (SELECT ipv4_0, COUNT(*) AS c, SUM(length) AS s
      FROM log GROUP BY ipv4_0 ORDER BY s DESC LIMIT 40) AS t;
Query_time: 84.589837
I could not believe my eyes when I first saw these figures.
.. Query Optimization is Not Everything
SELECT url_0, COUNT(*)/$COUNT, SUM(files.filesize)/$LENGTH AS c
FROM files
WHERE NOT EXISTS (SELECT * FROM log WHERE log.filename = files.filename)
GROUP BY url_0 ORDER BY c DESC LIMIT 40;
- The query ran for 12 hours and I had to kill it. Infobright treats the NOT EXISTS clause as a dependent subquery and might have to execute it for each row.
- I don't know how to use cursors and stored procedures in MySQL, and more importantly Infobright does not support dynamic updates of tables, so SELECT INTO is impossible.
- I wrote a C program (using the mysqlclient API) to count the occurrences of each url in log and write them into another table.
Performance:
- 10M rows in files table, 329M rows in log table
- 17.4 minutes in total
- 440000 rows per second at Probe phase
- 460 MB of memory usage (360 MB resident)
.. Simulating Hash JOIN
I found out after I finished programming that my algorithm is actually called Hash JOIN. It is a fast JOIN algorithm for RDBMS. The task is to find, for each distinct value of the join attribute, the set of tuples in each relation which have that value. (Wikipedia)
My work is to count the occurrences in the log table of each url in the files table.
Build step: Traverse the files table and add the url field of each row to a hash table.
Probe step: Traverse the log table and increment the counter of the corresponding hash slot.
Output step: Traverse the files table again and output the corresponding counters. The output is piped through a FIFO to LOAD DATA INFILE.
Easy to shard horizontally. Steps are similar to Map, Shuffleand Reduce.
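The three steps can be sketched in a few lines. This is only an illustration in Python with an ordinary dict, not the C program described above; the function name and the sample URLs are made up for the example.

```python
def count_requests(file_urls, log_urls):
    """Count how often each url from the files table appears in the log.

    file_urls: list of urls from the files table (traversed twice).
    log_urls:  iterable of urls from the log table (traversed once).
    """
    # Build step: one counter per url in the files table.
    counters = {url: 0 for url in file_urls}

    # Probe step: look up each log row and increment the matching counter.
    # Log rows whose url is not in the files table are ignored.
    for url in log_urls:
        if url in counters:
            counters[url] += 1

    # Output step: traverse the files table again, in its original order.
    return [(url, counters[url]) for url in file_urls]


if __name__ == "__main__":
    files = ["/a.iso", "/b.rpm", "/c.deb"]
    log = ["/a.iso", "/c.deb", "/a.iso", "/missing"]
    for url, hits in count_requests(files, log):
        print(url, hits)
```

Files whose counter stays at zero are exactly the rows the NOT EXISTS query was looking for, and each table is read a fixed number of times instead of once per row of the other.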
.. Simulating Hash JOIN (continued)
Since there might be many files (currently only about 10M), storing the hash table entirely in memory is my top concern.
Use one hash for position, two additional hashes for checking, and do not store the original string. There is a 10^-22 probability of an undetected hash collision, which is low enough for an analytic program.
Only 12 bytes are required for each slot (2 * sizeof(int) for the check hashes, sizeof(int) for the counter).
Hash algorithm: DJBX33A (hash = ((hash << 5) + hash) + *key++), used by PHP, Apache and more. Use three seeds for the position and check hashes.
Hash collision: Linearly find the next empty slot. This is simple (I’m afraid of pointers) and memory-saving.
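A rough sketch of such a table, in Python rather than the original C. The seed values, the class and method names, and the use of a negative counter to mark an empty slot are all assumptions for illustration; the slides do not say how these were chosen.

```python
from array import array

MASK32 = 0xFFFFFFFF
EMPTY = -1  # assumption: a negative counter marks an unused slot


def djbx33a(key: bytes, seed: int) -> int:
    """DJBX33A: hash = ((hash << 5) + hash) + byte, truncated to 32 bits."""
    h = seed & MASK32
    for byte in key:
        h = ((h << 5) + h + byte) & MASK32
    return h


class CompactCounter:
    """Linear-probing table storing two check hashes and a counter per slot."""

    # Hypothetical seeds: position hash, check hash 1, check hash 2.
    SEEDS = (5381, 0x9E3779B9, 0x85EBCA6B)

    def __init__(self, slots: int):
        self.slots = slots
        # "I" and "i" are 4 bytes on common platforms: 12 bytes per slot.
        self.check1 = array("I", [0] * slots)
        self.check2 = array("I", [0] * slots)
        self.count = array("i", [EMPTY] * slots)

    def _find(self, key: bytes):
        """Return (slot, check1, check2); slot is None if the table is full."""
        pos, c1, c2 = (djbx33a(key, seed) for seed in self.SEEDS)
        i = pos % self.slots
        for _ in range(self.slots):
            if self.count[i] == EMPTY:
                return i, c1, c2
            if self.check1[i] == c1 and self.check2[i] == c2:
                return i, c1, c2
            i = (i + 1) % self.slots  # collision: move on to the next slot
        return None, c1, c2

    def add(self, key: bytes) -> None:
        """Build step: reserve a slot with a zero counter."""
        i, c1, c2 = self._find(key)
        if i is None:
            raise MemoryError("hash table is full")
        if self.count[i] == EMPTY:
            self.check1[i], self.check2[i], self.count[i] = c1, c2, 0

    def hit(self, key: bytes) -> None:
        """Probe step: increment the counter if the key was added."""
        i, _, _ = self._find(key)
        if i is not None and self.count[i] != EMPTY:
            self.count[i] += 1

    def get(self, key: bytes) -> int:
        """Output step: counter for the key, or 0 if it was never added."""
        i, _, _ = self._find(key)
        return 0 if i is None else max(self.count[i], 0)
```

One point worth checking in a real implementation: with DJBX33A the seed only contributes seed * 33^len to the result, so two keys of equal length that collide under one seed collide under every seed (for example "Ez" and "FY"). The check hashes are then not independent for same-length urls, and the collision estimate above may be optimistic. Using a different multiplier or a different hash function for each check would be one way around this.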
.. The End
I intended to finish it in 3 days, but because I was unfamiliar with these tools, it took me a week (or more, counting preparation time).
Access logs are the source of discoveries about access patterns. Data is precious, while disk is cheap. Please do not delete them after 52 days.
I cannot draw conclusions on ‘trends’, because the data does not cover a long enough period.
If you want other statistics, please email me and I will run the query (if I have time :).
.. Final Words
Thanks to all maintainers and supporters of mirrors.ustc.edu.cn!
All scripts and the source of these slides are available on GitHub: https://github.com/bojieli/mirrors-log
I finally got the Chinese font problem sorted out, but I am too lazy to translate the slides...