Visualizing Mirrors - USTC
Page 1: Visualizing Mirrors - USTC


Visualizing Mirrors
mirrors.ustc.edu.cn Server Log Analysis

李博杰 [email protected]

©USTC LUG

August 14, 2012

李博杰 [email protected] Visualizing Mirrors

Page 2: Visualizing Mirrors - USTC

Outline

1 Requests & Traffic
    By Time
    By IP
    By Other Measures
2 Files
    Files Characteristics
    How Files Are Requested
3 Sessions
4 Distributions Insight
    CentOS
    Fedora
    Ubuntu
    Eclipse
5 Technical Details
6 Query Optimization

Page 3: Visualizing Mirrors - USTC

Notes

- The data is the access log of mirrors.ustc.edu.cn over 51 days. See the ‘Technical Details’ section for more information about the dataset.
- Some graphs are in log-scale for clarity. Please note whether the x axis, the y axis, or both are in log-scale. The graph title sometimes lies.
- Because a graph may contain many points, the data is sampled to reduce file size (the graphs are vector graphics), hence there may be some ‘straight lines’. I have checked the data to make sure the graphs illustrate real trends.
- Title length is limited, so a title may not explain the graph well; please keep an eye on the axes and keys of the graph.
- Graphs are shown in the hope of conveying information without words. If you have any questions or suggestions, please email me.

Page 4: Visualizing Mirrors - USTC

Requests & Traffic in a day

[Figure: Requests & Traffic within a day. x-axis: time of the day (00:00 to 00:00); y-axes: requests (0 to 400000) and traffic in bytes (0 to 5e+10). Series: request count (Bezier smoothed), traffic (Bezier smoothed).]

Page 5: Visualizing Mirrors - USTC

Requests & Traffic in a week

[Figure: Requests & Traffic in different weekdays. x-axis: weekday, Monday to Sunday (labeled ‘Time of the day’ in the original); y-axes: requests (0 to 7e+07) and traffic in bytes (0 to 8e+12). Series: request count, traffic.]

Page 6: Visualizing Mirrors - USTC

Requests & Traffic across 50 days

[Figure: Requests & Traffic in 50 days. x-axis: date, 05-20 to 07-15 (labeled ‘Time of the day’ in the original); y-axes: requests (0 to 1.2e+07) and traffic in bytes (0 to 1.2e+12). Series: request count, traffic.]

Page 7: Visualizing Mirrors - USTC

Statistics

                   Requests     Traffic
Total              328976877    36892 GB
Avg. per Day       6450527      723.4 GB
Max. per Day       8632963      1049.5 GB
Min. per Day       4868022      421.5 GB
Avg. per Hour      268771       30.14 GB
Max. per Hour      561925       79.75 GB
Min. per Hour      99506        2.97 GB
Avg. per Minute    4480         514.4 MB
Max. per Minute    14714        N/A
Min. per Minute    441          N/A
Avg. per Second    74.66        8779 KB
Max. per Second    2117         N/A
Min. per Second    1            N/A

Because the log records only the completion time of each request, and large requests can span hours, Max./Min. traffic per minute/second is not applicable.
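The averages in the Statistics table follow from the totals and the 51-day span stated in the Notes. A minimal sketch of that arithmetic is shown here; it assumes binary units (1 GB = 1024 MB, 1 MB = 1024 KB), which is an assumption on my part that happens to reproduce the reported figures. The constant and function names are illustrative only.

```python
# Reproduce the per-day/hour/minute/second averages from the totals.
# Assumptions: the log spans exactly 51 days; units are binary (1 GB = 1024 MB).
TOTAL_REQUESTS = 328976877
TOTAL_TRAFFIC_GB = 36892
DAYS = 51


def averages(total, days=DAYS):
    """Return (per_day, per_hour, per_minute, per_second) averages of `total`."""
    per_day = total / days
    per_hour = per_day / 24
    per_minute = per_hour / 60
    per_second = per_minute / 60
    return per_day, per_hour, per_minute, per_second


req_day, req_hour, req_min, req_sec = averages(TOTAL_REQUESTS)
gb_day, gb_hour, gb_min, gb_sec = averages(TOTAL_TRAFFIC_GB)
mb_min = gb_min * 1024           # GB per minute -> MB per minute
kb_sec = gb_sec * 1024 * 1024    # GB per second -> KB per second

print(f"requests: {req_day:.0f}/day {req_hour:.0f}/hour {req_min:.0f}/min {req_sec:.2f}/s")
print(f"traffic:  {gb_day:.1f} GB/day {gb_hour:.2f} GB/hour {mb_min:.1f} MB/min {kb_sec:.0f} KB/s")
```

The Max./Min. rows cannot be derived this way; they need the per-interval counts from the log itself.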

Page 8: Visualizing Mirrors - USTC

Cumulative Requests per Hour

[Figure: Sorted Requests per Hour. x-axis: hour percentage, sorted by request count (0 to 100); y-axis: requests (50000 to 600000).]

Page 9: Visualizing Mirrors - USTC

Cumulative Requests per Minute

[Figure: Sorted Requests per Minute. x-axis: minutes percentage, sorted by request count (0 to 100); y-axis: requests (0 to 16000).]

Page 10: Visualizing Mirrors - USTC

Cumulative Requests per Second

[Figure: Sorted Requests per Second. x-axis: seconds percentage, sorted by request count (0 to 100); y-axis: requests (0 to 450).]

Page 11: Visualizing Mirrors - USTC

Cumulative Traffic over IPs: 20%-80% law

[Figure: Cumulative Traffic over unique IPs. x-axis: unique IPs sorted by traffic DESC, log-scale, 1 to 1e+07 (labeled ‘Percentage of unique IP’ in the original); y-axis: cumulative traffic (0 to 4e+13).]

Page 12: Visualizing Mirrors - USTC

Cumulative Requests over IPs: 20%-80% law

[Figure: Cumulative Requests over unique IPs. x-axis: unique IPs sorted by request num DESC, log-scale, 1 to 1e+07 (labeled ‘Percentage of unique IP’ in the original); y-axis: cumulative requests (0 to 3.5e+08).]
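The two ‘20%-80% law’ curves are cumulative sums over clients sorted in descending order. A minimal sketch of how such a curve and the share of the top 20% of IPs could be computed is given here; the function names and the toy data are hypothetical and are not taken from the real log.

```python
from itertools import accumulate


def top_share(volume_by_ip, top_fraction=0.2):
    """Fraction of the total volume generated by the top `top_fraction` of IPs."""
    volumes = sorted(volume_by_ip.values(), reverse=True)
    if not volumes:
        return 0.0
    k = max(1, round(len(volumes) * top_fraction))
    return sum(volumes[:k]) / sum(volumes)


def cumulative_curve(volume_by_ip):
    """Running total after each IP, with IPs sorted by volume DESC (the plotted curve)."""
    return list(accumulate(sorted(volume_by_ip.values(), reverse=True)))


# Hypothetical toy data: bytes (or request counts) per client address.
sample = {"10.0.0.1": 800, "10.0.0.2": 90, "10.0.0.3": 50, "10.0.0.4": 40, "10.0.0.5": 20}
print(top_share(sample))
print(cumulative_curve(sample))
```

The same functions work for either metric: pass traffic per IP for the traffic curve, or request counts per IP for the requests curve.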

Page 13: Visualizing Mirrors - USTC

IPv4 vs. IPv6

        Requests              Traffic
IPv4    318575688 (96.84%)    34180 GB (92.65%)
IPv6    10401189 (3.16%)      2712 GB (7.35%)

It can be seen that IPv6 still has a long way to go…

Page 14: Visualizing Mirrors - USTC

Requests & Traffic TOP 40: xxx.0.0.0/8

[Figure: Request & Traffic among IPv4 first fields. y-axis: percentage (0 to 0.08). Series: Requests, Traffic. x-axis categories (as extracted): IPv6, 222, 202, 113, 218, 58, 114, 61, 183, 180, 121, 124, 116, 219, 210, 119, 122, 59, 221, 125, 123, 60, 115, 117, 118, 220, 211, 203, 112, 111, 14, 182, 110, 27, 1, 120, 175, 101, 159, 223.]

Page 15: Visualizing Mirrors - USTC

Traffic TOP 40: IPv4 addrs

[Figure: Request & Traffic among popular IPv4 addrs. y-axis: percentage (0 to 0.005). Series: Requests, Traffic. x-axis categories: 219.133.0.1, 218.242.250.212, 203.114.244.88, 114.212.189.93, 114.113.226.53, 180.169.73.90, 180.96.19.25, 66.197.225.53, 218.3.125.243, 61.234.123.57, 159.226.126.177, 203.198.202.225, 124.74.45.130, 220.181.145.27, 114.213.255.162, 202.108.130.138, 202.119.45.31, 222.66.23.57, 124.127.250.34, 208.53.156.36, 221.216.135.54, 116.228.240.198, 113.108.76.195, 114.80.133.7, 210.13.71.73, 180.149.134.10, 124.126.245.14, 218.94.63.55, 220.248.0.145, 112.65.134.2, 222.56.17.109, 124.74.78.2, 124.207.104.18, 58.211.218.74, 1.202.225.132, 116.226.65.12, 220.248.0.154, 222.94.140.45, 210.73.5.33, 116.247.98.50.]

Page 16: Visualizing Mirrors - USTC

Request count TOP 40: IPv4 addrs

[Figure: Request & Traffic among popular IPv4 addrs. y-axis: percentage (0 to 0.03). Series: Requests, Traffic. x-axis categories: 63.245.214.78, 209.132.181.102, 113.111.38.40, 129.143.116.10, 211.86.56.227, 223.5.20.10, 49.123.105.219, 159.226.20.217, 60.208.111.199, 119.97.142.81, 202.38.95.60, 203.244.218.6, 221.219.75.222, 202.104.151.152, 121.49.96.70, 59.77.33.100, 182.89.199.227, 183.45.54.83, 218.13.224.81, 59.37.44.133, 59.44.42.194, 222.134.53.246, 210.21.243.170, 61.130.247.168, 116.228.202.66, 123.185.172.126, 219.134.89.202, 114.113.29.21, 183.31.242.39, 58.19.126.37, 27.17.19.75, 203.114.244.88, 220.178.52.108, 124.42.77.160, 113.111.40.89, 218.94.63.55, 180.153.97.82, 222.171.60.177, 210.34.196.99, 222.92.29.130.]

Page 17: Visualizing Mirrors - USTC

USTC Mirrors Usage (IPv4 only)

IP range                          Requests           Traffic             Note
202.38.64.0-202.38.95.255         994661 (0.30%)     149.3 GB (0.40%)    CERNET
210.45.64.0-210.45.79.255         191141 (0.06%)     68.31 GB (0.19%)    CERNET
210.45.112.0-210.45.127.255       243976 (0.07%)     37.59 GB (0.10%)    CERNET
211.86.144.0-211.86.159.255       81035 (0.02%)      24.96 GB (0.07%)    CERNET
222.195.64.0-222.195.95.255       319435 (0.10%)     86.88 GB (0.24%)    CERNET
114.214.160.0-114.214.255.255     0                  0                   CERNET
210.72.22.0-210.72.22.255         3622 (0.00%)       11.86 MB (0.00%)    TechNet (?)
218.22.21.0-218.22.21.31          1 (0.00%)          0.01 MB (0.00%)     China Telecom
218.104.71.160-218.104.71.175     0                  0                   China Unicom
202.141.160.0-202.141.175.255     123455 (0.04%)     12.60 GB (0.03%)    China Telecom
202.141.176.0-202.141.191.255     187 (0.00%)        120.4 MB (0.00%)    China Mobile
Total                             1957513 (0.60%)    379.77 GB (1.03%)   USTC IPv4

Data source of USTC IP range: http://lib.ustc.edu.cn/ustcip.html
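Attributing a request to USTC means testing whether its client address falls inside one of the ranges in the table. A minimal sketch using Python's standard `ipaddress` module is shown here; the ranges are copied from the table, while the names `USTC_RANGES`, `to_networks` and `is_ustc` are illustrative and not part of the original analysis.

```python
import ipaddress

# USTC IPv4 ranges (first address, last address), copied from the table.
USTC_RANGES = [
    ("202.38.64.0", "202.38.95.255"),
    ("210.45.64.0", "210.45.79.255"),
    ("210.45.112.0", "210.45.127.255"),
    ("211.86.144.0", "211.86.159.255"),
    ("222.195.64.0", "222.195.95.255"),
    ("114.214.160.0", "114.214.255.255"),
    ("210.72.22.0", "210.72.22.255"),
    ("218.22.21.0", "218.22.21.31"),
    ("218.104.71.160", "218.104.71.175"),
    ("202.141.160.0", "202.141.175.255"),
    ("202.141.176.0", "202.141.191.255"),
]


def to_networks(ranges):
    """Convert inclusive address ranges into a list of CIDR networks."""
    networks = []
    for first, last in ranges:
        networks.extend(ipaddress.summarize_address_range(
            ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)))
    return networks


USTC_NETWORKS = to_networks(USTC_RANGES)


def is_ustc(address):
    """True if `address` is an IPv4 address inside one of the USTC ranges."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in net for net in USTC_NETWORKS)


print(is_ustc("202.38.95.60"), is_ustc("8.8.8.8"))
```

A linear scan over a dozen networks is fine for a sketch; for hundreds of millions of log lines, converting addresses to integers and comparing against sorted range bounds would likely be faster.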

Page 18: Visualizing Mirrors - USTC

Requests & Traffic of distributions

[Figure: Request & Traffic of Distributions. y-axis: percentage (0 to 0.35). Series: Requests, Traffic. x-axis categories (as extracted): eclipse, fedora, ubuntu, centos, debian, tdf, cygwin, CTAN, archlinux, mozilla-current, opensuse, gentoo, kde-applicationdata, kde, backtrack, epel, gnu, CRAN, NULL, freebsd, ubuntu-releases, debian-security, scientificlinux, linux-kernel, debian-backports, kdemod, debian-cd, meego, slackware, deepin, sourceware.org, CPAN, linuxmint, linux-2.6.git, puppy, linux.git, debian-multimedia, 4, qomo, 3.7.]

Page 19: Visualizing Mirrors - USTC

Requests & Traffic of distributions

[Figure: Request & Traffic of Distributions. y-axis: percentage (0 to 0.35). Series: Requests, Traffic. x-axis categories (as extracted): centos, NULL, eclipse, fedora, ubuntu, ubuntu-releases, mozilla-current, tdf, debian, CTAN, backtrack, opensuse, gentoo, deepin-cd, kde-applicationdata, debian-cd, cygwin, Ubuntu, linuxmint, archlinux, kde, gnu, linuxmint-cd, CRAN, debian-multimedia, qomo, debian-security, epel, puppy, scientificlinux, debian-backports, freebsd, deepin, CPAN, kdemod, turnkeylinux, slackware, debian-uo, gentoo-portage, linux-kernel.]

Page 20: Visualizing Mirrors - USTC

Requests & Traffic among HTTP Status Codes

[Figure: Request & Traffic among HTTP status codes. x-axis: HTTP status code (200, 206, 301, 304, 400, 403, 404, 405, 408, 416, 499, 500, 502); y-axis: percentage (0 to 1). Series: Requests, Traffic.]

Page 21: Visualizing Mirrors - USTC

Requests & Traffic among User Agents

[Figure: Request & Traffic of User Agents. x-axis: user agent (only the largest 30 are shown); y-axis: percentage (0 to 0.35). Series: Requests, Traffic. x-axis categories (as extracted): Mozilla, urlgrabber, Debian, Wget, Jakarta, NULL, Fedora, Preupgrade, Ubuntu, pacman, libwww, texlive, Cygwin, jigdo, Opera, ZYpp, Axel, CentOS, MPM, FDM, lftp, NSIS, Java, anaconda, curl, Eclipse, aria2, Python, BTWebClient, R.]
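The category names in the user agent chart (Mozilla, Wget, libwww, aria2, ...) look like the leading alphanumeric token of each User-Agent string. The slides do not say how the grouping was done, so the following is only a sketch under that assumption; `ua_family` and `tally` are hypothetical names, and the sample strings are made up.

```python
import re
from collections import Counter


def ua_family(user_agent):
    """Leading alphanumeric token of a User-Agent string, or 'NULL' if it is empty."""
    ua = (user_agent or "").strip()
    if not ua or ua == "-":
        return "NULL"
    match = re.match(r"[A-Za-z0-9]+", ua)
    return match.group(0) if match else "NULL"


def tally(records):
    """records: iterable of (user_agent, bytes_sent). Returns (requests, traffic) Counters."""
    requests, traffic = Counter(), Counter()
    for user_agent, nbytes in records:
        family = ua_family(user_agent)
        requests[family] += 1
        traffic[family] += nbytes
    return requests, traffic


# Hypothetical sample records, not taken from the real log.
records = [("Wget/1.13.4", 100), ("Mozilla/5.0 (X11)", 50), ("", 5), ("Wget/1.12", 25)]
requests, traffic = tally(records)
print(requests.most_common(), traffic.most_common())
```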

Page 22: Visualizing Mirrors - USTC

Traffic per Request order by Length

[Figure: Request Num sorted by Request Length. x-axis: request num (5e+07 to 3.5e+08); y-axis: request length, log-scale (1 to 1e+10).]

Page 23: Visualizing Mirrors - USTC

Cumulative Traffic sorted by Request Length

[Figure: Cumulative Request Traffic sorted by Request Length. x-axis: request length, log-scale (100 to 1e+10); y-axis: cumulative traffic (0 to 4e+13).]

Page 24: Visualizing Mirrors - USTC

Request Num sorted by Request Length

[Figure: Request Num sorted by Request Length. x-axis: request length, log-scale (1 to 1e+09); y-axis: request num (0 to 1.2e+08).]

Page 25: Visualizing Mirrors - USTC

Outline

1 Requests & Traffic
    By Time
    By IP
    By Other Measures
2 Files
    Files Characteristics
    How Files Are Requested
3 Sessions
4 Distributions Insight
    CentOS
    Fedora
    Ubuntu
    Eclipse
5 Technical Details
6 Query Optimization

Page 26: Visualizing Mirrors - USTC

File Number at different FileSizes (log scale)

[Figure: Sorted Filenum by Filesize. x-axis: file size, log-scale (1 to 1e+10); y-axis: file num (0 to 25000).]

Page 27: Visualizing Mirrors - USTC

Total Size at different Filesizes (log scale)

[Figure: Total Size at different Filesizes (captioned ‘normal scale’ in the original). x-axis: file size (100 to 1e+10); y-axis: file size * file num (0 to 2e+10).]

Page 28: Visualizing Mirrors - USTC

Cumulative Filesize per File Num (normal scale)

[Figure: Accumulated FileSize by FileNum sorted by Size (normal scale). x-axis: file num (0 to 1.1e+07); y-axis: file size (0 to 1.4e+13).]

Page 29: Visualizing Mirrors - USTC

Cumulative Filesize per File Num (log scale)

[Figure: Accumulated FileSize by FileNum sorted by Size (log scale y). x-axis: file num (0 to 1.1e+07); y-axis: file size, log-scale (1e+07 to 1e+14).]

Page 30: Visualizing Mirrors - USTC

Cumulative Filesize per FileSize (normal scale)

[Figure: Accumulated Filesize by Filesize. x-axis: file size (0 to 7e+09); y-axis: accumulated file size (0 to 1.4e+13).]

Page 31: Visualizing Mirrors - USTC

Cumulative Filesize per FileSize (log scale)

[Figure: Accumulated Filesize by Filesize. x-axis: file size, log-scale (1 to 1e+10); y-axis: accumulated file size (0 to 1.4e+13).]

Page 32: Visualizing Mirrors - USTC

File Extensions TOP 40 (order by Total Size)

[Figure: FileNum and Filesize of popular file extensions. y-axis: percentage (0 to 0.4). Series: Total Num, Total Size. x-axis categories (as extracted; semicolon-separated): tbz; rpm; iso; deb; gz; bz2; xz; zip; drpm; img; tgz; mar; jar; png; ogg; exe; txz; dmg; tar; pet; sfs; run; 7z; udeb; lzma; pkg; ogv; pdf; xpi; tbz2; cb; m4v; msi; 1; log; 2; jpg; 92; 3; wz.]

Page 33: Visualizing Mirrors - USTC

File Extensions TOP 40 (order by File Num)

[Figure: FileNum and Filesize of popular file extensions. y-axis: percentage (0 to 0.4). Series: Total Num, Total Size. x-axis categories (as extracted; semicolon-separated): tbz; rpm; deb; gz; png; drpm; readme; bz2; jar; dsc; hdr; xz; zip; tgz; meta; c,v; sig; log; tfm; txt; ebuild; jpg; asc; xml; changes; udeb; md5; h,v; xpi; pdf; tex; patch; html; sign; in,v; news; ltx; sha1; txz; 0.]
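Categories such as ‘c,v’, ‘h,v’ and ‘0’ suggest that the extension is simply whatever follows the last dot of the file name. The slides do not state the exact rule, so this is a sketch under that assumption, with lowercasing added as a further assumption; the function names and sample paths are hypothetical.

```python
import posixpath
from collections import defaultdict


def extension(path):
    """Text after the last dot of the base name, lowercased; '' if there is none."""
    name = posixpath.basename(path)
    stem, dot, ext = name.rpartition(".")
    # Names without a dot, or hidden files such as '.htaccess', have no extension.
    return ext.lower() if dot and stem else ""


def by_extension(files):
    """files: iterable of (path, size_in_bytes). Returns {extension: [file_num, total_size]}."""
    stats = defaultdict(lambda: [0, 0])
    for path, size in files:
        entry = stats[extension(path)]
        entry[0] += 1
        entry[1] += size
    return dict(stats)


# Hypothetical sample listing, not taken from the real mirror.
listing = [("centos/a/x.rpm", 10), ("fedora/b/y.RPM", 5), ("freebsd/src/main.c,v", 1)]
print(by_extension(listing))
```

Dividing each count and size by the overall totals gives the two percentage series shown in the charts.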

Page 34: Visualizing Mirrors - USTC

How many files share a same filename (extension excluded)

[Figure: FileNum and FileSize of Filenames with Different Number of Shares. x-axis: number of files sharing a same file name, log-scale (1 to 1e+06); y-axis: percentage, log-scale (0.0001 to 1). Series: Total Num, Total Size.]

Page 35: Visualizing Mirrors - USTC

How many files share a same filename and extension

[Figure: FileNum and FileSize of Filenames with Different Number of Shares. x-axis: number of files sharing a same file name and extension, log-scale (1 to 1e+06); y-axis: percentage, log-scale (0.0001 to 1). Series: Total Num, Total Size.]

Page 36: Visualizing Mirrors - USTC

Num & Size of Files in each Distribution (sorted by Size)

[Figure: Num & Size of Files in each Distribution. y-axis: percentage (0 to 0.45). Series: Number, Size. x-axis categories (as extracted): freebsd, fedora, debian-cd, scientificlinux, debian, ubuntu, opensuse, meego, gnome, gentoo, backtrack, centos, epel, eclipse, slackware, ubuntu-releases, kde, linux-kernel, kde-applicationdata, linuxmint-cd, archlinux, mozilla-current, bin, deepin-cd, turnkeylinux, qomo, kdemod, CTAN, progress-linux, puppy, gnu, debian-backports, knoppix-dvd, debian-security, tdf, src, cygwin, loongson2f, deepin, linuxmint.]

Page 37: Visualizing Mirrors - USTC

Num & Size of Files in each Distribution (sorted by Num)

[Figure: Num & Size of Files in each Distribution. y-axis: percentage (0 to 0.45). Series: Number, Size. x-axis categories (as extracted): freebsd, fedora, kde-applicationdata, debian, ubuntu, opensuse, modules, eclipse, CTAN, authors, epel, gnome, gentoo-portage, slackware, backtrack, meego, scientificlinux, bin, centos, gentoo, archlinux, kde, web, src, qomo, debian-backports, macports, Xorg, linux-kernel, cygwin, gnu, mozilla-current, progress-linux, debian-security, debian-cd, kdemod, linuxmint, debian-multimedia, puppy, tdf.]

Page 38: Visualizing Mirrors - USTC

Cumulative Requests over FileSize order by Size

[Figure: Cumulative Requests Num per File order by Filesize. x-axis: labeled ‘FileSize’ (0 to 1.2e+07); y-axis: accumulated requests count (0 to 2e+08).]

Page 39: Visualizing Mirrors - USTC

Cumulative Traffic over non-cumu. FileSize

[Figure: Cumulative Traffic over non-cumulated Filesize. x-axis: labeled ‘FileSize’ (0 to 1.1e+07); y-axis: accumulated traffic (0 to 2.5e+13).]

Page 40: Visualizing Mirrors - USTC

Cumulative Traffic over non-cumu. Filesize (log-scale)

[Figure: Cumulative Traffic over non-cumulated FileSize. x-axis: labeled ‘FileSize’ (0 to 1.1e+07); y-axis: cumulated traffic, log-scale (1e+09 to 1e+14).]

Page 41: Visualizing Mirrors - USTC

Cache 1: Cumu. Traffic over FileSize order by Size

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size, order by file size (0 to 1.4e+13); y-axis: cumulative traffic (0 to 2.5e+13).]

Page 42: Visualizing Mirrors - USTC

Cache 2: Cumu. Traffic over FileSize order by Size DESC

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size, order by file size DESC (0 to 1.4e+13); y-axis: cumulative traffic (0 to 2.5e+13).]

Page 43: Visualizing Mirrors - USTC

Cache 3: Cumu. Traffic over FileSize order by Req Num

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size, order by request num DESC (0 to 1.4e+13); y-axis: cumulative traffic (0 to 2.5e+13).]

Page 44: Visualizing Mirrors - USTC

Cache 4: Cumu. Traffic over FileSize order by Traffic

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size, order by traffic of this file DESC (0 to 1.4e+13); y-axis: cumulative traffic (6e+12 to 2.6e+13).]

Page 45: Visualizing Mirrors - USTC

Cache 5: Cumu. Traffic order by Traffic/FileSize

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size, order by traffic/filesize DESC (0 to 1.4e+13); y-axis: cumulative traffic (0 to 2.5e+13).]

Page 46: Visualizing Mirrors - USTC

Comparison of the Previous five ‘Caching Policies’

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size, order by different metrics (0 to 1.4e+13); y-axis: cumulative traffic (0 to 2.5e+13). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC, Traffic/FileSize DESC.]

Page 47: Visualizing Mirrors - USTC

Comparison of ‘Caching Policies’ (log-scale)

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size, order by different metrics, log-scale (100 to 1e+14); y-axis: cumulative traffic (0 to 2.5e+13). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC, Traffic/FileSize DESC.]

Page 48: Visualizing Mirrors - USTC

Comparison of ‘Caching Policies’

- Among these static caching policies, caching the files with the most traffic or the most requests is acceptable.
- Caching the files that carried the most traffic in history performs well. A 10GB cache of 85 files can cover 40% of the total traffic. If the cache size continues to increase, cache efficiency deteriorates, since the x axis of the graph is in log-scale.
- Caching the files with the largest traffic/filesize ratio shows the best performance (by definition). A 10GB cache of 7162 files can cover 58% of the total traffic.
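Each ‘caching policy’ above amounts to sorting the files by some metric and reading off how much traffic the prefix that fits into the cache accounts for. A minimal sketch of that computation follows, assuming per-file size and historical traffic are already known; `cache_coverage`, the key functions and the toy numbers are hypothetical, not the author's implementation.

```python
def cache_coverage(files, cache_bytes, key):
    """Fill a static cache with files in descending `key` order.

    files: list of (size, traffic) tuples.
    Returns (files_cached, fraction_of_total_traffic_covered).
    """
    total_traffic = sum(traffic for _, traffic in files)
    used = covered = count = 0
    for size, traffic in sorted(files, key=key, reverse=True):
        if used + size > cache_bytes:
            break  # like the plotted curves: stop at the first file that does not fit
        used += size
        covered += traffic
        count += 1
    return count, (covered / total_traffic if total_traffic else 0.0)


def by_traffic(f):
    """Policy 4: order by the traffic this file carried."""
    return f[1]


def by_ratio(f):
    """Policy 5: order by traffic per byte stored (zero-size files rank last)."""
    return f[1] / f[0] if f[0] else 0.0


# Hypothetical toy data: (size, traffic) per file.
files = [(20, 500), (10, 200), (10, 150), (50, 100), (1, 50)]
print(cache_coverage(files, 21, by_traffic))
print(cache_coverage(files, 21, by_ratio))
```

On this toy data the ratio ordering covers more traffic than the traffic ordering at the same cache size, which mirrors the slide's observation for the 10GB cache.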

Page 49: Visualizing Mirrors - USTC

Details of Most Traffic Caching (log-scale)

[Figure: Cumulative Ratio over Cumulative FileSize. x-axis: cumulative file size, log-scale, order by traffic of this file DESC (1e+08 to 1e+14); y-axis: cumulative ratio (0 to 1). Series: Cumulative Traffic (%), Cumulative Requests (%), Cumulative File Num (%).]

Page 50: Visualizing Mirrors - USTC

Details of Traffic/FileSize Caching (log-scale)

[Figure: Cumulative Ratio over Cumulative FileSize. x-axis: cumulative file size, log-scale, order by traffic/filesize of this file DESC (1000 to 1e+14); y-axis: cumulative ratio (0 to 1). Series: Cumulative Traffic (%), Cumulative Requests (%), Cumulative File Num (%).]

Page 51: Visualizing Mirrors - USTC

An Alternate Metric: Request Hits

[Figure: Cumulative Requests over Cumulative FileSize. x-axis: cumulative file size, order by different metrics (0 to 1.4e+13); y-axis: cumulative requests (0 to 2e+08). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC.]

Page 52: Visualizing Mirrors - USTC

An Alternate Metric: Request Hits

[Figure: Cumulative Requests over Cumulative FileSize. x-axis: cumulative file size, order by different metrics, log-scale (100 to 1e+14); y-axis: cumulative requests (0 to 2e+08). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC.]

Page 53: Visualizing Mirrors - USTC

Num & Size of Never-accessed Files (% of total)

[Figure: Num & Size of Never-Accessed Files in each Distribution. y-axis: percentage of total (0 to 0.45). Series: Number, Size. x-axis categories (as extracted): freebsd, fedora, kde-applicationdata, debian, ubuntu, modules, opensuse, eclipse, authors, epel, gnome, gentoo-portage, CTAN, slackware, meego, backtrack, scientificlinux, bin, gentoo, archlinux, kde, web, src, qomo, macports, centos, debian-backports, Xorg, linux-kernel, gnu, mozilla-current, progress-linux, debian-cd, debian-security, cygwin, kdemod, linuxmint, debian-multimedia, tdf, loongson2f.]

Page 54: Visualizing Mirrors - USTC

..........

.....

......

.....

.....

.....

......

.....

.....

.....

......

.....

.....

.....

......

.....

......

.....

.....

.

Num & Size of Never-accessed Files (% of total)

[Figure: "Num & Size of Never-Accessed Files in each Distribution". y axis: Percentage of Total (0 to 0.45). Series: Number, Size. x-axis labels include freebsd, fedora, debian-cd, scientificlinux, debian, opensuse, meego, ubuntu, gnome, gentoo, epel, slackware, backtrack, linux-kernel, kde, eclipse, kde-applicationdata, ubuntu-releases, mozilla-current, bin, archlinux, centos, linuxmint-cd, turnkeylinux, qomo, kdemod, progress-linux, deepin-cd, knoppix-dvd, CTAN, debian-backports, gnu, src, tdf, puppy, loongson2f, debian-security, linuxmint, authors, deepin.]

Page 55: Visualizing Mirrors - USTC

Num & Size of Never-accessed Files (% of distribution)

[Figure: "Num & Size of Never-Accessed Files in each Distribution". y axis: Percentage of Distribution (0.55 to 1). Series: Number, Size. x-axis labels include clpa, html, modules, m odules, authors, bin, src, web, mozilla-current, doc, qomo, scripts, freebsd, meego, knoppix-dvd, ports, contrib, kde-applicationdata, linux-kernel, scientificlinux, knoppix, debian-cd, tdf, gnome, Xorg, slackware, turnkeylinux, backtrack, debian-volatile, eclipse, epel, dotdeb, progress-linux, macports, gentoo-portage, misc, indices, kde, gentoo, debian-multimedia.]

Page 56: Visualizing Mirrors - USTC

Num & Size of Never-accessed Files (% of distribution)

[Figure: "Num & Size of Never-Accessed Files in each Distribution". y axis: Percentage of Distribution (0.75 to 1). Series: Number, Size. x-axis labels include knoppix-dvd, mozilla-current, html, contrib, bin, src, m odules, modules, indices, doc, clpa, authors, ports, dotdeb, web, debian-volatile, meego, freebsd, slackware, scripts, misc, Xorg, gnome, epel, progress-linux, qomo, scientificlinux, knoppix, debian-cd, turnkeylinux, debian-multimedia, gentoo-portage, linuxmint, loongson2f, kdemod, linux-kernel, deepin, tdf, kde-applicationdata, kde.]

Page 57: Visualizing Mirrors - USTC

Num & Size of Ever-accessed Files (% of distribution)

[Figure: "Num & Size of Ever-Accessed Files in each Distribution". y axis: Percentage of Distribution (0 to 1). Series: Number, Size. x-axis labels include fink, CRAN, CPAN, help, cygwin, centos, puppy, ubuntu, deepin-cd, mirmon, debian, ubuntu-releases, gnu, linuxmint-cd, CTAN, debian-security, deepin, uksm-kernel, fedora, kdemod, opensuse, debian-backports, loongson2f, linuxmint, archlinux, debian-multimedia, gentoo, kde, indices, misc, gentoo-portage, macports, progress-linux, dotdeb, epel, eclipse, debian-volatile, backtrack, turnkeylinux, slackware.]

Page 58: Visualizing Mirrors - USTC

Num & Size of Ever-accessed Files (% of distribution)

[Figure: "Num & Size of Ever-Accessed Files in each Distribution". y axis: Percentage of Distribution (0 to 0.7). Series: Number, Size. x-axis labels include centos, cygwin, deepin-cd, puppy, CTAN, eclipse, ubuntu, mirmon, linuxmint-cd, macports, ubuntu-releases, gnu, debian-security, debian, fedora, archlinux, backtrack, opensuse, debian-backports, uksm-kernel, gentoo, kde, kde-applicationdata, tdf, deepin, linux-kernel, kdemod, loongson2f, linuxmint, gentoo-portage, debian-multimedia, turnkeylinux, debian-cd, knoppix, scientificlinux, qomo, progress-linux, epel, gnome, Xorg.]

Page 59: Visualizing Mirrors - USTC

Outline

1 Requests & Traffic
    By Time
    By IP
    By Other Measures
2 Files
    Files Characteristics
    How Files Are Requested
3 Sessions
4 Distributions Insight
    CentOS
    Fedora
    Ubuntu
    Eclipse
5 Technical Details
6 Query Optimization

Page 60: Visualizing Mirrors - USTC

Discovering Sessions

Definition. Two requests are within the same Interval Session iff:
- they have the same IP address;
- the time difference between them does not exceed some limit.

Definition. A Gap Session is a longest sequence of requests where:
- all requests are from the same IP address;
- the time difference between every two adjacent requests does not exceed some limit.
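The two definitions can be sketched in Python as follows. This is a minimal illustration, not the deck's PHP implementation: the function name, the (ip, timestamp) tuple layout, and the reading of an Interval Session as a window anchored at the first request of the session are all assumptions.

```python
from collections import defaultdict

def split_sessions(requests, limit, mode):
    """Group (ip, timestamp) pairs into sessions.

    mode="gap":      compare each request with the previous one in the session.
    mode="interval": compare each request with the first one in the session
                     (an assumed reading of the Interval Session definition).
    """
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)

    sessions = []
    for ip, stamps in by_ip.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            reference = current[-1] if mode == "gap" else current[0]
            if ts - reference > limit:  # limit exceeded: close the session
                sessions.append((ip, current))
                current = []
            current.append(ts)
        sessions.append((ip, current))
    return sessions

# Hypothetical requests: (ip, unix timestamp in seconds)
requests = [("10.0.0.1", 0), ("10.0.0.1", 1000), ("10.0.0.1", 2500),
            ("10.0.0.1", 4000), ("10.0.0.2", 50)]
gap = split_sessions(requests, limit=1800, mode="gap")
interval = split_sessions(requests, limit=3600, mode="interval")
```

With these sample requests, the gap variant keeps all four requests of 10.0.0.1 in one session because no single gap exceeds 1800 s, while the interval variant starts a new session at t=4000 because that request lies more than 3600 s after the first.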

Page 61: Visualizing Mirrors - USTC

Discovering Sessions (continued)

- Since Mirrors is a resource-downloading site, many sessions download many files for hours. The time limit of an Interval Session is 60 minutes.
- The time limit of a Gap Session is 30 minutes. A long download that extends over 30 minutes breaks the session in two.
- Both algorithms suffer from false positives and false negatives.
- Some IPs access so frequently that the gap session never ends (see the next slide), making the Gap Session data highly biased.

Page 62: Visualizing Mirrors - USTC

Average Session Duration over IP (log-scale)

[Figure: "Average Duration over Unique IPs". x axis: Percentage of unique IP (log-scale, 1 to 1e+07), sorted by Duration DESC; y axis: Average Duration (log-scale, 0.001 to 1e+07). Series: Gap Session, Interval Session.]

Page 63: Visualizing Mirrors - USTC

Average Session Duration over IP (normal-scale)

[Figure: "Average Duration over Unique IPs". x axis: Percentage of unique IP (linear, 0 to 1.6e+06), sorted by Duration DESC; y axis: Average Duration (log-scale, 0.001 to 1e+07). Series: Gap Session, Interval Session.]

Page 64: Visualizing Mirrors - USTC

Gap Session Statistics in a day

[Figure: "Session statistics at different Time in a day". x axis: Time in a day (00:00 to 00:00); y axis: Ratio of total (0 to 0.0016). Series, all Bezier smoothed: Session count, Request count, Duration, Traffic.]

Page 65: Visualizing Mirrors - USTC

Explanation

- How strange the three graphs look! Durations are always integers (seconds), so there are straight lines in the duration graphs.
- Many sessions are actually 'single' requests, so their duration is zero; their number can be seen in the normal-scale graph.
- About 40 sessions extend throughout the whole 51 days, and more sessions span fewer days. Their number is the length of the red horizontal line in the log-scale graph.
- Because the log starts at May 22 06:25:37, requests in these long-lived Gap Sessions accumulate to a horrible peak at 06:27 in the Gap Session statistics (much higher than shown, since the original data is noisy and the graph is Bezier-smoothed).

Page 66: Visualizing Mirrors - USTC

Cumulative Over Long-lived Gap Sessions

[Figure: "Cumulative Ratio of Requests, Duration and Traffic over Sessions". x axis: Sessions ordered by Duration DESC (0 to 10000); y axis: Cumulative Ratio of Total (0 to 0.35). Series: Request count, Duration, Traffic.]

Page 67: Visualizing Mirrors - USTC

Ratio to Average for Long-lived Gap Sessions (log-scale)

[Figure: "Requests, Duration and Traffic over Sessions". x axis: Sessions ordered by Duration DESC (log-scale, 1 to 10000); y axis: Ratio to Average (log-scale, 0.0001 to 1e+07). Series: Request count, Duration, Traffic.]

Page 68: Visualizing Mirrors - USTC

Statistics of Interval Sessions

            Total            Average      Max        Min     Std. Dev
Requests    328976877        7.5020       186043     1       173.1824
Traffic     36277 GB         867.46 KB    Overflow   0       15.227 MB
Duration    4.026 × 10^10    918.13 s     3599 s     -6 s    1502.9 s

- 43852053 sessions in total.
- Sessions with negative duration occur because log items are occasionally not in non-decreasing time order. I'm not going to fix it, since these 230 wrong sessions have little influence.

Page 69: Visualizing Mirrors - USTC

Statistics of Gap Sessions

            Total           Average      Max          Min      Std. Dev
Requests    328976877       6.6200       8432330      1        1255.9488
Traffic     35291 GB        744.67 KB    Overflow     0        15.184 MB
Duration    7.666 × 10^9    154.27 s     4406403 s    -27 s    7060.6 s

- 49694582 sessions in total.
- Shown for comparison with Interval Sessions.
- The following statistics are based on Interval Sessions unless otherwise noted.
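As a quick sanity check on the two statistics tables, each Average should equal the Total divided by the session count. The sketch below redoes that arithmetic with the figures from the slides; treating 1 GB as 2^20 KB is an assumption about the units, and the dictionary layout is only illustrative.

```python
# Totals and session counts as reported on the two statistics slides.
TOTAL_REQUESTS = 328_976_877
stats = {
    "interval": {"sessions": 43_852_053, "traffic_gb": 36_277, "duration_s": 4.026e10},
    "gap":      {"sessions": 49_694_582, "traffic_gb": 35_291, "duration_s": 7.666e9},
}

KB_PER_GB = 1024 * 1024  # assumption: binary units


def averages(s):
    """Return (requests, traffic in KB, duration in seconds) per session."""
    n = s["sessions"]
    return (TOTAL_REQUESTS / n, s["traffic_gb"] * KB_PER_GB / n, s["duration_s"] / n)


for name, s in stats.items():
    req, kb, dur = averages(s)
    print(f"{name:8s} requests={req:.4f}  traffic={kb:.2f} KB  duration={dur:.2f} s")
```

The recomputed values agree with the Average column up to the rounding of the reported totals.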

Page 70: Visualizing Mirrors - USTC

Cumulatives over Sessions order by Traffic DESC

[Figure: "Cumulative stat over Sessions order by Traffic DESC". x axis: Number of Sessions, ordered by traffic DESC (linear, 0 to 4.5e+07); y axis: Cumulative stat (0 to 1). Series: Requests count, Duration, Traffic.]

Page 71: Visualizing Mirrors - USTC

Cumulatives over Sessions order by Traffic DESC

[Figure: "Cumulative stat over Sessions order by Traffic DESC". x axis: Number of Sessions, ordered by traffic DESC (log-scale, 100 to 1e+08); y axis: Cumulative stat (0 to 1). Series: Requests count, Duration, Traffic.]

Page 72: Visualizing Mirrors - USTC

Cumulatives over Sessions order by Requests DESC

[Figure: "Cumulative stat over Sessions order by Requests DESC". x axis: Number of Sessions, ordered by Requests DESC (log-scale, 1 to 1e+08); y axis: Cumulative stat (0 to 1). Series: Requests count, Duration, Traffic.]

Page 73: Visualizing Mirrors - USTC

Cumulatives over Sessions order by Duration DESC

[Figure: "Cumulative stat over Sessions order by duration DESC". x axis: Number of Sessions, ordered by Duration DESC (log-scale, 1 to 1e+08); y axis: Cumulative stat (0 to 1). Series: Requests count, Duration, Traffic.]

Page 74: Visualizing Mirrors - USTC

Cumulative Session Duration over Unique IPs

[Figure: "Cumulative Interval Session Duration over Unique IPs". x axis: Percentage of unique IP (log-scale, 1 to 1e+07), sorted by Duration DESC; y axis: Cumulative Duration (0 to 4e+10).]

Page 75: Visualizing Mirrors - USTC

Sessions in a day

[Figure: "Session statistics at different Time in a day". x axis: Time in a day (00:00 to 00:00); y axis: Ratio of total (0 to 0.0016). Series, all Bezier smoothed: Session count, Request count, Duration, Traffic.]

Page 76: Visualizing Mirrors - USTC

Sessions across days

[Figure: "Session statistics across days". x axis: Date (05-20 to 07-15); y axis: Ratio of total (0 to 0.03). Series: Session count, Request count, Duration, Traffic.]

Page 77: Visualizing Mirrors - USTC

Sessions in Distributions order by Traffic

[Figure: "Sessions in each Distribution (order by Traffic)". y axis: Percentage (0 to 0.6). Series: Number, Requests, Duration, Traffic. x-axis labels: eclipse, fedora, centos, ubuntu, debian, tdf, cygwin, CTAN, archlinux, mozilla-current, opensuse, gentoo, kde-applicationdata, Ubuntu, kde, backtrack, ubuntu-releases, gnu, CRAN, NULL.]

Page 78: Visualizing Mirrors - USTC

Sessions in Distributions order by Requests

[Figure: "Sessions in each Distribution (order by Requests)". y axis: Percentage (0 to 0.6). Series: Number, Requests, Duration, Traffic. x-axis labels: centos, NULL, eclipse, fedora, ubuntu, ubuntu-releases, mozilla-current, tdf, debian, CTAN, backtrack, opensuse, Ubuntu, deepin-cd, kde-applicationdata, linuxmint, debian-cd, cygwin, archlinux, kde.]

Page 79: Visualizing Mirrors - USTC

Sessions in Distributions order by Session Num

[Figure: "Sessions in each Distribution (order by Session Num)". y axis: Percentage (0 to 0.6). Series: Number, Requests, Duration, Traffic. x-axis labels: NULL, centos, fedora, eclipse, ubuntu, mozilla-current, debian, ubuntu-releases, kde-applicationdata, opensuse, epel, tdf, archlinux, CTAN, cygwin, CRAN, backtrack, deepin-cd, gnu, kde.]

Page 80: Visualizing Mirrors - USTC

Sessions in Distributions order by Duration

[Figure: "Sessions in each Distribution (order by Duration)". y axis: Percentage (0 to 0.6). Series: Number, Requests, Duration, Traffic. x-axis labels: centos, NULL, fedora, eclipse, ubuntu, mozilla-current, debian, ubuntu-releases, epel, tdf, archlinux, opensuse, CTAN, cygwin, kde-applicationdata, backtrack, gnu, deepin-cd, kde, gentoo.]

Page 81: Visualizing Mirrors - USTC

Sessions and User Agents

[Figure: "Sessions in each User Agent (order by Session Num DESC)". y axis: Percentage (0 to 0.6). Series: Number, Requests, Duration, Traffic. x-axis labels: NULL, urlgrabber, Mozilla, Jakarta, Debian, Ubuntu, ZYpp, Java, pacman, Wget, Python, Eclipse, curl, libwww, Homebrew, Opera, Fedora, Preupgrade, MirrorBrain, Sosospider.]

Page 82: Visualizing Mirrors - USTC

Outline

1 Requests & Traffic
    By Time
    By IP
    By Other Measures
2 Files
    Files Characteristics
    How Files Are Requested
3 Sessions
4 Distributions Insight
    CentOS
    Fedora
    Ubuntu
    Eclipse
5 Technical Details
6 Query Optimization

Page 83: Visualizing Mirrors - USTC

CentOS Basic Stat

            Value        % of total   Rank
Requests    100986986    30.7%        1
Traffic     5252.5 GB    14.2%        4
Files       70043        0.7%         19
FileSize    172.8 GB     1.3%         12
Sessions    19452260     44.4%        1

Page 84: Visualizing Mirrors - USTC

CentOS Basic Stat - Average

                    Average     Ratio     Std.Dev       Ratio
Request Length      55846       0.464     882920        0.289
File Size           2665431     1.953     54320911      2.294
File Requests       1402.46     76.994    92327         11.029
File Traffic        74454404    32.12     1231957014    1.784
Session Duration    1051.19     1.145     1586.69       1.056
Session Requests    5.796       0.773     137.5         0.794
Session Traffic     336791      0.379     8931320       0.559

- Std.Dev stands for Standard Deviation.
- 'Ratio' in the third column is the ratio of the distribution's average to the global average; 'Ratio' in the fifth column is the ratio of the distribution's Std.Dev to the global Std.Dev.
- Size & Traffic are in bytes, Duration is in seconds.
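To make the two Ratio columns concrete, the sketch below recomputes them for the three session rows, using the global Interval Session averages and standard deviations reported on the statistics slide earlier in this section. The dictionary layout and the conversion of KB and MB to bytes with binary units are assumptions made for illustration.

```python
# Global Interval Session figures (average, std. dev) from the statistics slide.
GLOBAL = {
    "Session Duration": (918.13, 1502.9),                        # seconds
    "Session Requests": (7.5020, 173.1824),
    "Session Traffic":  (867.46 * 1024, 15.227 * 1024 * 1024),   # bytes, binary units assumed
}
# CentOS figures (average, std. dev) from the CentOS average table.
CENTOS = {
    "Session Duration": (1051.19, 1586.69),
    "Session Requests": (5.796, 137.5),
    "Session Traffic":  (336791, 8931320),
}


def ratios(dist, glob):
    """Ratio of a distribution's average and std. dev to the global values."""
    return {k: (dist[k][0] / glob[k][0], dist[k][1] / glob[k][1]) for k in dist}


centos_ratios = ratios(CENTOS, GLOBAL)
for name, (avg_ratio, std_ratio) in centos_ratios.items():
    print(f"{name:17s} avg ratio={avg_ratio:.3f}  std ratio={std_ratio:.3f}")
```

The results match the table's Ratio columns to three decimals, which suggests the global reference for the session rows is the Interval Session statistics.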

Page 85: Visualizing Mirrors - USTC

Requests & Traffic among CentOS Versions

[Figure: "Request & Traffic of Subdirectories in centos". y axis: Percentage (0 to 0.7). Series: Requests, Traffic. x-axis labels include 5.8, 6.2, 5, 6.3, 6, build, 4.4, 5.1, 3.5, 3.4, 3.3, 3.1, 2.1, 6Server, dostools, 5.0, 4.7, 4.5, 4.0, 6.1, 4.6, 3, graphics, 4.9, 2, %E6%96%87%E7%AB%A0%E5%87%BA%E5%A4%84%EF%BC%9A%E9%A, 4.8, HEADER.images, 4.2, 3.9, 5.3, NULL, 6.0, 5.2, 5Server, 5.7, 4, 5.6, 5.4, %24releasever.]

Page 86: Visualizing Mirrors - USTC

Requests & Traffic among CentOS 2-level Subdirs

[Figure: "Request & Traffic of Subdirectories in centos". y axis: Percentage (0 to 0.3). Series: Requests, Traffic. x-axis labels include 5.8/updates, 6.2/updates, 5.8/os, 6.2/os, 5/updates, 6.3/os, 5/os, 6/os, 6/updates, 5.8/extras, 6.3/updates, 5/extras, 6.2/centosplus, 5/addons, 6.2/isos, 5.8/centosplus, 5.8/isos, 5/centosplus, 6/centosplus, 6.3/isos, 6.3/centosplus, 6.2/extras, NULL, 5.7/isos, 5/contrib, 6/isos, 6/extras, 6.0/isos, NULL, 5.8/addons, 5/isos, 6.3/extras, 5.7/updates, 5.7/os, 5.7/extras, 6.2/fasttrack, 6/fasttrack, 5/fasttrack, 5.8/fasttrack, 5.5/os.]

Page 87: Visualizing Mirrors - USTC

Fedora Basic Stat

            Value        % of total   Rank
Requests    36329528     11.0%        4
Traffic     7509.2 GB    20.3%        2
Files       924620       8.8%         2
FileSize    1207.8 GB    9.4%         2
Sessions    3596727      8.2%         2

Page 88: Visualizing Mirrors - USTC

Fedora Basic Stat - Average

                    Average     Ratio    Std.Dev      Ratio
Request Length      221945      1.843    2277229      0.744
File Size           1407150     1.031    21376814     0.903
File Requests       28.3134     1.554    6563         0.784
File Traffic        3601714     1.554    190851795    0.276
Session Duration    1226        1.336    1598         1.063
Session Requests    10.8047     1.440    251.43       1.452
Session Traffic     2238126     2.520    21424912     1.342

Page 89: Visualizing Mirrors - USTC

Requests & Traffic among Fedora Subdirs

[Figure: "Request & Traffic of Subdirectories in fedora". y axis: Percentage (0 to 0.8). Series: Requests, Traffic. x-axis labels include linux, epel, rpmfusion, web, huangou.php?id=1, js, xampp, Jingdian, phpcms, Examples, abc-abc-abc-$%7Bprint(md5(base64_decode(MzYwd2Vic, ?s=abc, abc,abc,abc,$%7Bprint(md5(base64_decode(MzYwd2Vic, member, management, index.php?-dauto_prepend_file%3d, tools, phpsso_server, go.php?a=, 5Client, testing, 5Server, beta, fedora-updates-USTC.mirrors4.repo%20-O%20, fedora-updates-USTC.mirrors6.repo%20-O%20, repodata, e%5Bel, fedora-USTC.mirrors4.repo%20-O%20, apel, pub, NULL, releases, 4, epeel, 4AS, development, updates, 6, 5.]

Page 90: Visualizing Mirrors - USTC

Requests & Traffic among Fedora 2-level Subdirs

[Figure: "Request & Traffic of Subdirectories in fedora". y axis: Percentage (0 to 0.5). Series: Requests, Traffic. x-axis labels include linux/updates, epel/5, linux/development, linux/releases, epel/6, rpmfusion/free, rpmfusion/nonfree, epel/5Server, epel/testing, epel/4, epel/beta, epel/4WS, epel/5Client, epel/4AS, epel/4ES, releases/12, linux/core, linux/extras, releases/17, epel/x86_64, linux/s, releases/test, linux/developement, and many NULL entries.]

Page 91: Visualizing Mirrors - USTC

Ubuntu Basic Stat

            Value        % of total   Rank
Requests    19193451     5.8%         5
Traffic     5653.1 GB    15.3%        3
Files       608361       5.8%         5
FileSize    584.6 GB     4.6%         6
Sessions    160800       0.4%         4

Page 92: Visualizing Mirrors - USTC

Ubuntu Basic Stat - Average

                    Average     Ratio     Std.Dev      Ratio
Request Length      316251      2.626     2950601      0.964
File Size           1100038     0.806     8926254      0.377
File Requests       22.2714     1.222     860.6907     0.103
File Traffic        8152155     3.517     551019922    0.798
Session Duration    636.37      0.693     1053.94      0.701
Session Requests    110.93      14.788    294.68       1.702
Session Traffic     34260641    38.570    101247464    6.341

Page 93: Visualizing Mirrors - USTC

Requests & Traffic among Ubuntu Subdirs

[Figure: "Request & Traffic of Subdirectories in ubuntu". y axis: Percentage (0 to 0.8). Series: Requests, Traffic. x-axis labels include pool, dists, NULL, ubuntu, intrepid-security, intrepid-updates, %20intrepid-, %20natty, archive, ubuntu-cn, %20gutsy%20main%20multiverse%20restricted%20univer, natty, jaunty, %20gutsy%20main%20, linux, maverick, %20precise, repodata, feisty-backports, feisty-proposed, feisty-security, feisty-updates, %20natty-proposed, %20natty-security, intrepid-proposed, intrepid-backports, intrepid, updates, neverexistsa, karmic-updates, lucid-updates, lucid-security, lucid-proposed, lucid-backports, lucid%20gutsy%20main%20multiverse%20restricted%20univer, feisty, precise-updates, precise-security, precise-proposed.]

Page 94: Visualizing Mirrors - USTC

Requests & Traffic among Ubuntu 2-level Subdirs

[Figure: "Request & Traffic of Subdirectories in ubuntu". y axis: Percentage (0 to 0.6). Series: Requests, Traffic. x-axis labels include pool/main, pool/universe, dists/precise, dists/lucid, dists/oneiric, dists/natty, pool/restricted, pool/multiverse, dists/hardy, dists/maverick, dists/precise-updates, dists/lucid-updates, dists/quantal, dists/oneiric-updates, dists/lucid-security, dists/precise-security, dists/precise-proposed, dists/natty-updates, NULL, dists/oneiric-security, dists/natty-security, NULL, dists/hardy-updates, dists/maverick-updates, dists/lucid-proposed, dists/precise-backports, dists/oneiric-proposed, dists/hardy-security, dists/lucid-backports, dists/maverick-security, dists/natty-proposed, dists/hardy-backports, dists/oneiric-backports, ubuntu/pool, dists/maverick-proposed, ubuntu/dists, dists/natty-backports, NULL, dists/hardy-proposed, dists/maverick-backports.]

Page 95: Visualizing Mirrors - USTC

Eclipse Basic Stat

            Value         % of total   Rank
Requests    51279473      15.6%        3
Traffic     11498.6 GB    31.2%        1
Files       263355        2.5%         8
FileSize    161.1 GB      1.3%         14
Sessions    524586        1.2%         3

Page 96: Visualizing Mirrors - USTC

Eclipse Basic Stat - Average

                    Average     Ratio     Std.Dev       Ratio
Request Length      240769      2.000     5813562       1.901
File Size           700002      0.513     8457323       0.357
File Requests       113.5340    6.233     17894.2       2.137
File Traffic        28156070    12.148    4197033259    6.079
Session Duration    673.17      0.733     1059.08       0.704
Session Requests    97.1089     12.944    782.80        4.520
Session Traffic     22130605    24.914    62746610      3.930

Page 97: Visualizing Mirrors - USTC

Requests & Traffic among Eclipse Subdirs

[Figure: "Request & Traffic of Subdirectories in eclipse". y axis: Percentage (0 to 0.6). Series: Requests, Traffic. x-axis labels include technology, eclipse, releases, tools, mylyn, webtools, arch, birt, mat, modeling, windowbuilder, rt, hudson, tptp, koneki, edt, virgo, datatools, e4, equinox, orion, jetty, tm, management, facet, scout, eclipse.org-common, m2e-wtp, java-ee-config, gyrex, graphiti, bpmn2-modeler, jpa, stem, mpc, egf, stp, gemini, NULL, among others.]

Page 98: Visualizing Mirrors - USTC

Requests & Traffic among Eclipse 2-level Subdirs

[Figure: "Request & Traffic of Subdirectories in eclipse". y axis: Percentage (0 to 0.6). Series: Requests, Traffic. x-axis labels include technology/epp, eclipse/downloads, releases/indigo, eclipse/updates, releases/juno, tools/cdt, mylyn/drops, webtools/downloads, NULL, releases/helios, birt/downloads, releases/galileo, mat/1.1.1, windowbuilder/WB, mat/1.2.0, modeling/tmf, modeling/emf, releases/ganymede, hudson/war, modeling/mdt, edt/releases, technology/subversive, tptp/4.7.2, virgo/release, koneki/products, birt/update-site, rt/eclipselink, tools/aspectj, tools/gef, equinox/drops, datatools/updates, datatools/downloads, e4/sdk, orion/drops, tools/pdt, rt/rap, webtools/updates, technology/epf, e4/downloads, tools/orbit.]

Page 99: Visualizing Mirrors - USTC

Outline

1 Requests & Traffic
    By Time
    By IP
    By Other Measures
2 Files
    Files Characteristics
    How Files Are Requested
3 Sessions
4 Distributions Insight
    CentOS
    Fedora
    Ubuntu
    Eclipse
5 Technical Details
6 Query Optimization

Page 100: Visualizing Mirrors - USTC

Data Source

- Nginx access log of mirrors.ustc.edu.cn
  - From 2012-05-22 to 2012-07-12, 51 days
  - 4041 MB compressed, 62383 MB decompressed
  - Thanks to Guo JiaHua for providing the data
- File list of mirrors.ustc.edu.cn crawled by a spider
  - FTP for CPAN and CRAN
  - HTTP for other directories
  - Need to detect symlinks to the parent dir (e.g. /ubuntu/ubuntu/…)
- Scripts are written in bash and PHP. All scripts are available at GitHub: https://github.com/bojieli/mirrors-log

Page 101: Visualizing Mirrors - USTC

Saving Data in BRIGHTHOUSE

- Scanning through such an amount of data is time-consuming, and the data is not too large to fit in a relational database.
- I tried InfoBright and InfiniDB with 6.4 × 10^6 rows of artificial data: InfiniDB is faster in queries, while InfoBright takes much less disk space.
  - InfoBright's compression is comparable to gzip: 4316 MB table size, compared to 4041 MB gzipped, even though I added some additional columns for faster statistics.
  - I do not have much disk space, so I chose InfoBright (a backend of MySQL).
  - Most of the queries I used (mostly GROUP BY, WHERE) take less than 2 minutes.
- Create two FIFOs for each log: zcat logfile > php (preprocessing) > mysql-ib LOAD DATA INFILE
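The streaming idea behind the FIFO pipeline (decompress, preprocess and hand rows to the loader without storing the decompressed log) can be sketched in Python. This is only an illustration under assumptions: the deck's pipeline uses zcat and a PHP script, while here one hypothetical function `stream_log` plays both roles and writes tab-separated rows to any writable text stream. On Unix that stream could be a named pipe created with `os.mkfifo`; the demo below uses an in-memory sink instead.

```python
import gzip
import io
import os
import tempfile


def stream_log(gz_path, out, transform):
    """Decompress gz_path line by line, transform each line into a list of
    fields, and write the fields to `out` as one tab-separated row.
    Lines for which transform returns None are skipped."""
    count = 0
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as src:
        for line in src:
            row = transform(line.rstrip("\n"))
            if row is not None:
                out.write("\t".join(row) + "\n")
                count += 1
    return count


# Demo with a throwaway gzip file and an in-memory sink (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "access.log.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("a b\nc d\n")
    sink = io.StringIO()
    written = stream_log(path, sink, lambda line: line.split(" "))
```

Because the log is read and written one line at a time, memory use stays flat regardless of the log size.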

Page 102: Visualizing Mirrors - USTC

Preprocessing

Make queries faster (I'm afraid of full-table scans):
- ip => integers: ipv4 ipv4_0 ipv4_1 ipv4_2 ipv4_3 ipv6_0 ipv6_1 ipv6_2 ipv6_3
- time => integers: time year yearday weekday daymin daytime hour
- status (200, 403, 404…)
- length (filesize)
- url => substrings: url_0 … url_9 filename extension
- referer (I do not want to analyze it, for Mirrors is not a site of user-interaction)
- ua => ua_0 (Mozilla, Ubuntu…), ua (full)
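A sketch of how one log line could be split into these columns is shown below. It is written in Python rather than the PHP the deck uses, and it rests on several assumptions: the log is taken to be in nginx's default "combined" format, `ipv6_0` to `ipv6_3` are taken to be the four 32-bit words of the address, `daymin` is taken to be the minute of the day, and the sample line is invented. The `daytime` and `referer` columns are left out.

```python
import ipaddress
import re
from datetime import datetime

# nginx "combined" format (assumed):
# ip - user [time] "METHOD url PROTO" status bytes "referer" "user agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<length>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def preprocess(line):
    """Turn one log line into a dict of derived columns, or None if it does not parse."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    row = {}
    ip = ipaddress.ip_address(m["ip"])
    if ip.version == 4:
        row["ipv4"] = int(ip)
        for i, octet in enumerate(m["ip"].split(".")):
            row[f"ipv4_{i}"] = int(octet)
    else:
        for i in range(4):  # four 32-bit words, most significant first
            row[f"ipv6_{i}"] = (int(ip) >> (96 - 32 * i)) & 0xFFFFFFFF
    t = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    row.update(time=int(t.timestamp()), year=t.year, yearday=t.timetuple().tm_yday,
               weekday=t.weekday(), hour=t.hour, daymin=t.hour * 60 + t.minute)
    row["status"] = int(m["status"])
    row["length"] = int(m["length"])
    parts = [p for p in m["url"].split("/") if p]
    for i in range(10):  # url_0 … url_9, None (NULL) when the path is shorter
        row[f"url_{i}"] = parts[i] if i < len(parts) else None
    filename = parts[-1] if parts else None
    row["filename"] = filename
    row["extension"] = filename.rsplit(".", 1)[1] if filename and "." in filename else None
    row["ua"] = m["ua"]
    row["ua_0"] = re.split(r"[/ ]", m["ua"], maxsplit=1)[0]
    return row


sample = ('192.0.2.10 - - [22/May/2012:06:25:37 +0800] '
          '"GET /centos/6.2/os/x86_64/repodata/repomd.xml HTTP/1.1" '
          '200 3731 "-" "urlgrabber/3.9.1 yum/3.2.29"')
row = preprocess(sample)
```

Storing the leading path components as separate columns means a per-distribution query can filter or group on `url_0` directly instead of pattern-matching the full URL.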

Page 103: Visualizing Mirrors - USTC

Preprocessing

- PHP is slow … only 3 MB/s.
- At first preg_match (regular expressions) took 85% of the time; after I optimized the regexp it takes only 25%.
- Xdebug shows that stream_get_line (fgets) and fputs take about 50% of the total time.
- InfoBright's data loading speed is 15 MB/s (crawl_http.log). I don't think PHP's preprocessing work is harder than the database's…
- Maybe PHP's interpretive nature makes it much slower than C. Can anyone give a benchmark for Python etc.?

Page 104: Visualizing Mirrors - USTC

.. Files Table

Many files on the mirrors are never accessed, so we have to make a full list of the files on the mirrors.
The preprocessed url, filename, extension and filesize are recorded for each file.
When processing logs and files, take care of escape characters and the VARCHAR maximum length. Filenames should not be restricted to a simple regular expression, since there are always exceptions: UTF-8 strings in malicious requests, ",v" files in CVS …
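
A minimal sketch of building such a file list in Python, assuming the mirror tree is available as a local directory. The function name `list_files` and the choice to skip symlinks are assumptions, not details from the slides.

```python
import os
import stat


def list_files(root):
    """Yield (url, filename, extension, filesize) for each regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and special files (an assumption)
            url = "/" + os.path.relpath(full, root).replace(os.sep, "/")
            # Split on the last dot only, with no character whitelist, so odd
            # names such as "main.c,v" are kept as they are.
            extension = name.rsplit(".", 1)[1] if "." in name else ""
            yield url, name, extension, st.st_size
```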

Page 105: Visualizing Mirrors - USTC

.. GNUPLOT

GNUPLOT is pretty flexible in plotting; I consulted the documentation and online demos many times.
However, GNUPLOT is not flexible in processing data. It is strongly typed: integers and strings need to be converted explicitly. And the integer type is limited to −2^31 … 2^31 − 1.
When the query result needs postprocessing, I write a simple sed s/regexp/replace/ for simple replacement, or awk NR%n==0 for sampling. For accumulation or anything more complex, a PHP script goes in between.
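
The two postprocessing steps mentioned above (sampling and accumulation) could look like this in Python. This is a sketch with hypothetical helper names, not the sed, awk or PHP code actually used for the slides.

```python
from itertools import accumulate


def sample_every(rows, n):
    """Keep every n-th row, like awk 'NR%n==0' (NR counts from 1)."""
    return [row for i, row in enumerate(rows, start=1) if i % n == 0]


def cumulative_share(values):
    """Running total divided by the grand total, e.g. for a CDF-style plot."""
    total = sum(values)
    return [running / total for running in accumulate(values)]
```

Python integers are arbitrary-precision, so byte counts beyond 2^31 − 1 can be summed here before the result is handed to the plotting tool.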

Page 106: Visualizing Mirrors - USTC

.. Discovering Sessions

Scan the log sequentially. If a request fits into an existing session, update that session; otherwise create a new one, flushing the timed-out session if one exists.
Garbage collection of timed-out sessions: if (count(array) > gc_limit) { unset timed-out sessions; gc_limit = count(array) * 1.5; }

Inspired by the JavaScript GC algorithm in IE7: if the recycled memory is less than 15% of the total, the limit is doubled; if it is more than 85%, the limit is reset to its initial value.

Maintain a query buffer of 10000 rows to reduce the number of queries.
The PHP script takes 4 hours, 15 times slower than C (see the following section).
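
The scan and the garbage-collection rule can be sketched as follows. This is a Python illustration, not the PHP or C program from the slides. The session key, the 1800-second timeout, the tuple layout and the `max()` guard on `gc_limit` are all assumptions made for the example.

```python
def discover_sessions(requests, timeout=1800, initial_gc_limit=1000):
    """Group requests into sessions.

    requests: iterable of (timestamp, key, length) in non-decreasing time order.
    Returns a list of (key, start, end, request_count, total_bytes).
    """
    open_sessions = {}  # key -> [start, last_seen, request_count, total_bytes]
    finished = []
    gc_limit = initial_gc_limit
    for ts, key, length in requests:
        session = open_sessions.get(key)
        if session is not None and ts - session[1] <= timeout:
            # The request fits into the existing session: update it.
            session[1] = ts
            session[2] += 1
            session[3] += length
        else:
            if session is not None:
                finished.append((key, *session))  # flush the timed-out session
            open_sessions[key] = [ts, ts, 1, length]
        if len(open_sessions) > gc_limit:
            # Garbage collection: flush every session that has timed out.
            expired = [k for k, s in open_sessions.items() if ts - s[1] > timeout]
            for k in expired:
                finished.append((k, *open_sessions.pop(k)))
            # gc_limit = count * 1.5 as on the slide; the max() guard is an addition.
            gc_limit = max(initial_gc_limit, int(len(open_sessions) * 1.5))
    for k, s in open_sessions.items():
        finished.append((k, *s))
    return finished
```

Raising the limit relative to the number of surviving sessions keeps the collector from running on every request when most sessions are still alive.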

Page 107: Visualizing Mirrors - USTC

.. Some Bugs Due to DIE Time Functions (PHP)

PHP: I found some sessions with negative session length. The only explanation is that the logs are not in increasing time order. I dived into the log table, only to find that one timestamp matches records from both April 31 and May 1.
In fact, PHP's mktime() takes separate fields much like the members of the struct tm passed to mktime() in C, but its month runs from 1 to 12 rather than 0 to 11. PHP's strptime(), however, is a thin wrapper around its C counterpart and returns the month as 0 to 11.
I did not dare to use DATETIME in MySQL, because arithmetic on such timestamps is tricky; I would rather implement it myself in SQL or PHP.
Thank goodness, there was no timezone problem this time.
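
The off-by-one can be reproduced in a few lines of Python. This is only an illustration: `calendar.timegm` stands in for PHP's mktime(), the function names are hypothetical, and the example relies on `calendar.timegm` carrying an out-of-range day into the next month, much as the C mktime() normalizes its input.

```python
import calendar


def buggy_timestamp(year, tm_mon, day):
    """Pass a C-style month (0..11) straight to a 1..12 API: the bug."""
    return calendar.timegm((year, tm_mon, day, 0, 0, 0))


def fixed_timestamp(year, tm_mon, day):
    """Convert the C-style month to 1..12 first."""
    return calendar.timegm((year, tm_mon + 1, day, 0, 0, 0))


# A record from May 31, as a C-style strptime() reports it: tm_mon = 4.
# The buggy call reads it as "April 31", which rolls over to May 1.
may_31_buggy = buggy_timestamp(2012, 4, 31)
may_1_true = calendar.timegm((2012, 5, 1, 0, 0, 0))
collides = may_31_buggy == may_1_true

# With the month corrected, May 31 sorts before June 1 as expected.
in_order = fixed_timestamp(2012, 4, 31) < fixed_timestamp(2012, 5, 1)
```

With the bug, the last day of a 31-day month lands on the same timestamp as the first day of the following period, which is how sessions with negative length can appear.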

Page 108: Visualizing Mirrors - USTC

.. Some Bugs Due to DIE Time Functions (C)

C: malloc() fell into a deadlock. So strange: I spent an evening in GDB and found no answer. Then I changed some code and accidentally got a segmentation fault in time().
In fact, time() takes a parameter of type time_t * and it is dynamically linked. I called it without any argument, so time() treated the garbage on the stack as its parameter. If that garbage is NULL, nothing happens; if not, the address is treated as a pointer to time_t and written to, and unpredictable things happen.

Page 109: Visualizing Mirrors - USTC

.. Outline

1 Requests & Traffic
    By Time
    By IP
    By Other Measures
2 Files
    Files Characteristics
    How Files Are Requested
3 Sessions
4 Distributions Insight
    CentOS
    Fedora
    Ubuntu
    Eclipse
5 Technical Details
6 Query Optimization

Page 110: Visualizing Mirrors - USTC

.. Query Optimization

select count(*), ipv4, count(*) as c, sum(length) as s,
  concat(ipv4_0,'.',ipv4_1,'.',ipv4_2,'.',ipv4_3)
from log where ipv4 is not null group by ipv4 order by s limit 40;
Query_time: 798.179622

2012-07-30 16:32:12 Cnd(0): VC:0(t0a0) IS NOT NULL (0)
2012-07-30 16:33:02 Aggregating: 318575688 tuples left.
2012-07-30 16:45:29 Aggregated (1415554 gr). Omitted packrows: 0 + 0 partially, out of 5020 total.
2012-07-30 16:45:29 Heap Sort initialized for 1415554 rows, 8+61 bytes each.
2012-07-30 16:45:30 Total data packs actually loaded (approx.): 35139

Infobright is column-based, so cross-column queries are slow. Try to generate the IP string from the 'ipv4' field alone.

Page 111: Visualizing Mirrors - USTC

.. Query Optimization (continued)

SELECT ipv4, COUNT(*)/$COUNT, SUM(length)/$LENGTH AS c,
  CONCAT((ipv4 & 255<<24)>>24, '.', (ipv4 & 255<<16)>>16, '.', (ipv4 & 255<<8)>>8, '.', (ipv4 & 255))
FROM log WHERE ipv4 IS NOT NULL GROUP BY ipv4 ORDER BY c DESC LIMIT 40;
Query_time: 585.069205
InfoBright packs column data into data packs and pre-computes MAX, MIN, AVG and GROUP data for each pack. A WHERE clause requires these data to be re-computed:

2012-07-30 14:33:06 Cnd(0): VC:0(t0a0) IS NOT NULL (0)
2012-07-30 14:33:55 Aggregating: 318575688 tuples left.
2012-07-30 14:42:47 Generating output.
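
The bit arithmetic in the CONCAT(...) expression can be checked outside the database. The Python sketch below mirrors it term by term; the function name `ipv4_to_dotted` is hypothetical. As in MySQL, the shift operators bind more tightly than `&`, so `ip & 255 << 24` means `ip & (255 << 24)`.

```python
import ipaddress


def ipv4_to_dotted(ip):
    """Rebuild the dotted string from the integer, as the CONCAT(...) does."""
    return ".".join(str(part) for part in (
        (ip & 255 << 24) >> 24,
        (ip & 255 << 16) >> 16,
        (ip & 255 << 8) >> 8,
        ip & 255,
    ))


# Round trip through the standard library as a sanity check.
example = int(ipaddress.ip_address("202.38.64.1"))
dotted = ipv4_to_dotted(example)
```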

Page 112: Visualizing Mirrors - USTC

.. Query Optimization (continued)

SELECT ipv4, COUNT(*)/$COUNT, SUM(length)/$LENGTH AS c,
  CONCAT((ipv4 & 255<<24)>>24, '.', (ipv4 & 255<<16)>>16, '.', (ipv4 & 255<<8)>>8, '.', (ipv4 & 255))
FROM log GROUP BY ipv4 ORDER BY c DESC LIMIT 40;
Query_time: 328.580979
The WHERE clause is removed, but the result is wrong (it includes a large NULL group, which stands for IPv6).

2012-07-30 14:45:41 Unoptimized expression near '/'
2012-07-30 14:45:41 Unoptimized expression near '/'
2012-07-30 14:45:41 Unoptimized expression near 'concat'
2012-07-30 14:45:41 Aggregating: 328976877 tuples left.

InfoBright calculates these columns for each tuple; no wonder it is slow. Keep the core subquery small. Moving these calculations outside may help.

Page 113: Visualizing Mirrors - USTC

.. Query Optimization (continued)

SELECT ipv4, c/$COUNT, s/$LENGTH,
  CONCAT((ipv4 & 255<<24)>>24, '.', (ipv4 & 255<<16)>>16, '.', (ipv4 & 255<<8)>>8, '.', (ipv4 & 255))
FROM (SELECT ipv4, COUNT(*) AS c, SUM(length) AS s FROM log GROUP BY ipv4 HAVING (ipv4 IS NOT NULL) ORDER BY s DESC LIMIT 40) AS t;
Query_time: 138.297943

Total data packs actually loaded (approx.): 10040

Use a HAVING clause to filter out NULL. The internal execution order is: FROM TABLE, JOIN, OUTER JOIN, WHERE, SELECT clause, GROUP BY, HAVING, ORDER BY, LIMIT, projection & output.
Turn on MySQL's query_cache: identical queries should not be re-executed when I change some GNUPLOT command and run the script again.
Is there any way to make the query faster? I don't know.

Page 114: Visualizing Mirrors - USTC

.. Query Optimization (continued)

Another example of moving expressions 'outside' the inner query:
SELECT ipv4_0, c/$COUNT, s/$LENGTH FROM (SELECT IFNULL(ipv4_0, 'IPv6') AS ipv4_0, COUNT(*) AS c, SUM(length) AS s FROM log GROUP BY ipv4_0 ORDER BY s DESC LIMIT 40) AS t;
Query_time: 263.048662
SELECT IFNULL(ipv4_0, 'IPv6'), c/$COUNT, s/$LENGTH FROM (SELECT ipv4_0, COUNT(*) AS c, SUM(length) AS s FROM log GROUP BY ipv4_0 ORDER BY s DESC LIMIT 40) AS t;
Query_time: 84.589837
I could not believe my eyes when I first saw these figures.
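
That the two forms return the same rows can be checked on a toy table. The sketch below uses Python's built-in `sqlite3` module as a stand-in, so it says nothing about Infobright's timings. The table contents are made up for illustration, and the `/$COUNT` and `/$LENGTH` divisions are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ipv4_0 INTEGER, length INTEGER)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [(202, 100), (202, 300), (114, 50), (None, 500), (None, 700)],  # NULL = IPv6
)

# Variant 1: IFNULL is evaluated inside the aggregating subquery.
EXPR_INSIDE = """
SELECT ipv4_0, c, s FROM (
    SELECT IFNULL(ipv4_0, 'IPv6') AS ipv4_0, COUNT(*) AS c, SUM(length) AS s
    FROM log GROUP BY ipv4_0 ORDER BY s DESC LIMIT 40) AS t
"""

# Variant 2: the subquery only groups and aggregates; IFNULL then runs
# on at most 40 result rows.
EXPR_OUTSIDE = """
SELECT IFNULL(ipv4_0, 'IPv6'), c, s FROM (
    SELECT ipv4_0, COUNT(*) AS c, SUM(length) AS s
    FROM log GROUP BY ipv4_0 ORDER BY s DESC LIMIT 40) AS t
"""

rows_inside = conn.execute(EXPR_INSIDE).fetchall()
rows_outside = conn.execute(EXPR_OUTSIDE).fetchall()
same_result = set(rows_inside) == set(rows_outside)
```

The rewrite changes only where the expression is evaluated, not what is returned, so any speed difference comes from how the engine handles per-tuple expressions.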

Page 115: Visualizing Mirrors - USTC

.. Query Optimization is Not Everything

SELECT url_0, COUNT(*)/$COUNT, SUM(files.filesize)/$LENGTH AS c FROM files WHERE NOT EXISTS (SELECT * FROM log WHERE log.filename = files.filename) GROUP BY url_0 ORDER BY c DESC LIMIT 40;
The query ran for 12 hours and I had to kill it. Infobright treats the NOT EXISTS clause as a dependent subquery and might have to execute it for each row.
I do not know how to use cursors and stored procedures in MySQL, and more importantly Infobright does not support dynamic updates of tables, so SELECT INTO is impossible.
I wrote a C program (using the mysqlclient API) to count the occurrences of each url in log and write them into another table.
Performance:

10M rows in the files table, 329M rows in the log table
17.4 minutes in total
440000 rows per second in the Probe phase
460 MB of memory usage (360 MB resident)

Page 116: Visualizing Mirrors - USTC

.. Simulating Hash JOIN

Only after I finished programming did I find that my algorithm is actually called Hash JOIN. It is a fast JOIN algorithm for RDBMSs. "The task is to find, for each distinct value of the join attribute, the set of tuples in each relation which have that value." (Wikipedia)
My task is to count the occurrences in the log table of each url in the files table.

Build step: traverse the files table and add the url field of each row to a hash table.
Probe step: traverse the log table and increment the counter of the corresponding hash slot.
Output step: traverse the files table again and output the corresponding counters. The output is sent through a FIFO to LOAD DATA INFILE.

Easy to shard horizontally. The steps are similar to Map, Shuffle and Reduce.
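
The three steps can be sketched with an ordinary dictionary. This Python version is only an illustration of the build, probe and output phases, not the C program described on the slides. The function name `count_occurrences` is hypothetical, and both inputs are plain in-memory sequences rather than database tables.

```python
def count_occurrences(file_urls, log_urls):
    """Count how often each url from the files table appears in the log."""
    # Build step: one counter per url in the files table.
    counters = {url: 0 for url in file_urls}
    # Probe step: look up every logged url and bump its counter if known.
    for url in log_urls:
        if url in counters:
            counters[url] += 1
    # Output step: walk the files table again and emit the counters.
    return [(url, counters[url]) for url in file_urls]


result = count_occurrences(["/a", "/b", "/c"], ["/a", "/x", "/a", "/c"])
never_accessed = [url for url, hits in result if hits == 0]
```

Files with a zero counter are exactly the ones the NOT EXISTS query was meant to find.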

Page 117: Visualizing Mirrors - USTC

.. Simulating Hash JOIN (continued)

Since there might be many files (currently only about 10M), storing the hash table entirely in memory is my top concern.
Use one hash for the position and two additional hashes for checking, and do not store the original string. There is a 10^-22 probability of an undetected hash collision, which is low enough for an analytic program.
Only 12 bytes are required for each slot (2 * sizeof(int) for the check hashes, sizeof(int) for the counter).
Hash algorithm: DJBX33A (hash = ((hash << 5) + hash) + *key++), used by PHP, Apache and more. Use three seeds for the position and check hashes.
Hash collision: linearly find the next empty slot. This is simple (I am afraid of pointers) and memory-saving.
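
A compact table along these lines might look as follows in Python. This is a sketch, not the C program from the slides. The seed values, the class and method names, and the convention that a (0, 0) check pair marks an empty slot are all assumptions.

```python
from array import array

MASK32 = 0xFFFFFFFF
# Seeds for the position hash and the two check hashes (illustrative values).
SEEDS = (5381, 0x9E3779B9, 0x85EBCA6B)


def djbx33a(key, seed):
    """hash = ((hash << 5) + hash) + byte, i.e. hash * 33 + byte, in 32 bits."""
    h = seed
    for byte in key:
        h = ((h << 5) + h + byte) & MASK32
    return h


class CompactCounter:
    """Open-addressing table: two check hashes and one counter per slot."""

    def __init__(self, n_slots):
        self.n = n_slots
        # 12 bytes per slot where array('I') items are 4 bytes wide.
        self.check1 = array("I", [0]) * n_slots
        self.check2 = array("I", [0]) * n_slots
        self.count = array("I", [0]) * n_slots

    def _locate(self, key):
        """Return (slot, c1, c2): the key's slot, the first empty slot, or -1."""
        c1, c2 = djbx33a(key, SEEDS[1]), djbx33a(key, SEEDS[2])
        if c1 == 0 and c2 == 0:
            c1 = 1  # (0, 0) is reserved to mean "empty" in this sketch
        pos = djbx33a(key, SEEDS[0]) % self.n
        for _ in range(self.n):
            stored = (self.check1[pos], self.check2[pos])
            if stored == (c1, c2) or stored == (0, 0):
                return pos, c1, c2
            pos = (pos + 1) % self.n  # linear probing
        return -1, c1, c2

    def add(self, key):
        """Build step: register a url from the files table."""
        pos, c1, c2 = self._locate(key)
        if pos < 0:
            raise MemoryError("hash table is full")
        self.check1[pos], self.check2[pos] = c1, c2

    def hit(self, key):
        """Probe step: count one request if the url is known."""
        pos, c1, c2 = self._locate(key)
        if pos >= 0 and (self.check1[pos], self.check2[pos]) == (c1, c2):
            self.count[pos] += 1

    def get(self, key):
        """Output step: number of hits recorded for a url."""
        pos, c1, c2 = self._locate(key)
        if pos >= 0 and (self.check1[pos], self.check2[pos]) == (c1, c2):
            return self.count[pos]
        return 0
```

One caveat about this sketch: changing the DJBX33A seed only adds seed × 33^len to the hash, so two keys of equal length that collide under one seed collide under every seed. The check hashes are therefore less independent than a figure like 10^-22 assumes.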

Page 118: Visualizing Mirrors - USTC

.. The End

I intended to finish it in 3 days, but because I am unfamiliar with these tools, it took me a week (or more, including preparation time).
Access logs are the origin of discoveries about access patterns. Data is precious, while disk is cheap. Please do not delete them after 52 days.
I cannot draw conclusions on 'trends', since the data does not cover an adequately long time.
If you want other statistics, please email me and I will query them (if I have time :).

Page 119: Visualizing Mirrors - USTC

.. Closing Words

Thanks to all maintainers and supporters of mirrors.ustc.edu.cn!
All scripts and the source of these slides are available on GitHub: https://github.com/bojieli/mirrors-log
I finally sorted out the Chinese font problem, but I am too lazy to translate the slides ……
