Analyse your SEO Data with R and Kibana

Date post: 10-Jan-2017
Upload: vincent-terrasi

June 10th, 2016

Vincent Terrasi

Vincent Terrasi


SEO Director - Groupe M6Web CuisineAZ, PasseportSanté, MeteoCity, …


Join the OVH adventure in July 2016

Blog : data-seo.com

Mission : Build a Real-Time Log Analysis Tool

1. Using Screaming Frog to crawl a website

2. Using R for SEO Analysis

3. Using PaasLogs to centralize logs

4. Using Kibana to build fancy dashboards

5. Test !


“The world is full of obvious things which nobody by any chance ever observes.”

Sherlock Holmes Quote

Real-Time Log Analysis Tool

• Screaming Frog

• Google Analytics

• R Crawler

• IIS Logs

• Apache Logs

• Nginx Logs

Using Screaming Frog

Screaming Frog : Export Data

Add your URL and click the Start button.

When the crawl is finished, click the Export button and save the XLSX file.

Screaming Frog : Data !

"Status Code"
"Title 1"
"Title 1 Length"
"Title 1 Pixel Width"
"Title 2"
"Title 2 Length"
"Title 2 Pixel Width"
"Meta Description 1"
"Meta Description 1 Length"
"Meta Description 1 Pixel Width"
"Meta Keyword 1"
"Meta Keywords 1 Length"
"H1-1 length"
"H1-2 length"
"H2-1 length"
"H2-2 length"
"Meta Robots 1"
"Meta Refresh 1"
"Canonical Link Element 1"
"Word Count"
"External Outlinks"
"Response Time"
"Last Modified"
"Redirect URI"
"GA Sessions"
"GA % New Sessions"
"GA New Users"
"GA Bounce Rate"
"GA Page Views Per Session"
"GA Avg Session Duration"
"GA Page Value"
"GA Goal Conversion Rate All"
"GA Goal Completions All"
"GA Goal Value All"

Using R

Why R ?


Big Community

Mac / PC / Unix

Open Source

7500 packages



WheRe ? How ?



Rgui RStudio

Using R : Step 1

Export All Urls




Packages :





R Examples

• Crawl via Screaming Frog

• Classify URLs by : Section, Load Time, Number of Inlinks

• Detect Active Pages : Min 1 visit per month

• Detect Compliant Pages : Canonical Not Equal, Meta No-index, Bad HTTP Status Code

• Detect Duplicate Meta


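R : read files

library(readxl)

# Read xlsx file
urls <- read_excel("internal_html_blog.xlsx",
                   sheet = 1,
                   col_names = TRUE)

# Read csv file (check.names = FALSE keeps column names such as `GA Sessions` unchanged)
urls <- read.csv2("internal_html_blog.csv", sep=";", header = TRUE, check.names = FALSE)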

Detect Active Pages


urls_select$Active <- FALSE

urls_select$Active[ which(urls_select$`GA Sessions` > 0) ] <- TRUE


urls_select$Active <- as.factor(urls_select$Active)
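
The same flag can be computed in one vectorised step. The sketch below is only an illustration on a made-up three-row data frame (the addresses and session counts are assumptions, not data from the talk); it also makes explicit that a page with no Google Analytics value (NA) is treated as inactive, which is what `which()` does implicitly above.

```r
# Toy crawl export (hypothetical data); the column name mirrors the Screaming Frog export
urls_select <- data.frame(
  Address = c("/a", "/b", "/c"),
  stringsAsFactors = FALSE
)
urls_select$`GA Sessions` <- c(12, 0, NA)

# Active = at least one session; NA (no GA data) counts as inactive
urls_select$Active <- !is.na(urls_select$`GA Sessions`) & urls_select$`GA Sessions` > 0
urls_select$Active <- as.factor(urls_select$Active)

print(table(urls_select$Active))
```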

Classify URLs by Section

library(stringi)

schemas <- read.csv("conf.csv", header = FALSE, col.names = "schema", stringsAsFactors = FALSE)

urls_select$Cat <- "no match"

# loop over the rows of conf.csv (one URL pattern per row)
for (j in seq_len(nrow(schemas))) {
  urls_select$Cat[ which(stri_detect_fixed(urls_select$Address, schemas$schema[j])) ] <- schemas$schema[j]
}
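
To try the section classification without a crawl file or the stringi package, here is a minimal base R sketch. The patterns and URLs are hypothetical stand-ins for the content of conf.csv and for the crawl export; `grepl(..., fixed = TRUE)` plays the role of `stri_detect_fixed`.

```r
# Hypothetical section patterns; in the slides they are read from conf.csv
schemas <- c("/blog/", "/recipes/", "/forum/")

# Hypothetical crawled URLs
urls_select <- data.frame(
  Address = c("http://example.com/blog/post-1",
              "http://example.com/recipes/tarte",
              "http://example.com/contact"),
  stringsAsFactors = FALSE
)

urls_select$Cat <- "no match"
for (schema in schemas) {
  # fixed = TRUE matches the pattern as a plain substring, not as a regex
  hit <- grepl(schema, urls_select$Address, fixed = TRUE)
  urls_select$Cat[hit] <- schema
}

print(urls_select)
```

When a URL matches several patterns, the last matching pattern wins, so the order of the patterns matters.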








Classify URLs By Load Time

urls_select$Speed <- NA

urls_select$Speed[ which(urls_select$`Response Time` < 0.501 ) ] <- "Fast"

urls_select$Speed[ which(urls_select$`Response Time` >= 0.501
                         & urls_select$`Response Time` < 1.001) ] <- "Medium"

urls_select$Speed[ which(urls_select$`Response Time` >= 1.001
                         & urls_select$`Response Time` < 2.001) ] <- "Slow"

urls_select$Speed[ which(urls_select$`Response Time` >= 2.001) ] <- "Slowest"

urls_select$Speed <- as.factor(urls_select$Speed)
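
The four assignments above can also be written with base R's `cut()`, which bins a numeric vector in one call. This is an alternative sketch, not the code from the talk; the response times are made-up values chosen to sit on and around the thresholds.

```r
# Hypothetical response times in seconds
response_time <- c(0.2, 0.501, 1.0, 1.5, 2.001, 3.7)

# right = FALSE gives intervals closed on the left: [0.501, 1.001), [1.001, 2.001), ...
speed <- cut(response_time,
             breaks = c(-Inf, 0.501, 1.001, 2.001, Inf),
             labels = c("Fast", "Medium", "Slow", "Slowest"),
             right = FALSE)

print(data.frame(response_time, speed))
```

`cut()` returns a factor directly, so no `as.factor()` call is needed afterwards.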

Classify URLs By Number of Inlinks

urls_select$`Group Inlinks` <- "URLs with No Follow Inlinks"

urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` < 1 ) ] <- "URLs with No Follow Inlinks"

urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` == 1 ) ] <- "URLs with 1 Follow Inlink"

urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` > 1
                                 & urls_select$`Inlinks` < 6) ] <- "URLs with 2 to 5 Follow Inlinks"

urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` >= 6
                                 & urls_select$`Inlinks` < 11 ) ] <- "URLs with 6 to 10 Follow Inlinks"

urls_select$`Group Inlinks`[ which(urls_select$`Inlinks` >= 11) ] <- "URLs with more than 10 Follow Inlinks"

urls_select$`Group Inlinks` <- as.factor(urls_select$`Group Inlinks`)

Detect Compliant Pages

# A page is NOT compliant when at least one of these holds:
# - Canonical not equal to the page address
# - Meta robots contains noindex
# - Bad HTTP status code (not 200)

urls_select$Compliant <- TRUE

urls_select$Compliant[ which(urls_select$`Status Code` != 200

| urls_select$`Canonical Link Element 1` != urls_select$Address

| urls_select$Status != "OK"

| grepl("noindex",urls_select$`Meta Robots 1`)

) ] <- FALSE

urls_select$Compliant <- as.factor(urls_select$Compliant)
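
A small self-contained sketch of the same rule, on hypothetical rows, shows how each condition flips the flag. One assumption is made explicit here: a page with no canonical tag (NA) is not penalised, which matches the behaviour of `which()` above for pages that pass the other checks.

```r
# Hypothetical crawl rows: ok, redirect, canonical mismatch, no canonical, noindex
urls_select <- data.frame(
  Address = c("/a", "/b", "/c", "/d", "/e"),
  stringsAsFactors = FALSE
)
urls_select$`Status Code` <- c(200, 301, 200, 200, 200)
urls_select$`Canonical Link Element 1` <- c("/a", "/b", "/other", NA, "/e")
urls_select$`Meta Robots 1` <- c("", "", "", "", "noindex, follow")

canonical <- urls_select$`Canonical Link Element 1`
not_compliant <-
  urls_select$`Status Code` != 200 |
  (!is.na(canonical) & canonical != urls_select$Address) |
  grepl("noindex", urls_select$`Meta Robots 1`, fixed = TRUE)

urls_select$Compliant <- !not_compliant
print(urls_select[, c("Address", "Compliant")])
```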

Detect Duplicate Meta

urls_select$`Status Title` <- 'Unique'

urls_select$`Status Title`[ which(urls_select$`Title 1 Length` == 0) ] <- "No Set"

urls_select$`Status Description` <- 'Unique'

urls_select$`Status Description`[ which(urls_select$`Meta Description 1 Length` == 0) ] <- "No Set"

urls_select$`Status H1` <- 'Unique'

urls_select$`Status H1`[ which(urls_select$`H1-1 Length` == 0) ] <- "No Set"

urls_select$`Status Title`[ which(duplicated(urls_select$`Title 1`)) ] <- 'Duplicate'

urls_select$`Status Description`[ which(duplicated(urls_select$`Meta Description 1`)) ] <- 'Duplicate'

urls_select$`Status H1`[ which(duplicated(urls_select$`H1-1`)) ] <- 'Duplicate'

urls_select$`Status Title` <- as.factor(urls_select$`Status Title`)

urls_select$`Status Description` <- as.factor(urls_select$`Status Description`)

urls_select$`Status H1` <- as.factor(urls_select$`Status H1`)
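
Note that `duplicated()` flags only the second and later occurrences of a value, so the first page carrying a repeated title stays "Unique". If every copy should be flagged, one possible variant (a sketch on made-up titles, not the code from the talk) combines a forward and a backward pass, and applies the "No Set" label last so that empty titles are not reported as duplicates of each other.

```r
# Hypothetical page titles; the empty string stands for a missing title
titles <- c("Home", "Recipes", "Home", "", "Contact")

status <- rep("Unique", length(titles))

# Flag every member of a duplicated group, including its first occurrence
is_dup <- duplicated(titles) | duplicated(titles, fromLast = TRUE)
status[is_dup] <- "Duplicate"

# Applied last so that missing titles keep their own label
status[nchar(titles) == 0] <- "No Set"

print(data.frame(titles, status))
```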

Generate CSV

library(dplyr)

urls_light <- select(urls_select, Address, Cat, Active, Speed, Compliant, Level, Inlinks)

colnames(urls_light) <- c("request","section","active","speed","compliant","depth","inlinks")

# write.csv2 takes the data frame first, then the file name; it uses ";" as separator
write.csv2(urls_light, "file.csv", row.names = FALSE)

Package dplyr : select and mutate

Edit colnames

Use write.csv2
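
To check what the exported file looks like before shipping it to the log platform, a quick round trip through a temporary file can help. The data frame below is a hypothetical two-row extract with only four of the columns; the point is simply that `write.csv2` separates fields with semicolons and that `read.csv2` reads the file back with the same column names.

```r
# Hypothetical two-row extract of the light crawl table
urls_light <- data.frame(
  request = c("/blog/post-1", "/contact"),
  section = c("/blog/", "no match"),
  active  = c(TRUE, FALSE),
  speed   = c("Fast", "Slow"),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
write.csv2(urls_light, csv_path, row.names = FALSE)

# Inspect the raw lines, then read the file back
csv_lines <- readLines(csv_path)
print(csv_lines)

check <- read.csv2(csv_path, stringsAsFactors = FALSE)
print(check)
```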

R : ggplot2 command


Create the ggplot object and populate it with data (always a data frame)

ggplot( mydata, aes( x=section,y=count, fill=active ))


Add layer(s)

+ geom_point()


Used for conditioning on variable(s)

+ facet_grid(~rescode)

ggplot2 : Geometry

R Chart : Active Pages

urls_level_active <- group_by(urls_select, Level, Active) %>%
  summarise(count = n())

# aes() sets the aesthetics, geom_bar() sets the geometry
p <- ggplot(urls_level_active, aes(x=Level, y=count, fill=Active) ) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values=c("#e5e500", "#4DBD33")) +
  labs(x = "Depth", y ="Crawled URLs")

# save in file
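
The counting step behind this chart (URLs per depth and per active flag) can be checked without dplyr or ggplot2. The sketch below uses base R's `aggregate()` on six hypothetical URLs; it is an equivalent of the `group_by()` / `summarise(count = n())` step only, not of the plot.

```r
# Hypothetical crawl: depth level and active flag for six URLs
urls_select <- data.frame(
  Level  = c(1, 2, 2, 3, 3, 3),
  Active = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# One helper column of ones, summed per (Level, Active) group
urls_select$count <- 1
urls_level_active <- aggregate(count ~ Level + Active, data = urls_select, FUN = sum)

print(urls_level_active)
```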


R Chart : GA Sessions

urls_cat_gasessions <- aggregate( urls_select$`GA Sessions`, by=list(Cat=urls_select$Cat, urls_select$Compliant), FUN=sum, na.rm=TRUE)

colnames(urls_cat_gasessions) <- c("Category","Compliant","GA Sessions")

p <- ggplot(urls_cat_gasessions, aes(x=Category, y=`GA Sessions`, fill=Compliant)) +
  geom_bar(stat = "identity", position = "stack") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Section", y ="Sessions")

# save in file


R Chart : Compliant

urls_cat_compliant_statuscode <- group_by(urls_select, Cat, Compliant, `Status Code`) %>%
  summarise(count = n()) %>%
  # keep only the 200 and 301 responses
  filter(`Status Code` %in% c(200, 301))

p <- ggplot(urls_cat_compliant_statuscode, aes(x=Cat, y=count, fill=Compliant) ) +
  geom_bar(stat = "identity", position = "stack") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  facet_grid(`Status Code` ~ .) +
  labs(x = "Section", y ="Crawled URLs")


R : SEO Cheat Sheet

Package dplyr

select() keeps only the columns you name, so you can quickly zoom in on a useful subset of variables

mutate() adds new columns to a data frame or replaces existing ones

filter() selects a subset of rows in a data frame

Package ggplot2

aes - geom

Package readxl




Architecture

Hard to monitor and optimize host server performance

Architecture

Using PaasLogs

PaasLogs

PaasLogs

164 nodes in the Elasticsearch cluster

180 connected machines

Between 100,000 and 300,000 logs processed per second

12 billion logs pass through every day

211 billion documents stored

8 clicks and 3 copy/pastes to use it !

PaasLogs : Step 1

PaasLogs : Step 2

PaasLogs

PaasLogs : Streams

Streams are the recipients of your logs. When you send a log with the right stream token, it automatically arrives in your stream, inside an awesome piece of software named Graylog.

PaasLogs : Dashboards

The Dashboard is the global view of your logs. It is an efficient way to exploit your logs and to view global information, such as metrics and trends in your data, without being overwhelmed by log details.

PaasLogs : Aliases

Aliases allow you to access your data directly from your Kibana or through an Elasticsearch query.


PaasLogs : Inputs

Inputs allow you to ask OVH to host your own dedicated collector, such as Logstash or Flowgger.

PaasLogs : Network Configuration

PaasLogs : Plugins Logstash

OVHCOMMONAPACHELOG %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion_num:float})?|%{DATA:rawrequest})" %{NUMBER:response_int:int} (?:%{NUMBER:bytes_int:int}|-)
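
To see what this grok pattern extracts, the same fields can be pulled out of a log line with a plain regular expression in R. This is only an illustration: the log line is made up, the regex is a simplified stand-in for the grok sub-patterns (it does not handle the `rawrequest` fallback), and the field names are copied from the pattern above.

```r
# Hypothetical access log line in Apache common log format
log_line <- '66.249.66.1 - - [10/Jun/2016:13:55:36 +0200] "GET /blog/post-1 HTTP/1.1" 200 5123'

# Simplified regex: ip, ident, auth, [timestamp], "verb request HTTP/version", status, bytes
pattern <- '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] "(\\S+) (\\S+)(?: HTTP/([0-9.]+))?" ([0-9]{3}) ([0-9]+|-)$'

# m[1] is the full match, m[2] to m[10] are the capture groups
m <- regmatches(log_line, regexec(pattern, log_line, perl = TRUE))[[1]]

entry <- list(
  clientip     = m[2],
  timestamp    = m[5],
  verb         = m[6],
  request      = m[7],
  response_int = as.integer(m[9]),
  bytes_int    = as.integer(m[10])
)
str(entry)
```

The `request` field is the one that matters most for the rest of the pipeline, since it is the key shared with the crawl CSV.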


PaasLogs : Config Logstash

if [type] == "apache" {
  grok {
    match => [ "message", "%{OVHCOMBINEDAPACHELOG}"]
    patterns_dir => "/opt/logstash/patterns"
  }
}

if [type] == "csv_infos" {
  csv {
    # same column order as the CSV generated with R
    columns => ["request", "section", "active", "speed", "compliant", "depth", "inlinks"]
    separator => ";"
  }
}

How to send Logs to PaasLogs ?

Use Filebeat

Filebeat : Install

Install Filebeat

curl -L -O https://download.elastic.co/beats/filebeat/filebeat_1.2.1_amd64.deb

sudo dpkg -i filebeat_1.2.1_amd64.deb

Or see https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-installation.html

Edit filebeat.yml

nano /etc/filebeat/filebeat.yml

> Set the path of the log file

> Set the host : c002-5717e1b5d2ee5e00095cea38.in.laas.runabove.com

Copy Key.crt

sudo /etc/init.d/filebeat start

sudo /etc/init.d/filebeat stop

If there is an error, check the logs

tail -f /var/log/filebeat.log


Filebeat : Edit filebeat.yml






filebeat:
  prospectors:
    -
      paths:
        - /home/ubuntu/lib/apache2/log/access.log
      input_type: log
      fields_under_root: true
      document_type: apache
    -
      paths:
        - /home/ubuntu/workspace/csv/crawled-urls-filebeat-*.csv
      input_type: csv
      fields_under_root: true
      document_type: csv_infos

output:
  logstash:
    hosts: ["c002-5717e1b5d2ee5e00095cea38.in.laas.runabove.com:5044"]
    worker: 1
    tls:
      certificate_authorities: ["/home/ubuntu/workspace/certificat/key.crt"]

Filebeat : Start


Copy / Paste Key.crt








Start Filebeat

sudo /etc/init.d/filebeat start

sudo /etc/init.d/filebeat stop

How to combine multiple sources ?

PaasLogs : Plugins ES


Description : Copies fields from previous log events in Elasticsearch to current events

if [type] == "apache" {

elasticsearch {

hosts => "laas.runabove.com"

index => "logsDataSEO" # alias

ssl => true

query => 'type:csv_infos AND request: "%{[request]}"'

fields => [["speed","speed"],["compliant","compliant"],





# TIP : fields => [[src,dest],[src,dest]]
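
Conceptually, this filter performs a lookup join: for each Apache log event, it searches the crawl documents (type csv_infos) with the same `request` and copies selected fields onto the event. The R sketch below mimics that logic offline with `merge()` on two hypothetical data frames; it illustrates the idea and is not how Logstash implements it.

```r
# Hypothetical crawl data (what the csv_infos documents contain)
crawl <- data.frame(
  request   = c("/blog/post-1", "/recipes/tarte"),
  speed     = c("Fast", "Slow"),
  compliant = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical parsed Apache log events
hits <- data.frame(
  request      = c("/blog/post-1", "/unknown", "/recipes/tarte", "/blog/post-1"),
  response_int = c(200, 404, 200, 200),
  stringsAsFactors = FALSE
)

# Left join: every log event is kept; URLs missing from the crawl get NA
enriched <- merge(hits, crawl, by = "request", all.x = TRUE)
print(enriched)
```

URLs that appear in the logs but not in the crawl (here `/unknown`) keep empty crawl fields, which is a convenient way to spot pages that bots request but the crawler never found.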

Using Kibana

Kibana : Install


Download Kibana 4.1

• Download and unzip Kibana 4

• Extract your archive

• Open config/kibana.yml in an editor

• Set the elasticsearch.url to point at your Elasticsearch instance

• Run ./bin/kibana (or bin\kibana.bat on Windows)

• Point your browser at http://yourhost.com:5601

Kibana : Edit kibana.yml

Update kibana.yml

server.port: 8080

server.host: ""

elasticsearch.url: "https://laas.runabove.com:9200"

elasticsearch.preserveHost: true

kibana.index: "ra-logs-33078"

kibana.defaultAppId: "discover"

elasticsearch.username: "ra-logs-33078"

elasticsearch.password: "rHftest6APlolNcc6"

Kibana : Line Chart


Number of active pages crawled by Google over a period of time
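
The metric behind this chart can be prototyped in R before building the visualization. The sketch below assumes enriched log events with a `day`, a `request`, the `active` flag from the crawl, and a `googlebot` flag; that last column is a hypothetical helper (for example derived from the user agent) and the data is made up.

```r
# Hypothetical enriched log events
hits <- data.frame(
  day       = as.Date(c("2016-06-08", "2016-06-08", "2016-06-09",
                        "2016-06-09", "2016-06-09", "2016-06-09")),
  request   = c("/a", "/b", "/a", "/a", "/c", "/d"),
  active    = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  googlebot = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep Googlebot hits on active pages, then count distinct pages per day
sel <- hits[hits$googlebot & hits$active, ]
per_day <- aggregate(request ~ day, data = sel,
                     FUN = function(x) length(unique(x)))
names(per_day)[2] <- "active_pages_crawled"

print(per_day)
```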

Kibana : Vertical Bar Chart


Kibana : Pie Chart

How to compare two periods ?

Kibana : Use Date Range


Final Architecture

PaasLogs Kibana





Soft RealTime


Old Logs




HA Proxy

Test yourself

Use Screaming Frog Spider Tool


Learn R




Test PaasLogs


Install Kibana


TODO List

- Create a GitHub Repository with all source code

- Add Plugin Logstash to do a reverse DNS lookup

- Schedule A Crawl By Command Line

- Upload Screaming Frog File to web server

Thank you

Keep in touch

@vincentterrasi Vincent Terrasi
