Date posted: 22-Jan-2018
Category: Technology
Uploaded by: devopsdays-tel-aviv
How our ISP cost us a full day of the entire R&D team
Lior Redlus
Co-founder and Chief Data Scientist
Coralogix
About Myself
• 32 years old; scientist at heart
• B.Sc. and M.Sc. in Neuroscience and Information Processing (BIU)
• Co-founder and Chief Data Scientist @ Coralogix
About Coralogix
• A Machine Learning powered scalable Log Analysis solution
• Log Management already included: indexing, querying, filtering, alerting etc.
• Coralogix Analytics:
• Turns your data into patterns and flows
• Gives you deep insights on your system
• Automatically detects production problems
• Finds system behavior changes between code deployments
Interacting with your logging data
• Coralogix provides 3 ways to get insights from your logs:
1. Coralogix Dashboard – a simple and powerful dashboard with machine learning capabilities
2. Elastic’s Kibana – with a rich query language and flexible visualizations
3. Elasticsearch API – for deep technical querying and aggregations
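As a sketch of option 3, a bucketed aggregation can be expressed directly against Elasticsearch's `_search` API. The index name, field names, and endpoint below are illustrative assumptions for this example, not Coralogix's actual schema:

```python
import json

# Hypothetical query body: count ERROR-severity logs per service over the
# last hour. "logs-*", "service.keyword", and "severity" are assumed names.
query = {
    "size": 0,  # we only want aggregation buckets, not matching documents
    "query": {
        "range": {"timestamp": {"gte": "now-1h"}}
    },
    "aggs": {
        "per_service": {
            "terms": {"field": "service.keyword", "size": 10},
            "aggs": {
                "errors": {"filter": {"term": {"severity": "ERROR"}}}
            }
        }
    }
}

# This body would be POSTed to the cluster's _search endpoint, e.g.:
#   curl -XPOST 'https://<cluster>:9200/logs-*/_search' \
#        -H 'Content-Type: application/json' -d @query.json
print(json.dumps(query, indent=2))
```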
Good product, happy customers!
• Everything worked smoothly for months
• Until we got a call from a customer (0.5TB / day)
• Some of their heavier dashboards could not be loaded
• They were not happy
• And neither were we
Well, of course. This makes no one happy!
• We reproduced the error message in our offices as well
Kibana – technical overview
[Architecture diagram]
• Customers connect over the public domain
• Kibana – an Angular.js client served by a Node.js server – listens on port 5601 (localhost)
• Our proprietary Kibana proxy listens on port 9200 and:
• Emulates Elasticsearch for Kibana
• Confines customers to only access their data
• Parses queries for various SLA restrictions
• Each component runs in its own Docker container
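One job of such a proxy – confining each customer to their own data – can be sketched as a pure path-rewriting step. The index-prefix convention and function names below are illustrative assumptions, not Coralogix's actual scheme:

```python
# Sketch of a tenant-isolation check a Kibana proxy might perform.
# Assumption (not Coralogix's real scheme): each customer's indices are
# prefixed with their customer ID, e.g. "acme-logs-2018.01".

def allowed_indices(customer_id: str, requested: list[str]) -> list[str]:
    """Return only the requested indices this customer may query."""
    prefix = f"{customer_id}-"
    return [idx for idx in requested if idx.startswith(prefix)]

def rewrite_search_path(customer_id: str, path: str) -> str:
    """Rewrite an Elasticsearch _search path, dropping foreign indices.

    Raises ValueError when nothing remains, so the proxy can answer 403
    instead of silently querying another tenant's data.
    """
    indices, _, suffix = path.lstrip("/").partition("/")
    kept = allowed_indices(customer_id, indices.split(","))
    if not kept:
        raise ValueError("no accessible indices in request")
    return "/" + ",".join(kept) + "/" + suffix

# e.g. rewrite_search_path("acme", "/acme-logs,globex-logs/_search")
#      keeps only "acme-logs" in the rewritten path
```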
So what could have gone wrong…?
• We looked into everything we could think of:
• Was the customer’s dashboard defined properly?
• It was.
• Was any indexed elasticsearch data corrupted?
• No.
• Was a large Kibana dashboard overloading our Kibana Proxy?
• Not according to the CPU and memory monitoring.
• Was there a hidden bug in our Kibana Proxy for certain queries?
• Replies seemed to be correct for every query we tested.
• Was any Docker container replaced recently, possibly with different settings?
• Yes, but no new settings had been introduced.
• Was any Docker networking bug (and there are many…) interacting here?
• Not any that we could find.
Everything looked perfect!
• However, we did have one odd finding:
• When we were connected to our VPN, all the problems disappeared!
• Late at night and disappointed, we decided to call it a day:
Connecting the dots…
• Returning home, we each loaded the dashboard, and to our surprise – everything worked!
• The same ISP served us and the customer, but not our homes.
• The new suspect was our Internet Service Provider!
Results – 1
• The next day, confident, we experimented:
• SSL vs. no SSL
• Kibana’s standard port 5601 vs. 443 https port
• Adding our Kibana CNAME to Cloudflare
• The results were staggering!
Results – 2
• Loading the dashboard without SSL through port 5601:
• Loading the same dashboard with SSL through port 443 and Cloudflare:
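An experiment like this can be scripted. The minimal harness below times repeated fetches so the two paths (direct to port 5601 vs. port 443 through Cloudflare) can be compared; the URL in the comment is a placeholder, not our real endpoint:

```python
import time
from statistics import mean
from typing import Callable

def time_requests(fetch: Callable[[], None], runs: int = 5) -> dict:
    """Call `fetch` repeatedly, recording per-run latency and failures."""
    latencies: list[float] = []
    failures = 0
    for _ in range(runs):
        start = time.monotonic()
        try:
            fetch()
        except Exception:
            failures += 1  # timeout / connection reset counts as a failure
        else:
            latencies.append(time.monotonic() - start)
    return {
        "runs": runs,
        "failures": failures,
        "mean_latency_s": mean(latencies) if latencies else None,
    }

# In the real experiment, `fetch` would load the dashboard, e.g.:
#   import urllib.request
#   time_requests(lambda: urllib.request.urlopen("http://kibana.example.com:5601/"))
#   time_requests(lambda: urllib.request.urlopen("https://kibana.example.com/"))
```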
Results, solution and conclusion
• The ISP was throttling our requests, causing timeouts and packet loss – eventually crashing heavily loaded dashboards
• Adding our Kibana to Cloudflare under port 443 solved our problems
• (though it could not bring back the full day our R&D team had already lost!)
• Conclusion: trust no one!
Questions?
• Please feel free to contact me directly:
Lior Redlus, Chief Data Scientist, [email protected]
One month free trial @ http://www.coralogix.com