Date posted: 22-Jan-2018
Category: Technology
Uploaded by: devopsdays-tel-aviv
How our ISP cost us a full day of the entire R&D team
Lior Redlus
Co-founder and Chief Data Scientist
Coralogix
About Myself
• 32 years old; scientist at heart
• B.Sc. and M.Sc. in Neuroscience and Information Processing (BIU)
• Co-founder and Chief Data Scientist @ Coralogix
About Coralogix
• A Machine Learning powered scalable Log Analysis solution
• Log Management already included: indexing, querying, filtering, alerting etc.
• Coralogix Analytics:
• Turns your data into patterns and flows
• Gives you deep insights on your system
• Automatically detects production problems
• Finds system behavior changes between code deployments
Interacting with your logging data
• Coralogix provides 3 ways to get insights from your logs:
1. Coralogix Dashboard – a simple and powerful dashboard with machine learning capabilities
2. Elastic’s Kibana – with a rich query language and flexible visualizations
3. Elasticsearch API – for deep technical querying and aggregations
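As a sketch of option 3, a bucketed aggregation can be expressed directly against Elasticsearch's `_search` API. The index name, field names, and endpoint below are illustrative assumptions for this example, not Coralogix's actual schema:

```python
import json

# Hypothetical query body: count ERROR-severity logs per service over the
# last hour. "logs-*", "service.keyword", and "severity" are assumed names.
query = {
    "size": 0,  # we only want aggregation buckets, not matching documents
    "query": {
        "range": {"timestamp": {"gte": "now-1h"}}
    },
    "aggs": {
        "per_service": {
            "terms": {"field": "service.keyword", "size": 10},
            "aggs": {
                "errors": {"filter": {"term": {"severity": "ERROR"}}}
            }
        }
    }
}

# This body would be POSTed to the cluster's _search endpoint, e.g.:
#   curl -XPOST 'https://<cluster>:9200/logs-*/_search' \
#        -H 'Content-Type: application/json' -d @query.json
print(json.dumps(query, indent=2))
```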
Good product, happy customers!
• Everything worked smoothly for months
• Until we got a call from a customer (0.5TB / day)
• Some of their heavier dashboards could not be loaded
• They were not happy
• And neither were we
Well, of course. This makes no one happy!
• We reproduced the error message in our offices as well
Kibana – technical overview
[Architecture diagram]
• Customers connect over the public domain
• Kibana – an Angular.js client served by a Node.js server – listens on port 5601 (localhost)
• Our proprietary Kibana proxy listens on port 9200 and:
• Emulates Elasticsearch for Kibana
• Confines customers to only access their data
• Parses queries for various SLA restrictions
• Each component runs in its own Docker container
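One job of such a proxy – confining each customer to their own data – can be sketched as a pure path-rewriting step. The index-prefix convention and function names below are illustrative assumptions, not Coralogix's actual scheme:

```python
# Sketch of a tenant-isolation check a Kibana proxy might perform.
# Assumption (not Coralogix's real scheme): each customer's indices are
# prefixed with their customer ID, e.g. "acme-logs-2018.01".

def allowed_indices(customer_id: str, requested: list[str]) -> list[str]:
    """Return only the requested indices this customer may query."""
    prefix = f"{customer_id}-"
    return [idx for idx in requested if idx.startswith(prefix)]

def rewrite_search_path(customer_id: str, path: str) -> str:
    """Rewrite an Elasticsearch _search path, dropping foreign indices.

    Raises ValueError when nothing remains, so the proxy can answer 403
    instead of silently querying another tenant's data.
    """
    indices, _, suffix = path.lstrip("/").partition("/")
    kept = allowed_indices(customer_id, indices.split(","))
    if not kept:
        raise ValueError("no accessible indices in request")
    return "/" + ",".join(kept) + "/" + suffix

# e.g. rewrite_search_path("acme", "/acme-logs,globex-logs/_search")
#      keeps only "acme-logs" in the rewritten path
```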
So what could have gone wrong…?
• We looked into everything we could think of:
• Was the customer’s dashboard defined properly?
• It was.
• Was any indexed elasticsearch data corrupted?
• No.
• Was a large Kibana dashboard overloading our Kibana Proxy?
• Not according to the CPU and memory monitoring.
• Was there a hidden bug in our Kibana Proxy for certain queries?
• Replies seemed to be correct for every query we tested.
• Was any Docker container replaced recently, possibly with different settings?
• Yes, but no new settings had been introduced.
• Was any Docker networking bug (and there are many…) interacting here?
• Not any that we could find.
Everything looked perfect!
• However, we did have one odd finding:
• When we were connected to our VPN, all the problems disappeared!
• Late at night and disappointed, we decided to call it a day:
Connecting the dots…
• Returning home, we each loaded the dashboard, and to our surprise – everything worked!
• The same ISP served us and the customer, but not our homes.
• The new suspect was our Internet Service Provider!
Results – 1
• The next day, confident, we experimented:
• SSL vs. no SSL
• Kibana’s standard port 5601 vs. 443 https port
• Adding our Kibana CNAME to Cloudflare
• The results were staggering!
Results – 2
• Loading the dashboard without SSL through port 5601:
• Loading the same dashboard with SSL through port 443 and Cloudflare:
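An experiment like this can be scripted. The minimal harness below times repeated fetches so the two paths (direct to port 5601 vs. port 443 through Cloudflare) can be compared; the URL in the comment is a placeholder, not our real endpoint:

```python
import time
from statistics import mean
from typing import Callable

def time_requests(fetch: Callable[[], None], runs: int = 5) -> dict:
    """Call `fetch` repeatedly, recording per-run latency and failures."""
    latencies: list[float] = []
    failures = 0
    for _ in range(runs):
        start = time.monotonic()
        try:
            fetch()
        except Exception:
            failures += 1  # timeout / connection reset counts as a failure
        else:
            latencies.append(time.monotonic() - start)
    return {
        "runs": runs,
        "failures": failures,
        "mean_latency_s": mean(latencies) if latencies else None,
    }

# In the real experiment, `fetch` would load the dashboard, e.g.:
#   import urllib.request
#   time_requests(lambda: urllib.request.urlopen("http://kibana.example.com:5601/"))
#   time_requests(lambda: urllib.request.urlopen("https://kibana.example.com/"))
```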
Results, solution and conclusion
• The ISP was throttling our requests, causing timeouts and packet loss – eventually crashing heavily loaded dashboards
• Adding our Kibana to Cloudflare under port 443 solved our problems
• (though it could not bring back the full day our R&D team had already lost!)
• Conclusion: trust no one!
Questions?
• Please feel free to contact me directly:
Lior Redlus, Chief Data Scientist, [email protected]
One month free trial @ http://www.coralogix.com