Date posted: 18-Jul-2015
Category: Software
Uploaded by: christopher-batey
@chbatey
Fault tolerant microservices on the JVM
Christopher Batey DataStax @chbatey
Who am I?
• DataStax
- Technical Evangelist / Software Engineer
- Builds enterprise ready version of Apache Cassandra
• Sky: Building next generation Internet TV platform
• Lots of time working on a test double for Apache Cassandra
Agenda
• Setting the scene
- What do we mean by a fault?
- What is a micro(ish)service?
- Monolith application vs the micro(ish)service
• A worked example
- Identify an issue
- Reproduce/test it
- Show how to deal with the issue
So… what do applications look like?
Small horizontally scalable services
• Move to small, independently deployed services
- Login service
- Device service
- etc.
• Move to a horizontally scalable database that can run active-active in multiple data centres
So... what do systems look like now?
Example: Movie player service
[Diagram: a Play Movie request goes to the Movie Player, which calls the User Service, Device Service, and Pin Service]
Time for an example...
• All examples are on GitHub
• Technologies used:
- Dropwizard
- Spring Boot
- Wiremock
- Hystrix
- Graphite
- Saboteur
Testing microservices
• You don’t know a service is fault tolerant if you don’t test faults
The test double
• Wiremock for HTTP integration
• Stubbed Cassandra for the database
• Kafka Unit
Isolated service tests
[Diagram: the acceptance test primes mocks of the User, Device, and Pin services, then sends Play Movie to the Movie service over real HTTP/TCP]
Fault tolerance
1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off
1 - Don’t take forever
• If at first you don’t succeed, don’t take forever to tell someone
• Timeout and fail fast
Which timeouts?
• Socket connection timeout
• Socket read timeout
Your service hung for 30 seconds :(
Customer
You :(
Which timeouts?
• Socket connection timeout
• Socket read timeout
• Resource acquisition
Your service hung for 10 minutes :(
Let’s think about this
A little more detail
Adding an automated test
• Vagrant - launches and provisions local VMs
• Saboteur - uses tc and iptables to simulate network issues
• Wiremock - mocks HTTP dependencies
• Cucumber - acceptance tests
I can write an automated test for that?
[Diagram: the acceptance test primes Saboteur to drop traffic, exercises the Movie Service, then resets; Wiremock (User Service, Device Service, Pin Service) and Saboteur run in a Vagrant + VirtualBox VM]
Implementing reliable timeouts
• Protect the container thread!
• Homemade: Worker Queue + Thread pool (executor)
• Hystrix
• Spring Cloud Netflix
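The homemade option (worker queue plus thread pool) can be sketched roughly as follows. This is only an illustration in plain Java, not the code from the talk's repositories: the class name TimeoutGuard, its constructor parameters, and the idea of returning a caller-supplied fallback are all assumptions. The dependency call runs on a worker thread, so the container thread waits no longer than the timeout it asked for.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a dependency call on a worker pool so the caller can give up on time. */
final class TimeoutGuard {
    private final ThreadPoolExecutor workers;

    TimeoutGuard(int threads, int queueSize) {
        // Fixed-size pool with a bounded queue; the default handler rejects work when both are full.
        this.workers = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    /** Returns the dependency's result, or the fallback on timeout, failure, or rejection. */
    <T> T call(Callable<T> dependencyCall, long timeoutMillis, T fallback) {
        Future<T> future;
        try {
            future = workers.submit(dependencyCall);
        } catch (RejectedExecutionException e) {
            return fallback; // no thread and no queue slot available
        }
        try {
            // The calling (container) thread blocks here for at most timeoutMillis.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it can be reused
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the dependency call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    void shutdown() {
        workers.shutdownNow();
    }
}
```

A call such as guard.call(() -> slowDependency(), 100, "fallback"), where slowDependency is whatever blocking call you are protecting, returns after roughly 100 ms even if the dependency hangs. Note that cancel(true) only interrupts the worker; a call that ignores interruption will keep occupying that thread.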
A simple Spring RestController
@RestController
public class Resource {
    private static final Logger LOGGER = LoggerFactory.getLogger(Resource.class);

    @Autowired
    private ScaryDependency scaryDependency;

    @RequestMapping("/scary")
    public String callTheScaryDependency() {
        LOGGER.info("Resource later: I wonder which thread I am on!");
        return scaryDependency.getScaryString();
    }
}
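Scary dependency
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {
    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    public String getScaryString() {
        LOGGER.info("Scary Dependency: I wonder which thread I am on! Tomcats?");
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            try {
                Thread.sleep(5000); // simulate a slow dependency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
            }
            return "Slow Scary String";
        }
    }
}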
All on the tomcat thread
13:47:20.200 [http-8080-exec-1] INFO info.batey.examples.Resource - Resource later: I wonder which thread I am on!
13:47:20.200 [http-8080-exec-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on! Tomcats?
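Scary dependency
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {
    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    @HystrixCommand() // the only change: run this method as a Hystrix command
    public String getScaryString() {
        LOGGER.info("Scary Dependency: I wonder which thread I am on! Tomcats?");
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            try {
                Thread.sleep(5000); // simulate a slow dependency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
            }
            return "Slow Scary String";
        }
    }
}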
What an annotation can do...
13:51:21.513 [http-8080-exec-1] INFO info.batey.examples.Resource - Resource later: I wonder which thread I am on!
13:51:21.614 [hystrix-ScaryDependency-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on! Tomcats? :P
Async libraries are your friend
• DataStax Java Driver
- Guava ListenableFuture
Timeouts take home
• You can’t use network level timeouts for SLAs
• Test your SLAs - if someone says you can’t, hit them with a stick
• Scary things happen without network issues
2 - Don’t try if you can’t succeed
Complexity
“When an application grows in complexity it will eventually start sending emails”
Complexity
“When an application grows in complexity it will eventually start using queues and thread pools”
Don’t try if you can’t succeed
• Executors factory methods are unbounded :(
- newFixedThreadPool (unbounded queue)
- newSingleThreadExecutor (unbounded queue)
- newCachedThreadPool (unbounded number of threads)
• Bound your queues and threads
• Fail quickly when the queue size / maxPoolSize is reached
• Know your drivers
This is a functional requirement
• Set the timeout very high
• Use Wiremock to add a large delay to the requests
• Set queue size and thread pool size to 1
• Send in 2 requests to use the thread and fill the queue
• What happens on the 3rd request?
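The same experiment can be tried without any HTTP at all, directly against a ThreadPoolExecutor. The sketch below is an assumption-laden stand-in for the real test: the class name BoundedPoolCheck is made up, and a CountDownLatch plays the role of the delayed Wiremock response. With one thread and one queue slot, the third submission should be rejected immediately rather than waiting.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical check: does a pool of size 1 with a queue of size 1 reject the third request? */
final class BoundedPoolCheck {

    static boolean thirdRequestIsRejected() {
        // One worker thread and one queue slot; the default handler throws on overflow.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1));
        CountDownLatch release = new CountDownLatch(1); // stands in for the slow dependency
        Runnable slowRequest = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            pool.execute(slowRequest);     // request 1 occupies the only thread
            pool.execute(slowRequest);     // request 2 fills the only queue slot
            try {
                pool.execute(slowRequest); // request 3 has nowhere to go
                return false;
            } catch (RejectedExecutionException expected) {
                return true;               // failed fast instead of queueing
            }
        } finally {
            release.countDown();           // let the blocked requests finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("third request rejected: " + thirdRequestIsRejected());
    }
}
```

In a service test the rejection would typically surface as an immediate error response or a fallback, which is what the acceptance test should assert on.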
3 - Fail gracefully
Expect rubbish
• Expect invalid HTTP
• Expect malformed response bodies
• Expect connection failures
• Expect huge / tiny responses
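One way to act on this list is to parse response bodies strictly and treat anything unexpected as "no answer" rather than as a valid value. The sketch below assumes, purely for illustration, that the Pin Service replies with the text true or false; PinCheckParser and MAX_BODY_CHARS are hypothetical names, and the size limit is an arbitrary choice.

```java
import java.util.Optional;

/** Hypothetical strict parser for a pin check body that should be "true" or "false". */
final class PinCheckParser {
    // Assumption: a valid body is tiny, so anything longer is treated as rubbish.
    private static final int MAX_BODY_CHARS = 16;

    /** Returns the parsed value, or empty if the body is missing, oversized, or malformed. */
    static Optional<Boolean> parse(String body) {
        if (body == null || body.length() > MAX_BODY_CHARS) {
            return Optional.empty();
        }
        String trimmed = body.trim();
        if (trimmed.equalsIgnoreCase("true")) {
            return Optional.of(Boolean.TRUE);
        }
        if (trimmed.equalsIgnoreCase("false")) {
            return Optional.of(Boolean.FALSE);
        }
        return Optional.empty(); // e.g. an HTML error page or an empty body
    }
}
```

The design point is that the caller is forced to decide what an empty result means, instead of a lenient parser quietly mapping rubbish to false.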
Testing with Wiremock
stubFor(get(urlEqualTo("/dependencyPath"))
        .willReturn(aResponse()
                .withFault(Fault.MALFORMED_RESPONSE_CHUNK)));

{
  "request": { "method": "GET", "url": "/fault" },
  "response": { "fault": "RANDOM_DATA_THEN_CLOSE" }
}

{
  "request": { "method": "GET", "url": "/fault" },
  "response": { "fault": "EMPTY_RESPONSE" }
}
Stubbed Cassandra
4 - Know if it’s your fault
Record stuff
• Metrics:
- Timings
- Errors
- Concurrent incoming requests
- Thread pool statistics
- Connection pool statistics
• Logging: Boundary logging, ElasticSearch / Logstash
• Request identifiers
Zipkin from Twitter
Graphite + Codahale
Response times
Separate resource pools
• Don’t flood your dependencies
• Be able to answer the questions:
- How many connections will you make to dependency X?
- Are you getting close to your max connections?
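A minimal way to get a separate, measurable pool per dependency is a counting semaphore in front of each one. This is a sketch under assumptions, not how any particular library does it: DependencyLimiter and its methods are hypothetical names. It answers both questions above, since max() is the most calls you will ever make concurrently and inFlight() shows how close you are to that limit.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical per-dependency limiter: caps concurrent calls and reports current usage. */
final class DependencyLimiter {
    private final int maxConcurrent;
    private final Semaphore permits;

    DependencyLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free; otherwise returns the fallback without waiting. */
    <T> T call(Supplier<T> dependencyCall, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // at the limit: do not queue and do not flood the dependency
        }
        try {
            return dependencyCall.get();
        } finally {
            permits.release();
        }
    }

    /** Number of calls currently in progress against this dependency. */
    int inFlight() {
        return maxConcurrent - permits.availablePermits();
    }

    /** Upper bound on concurrent calls to this dependency. */
    int max() {
        return maxConcurrent;
    }
}
```

Creating one limiter per dependency (one for the User Service, one for the Pin Service, and so on) keeps a slow dependency from using up capacity meant for the others, and inFlight() is a natural value to publish as a gauge.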
So easy with Dropwizard + Hystrix
metrics:
  reporters:
    - type: graphite
      host: 192.168.10.120
      port: 2003
      prefix: shiny_app

@Override
public void initialize(Bootstrap<AppConfig> appConfigBootstrap) {
    HystrixCodaHaleMetricsPublisher metricsPublisher =
            new HystrixCodaHaleMetricsPublisher(appConfigBootstrap.getMetricRegistry());
    HystrixPlugins.getInstance().registerMetricsPublisher(metricsPublisher);
}
5 - Don’t whack a dead horse
[Diagram: a Play Movie request goes to the Movie Player, which depends on the User Service, Device Service, and Pin Service]
What to do…
• Yes this will happen…
• Mandatory dependency - fail *really* fast
• Throttling
• Fallbacks
Circuit breaker pattern
Implementation with Hystrix
@Path("integrate")
public class IntegrationResource {
    private static final Logger LOGGER = LoggerFactory.getLogger(IntegrationResource.class);

    @GET
    @Timed
    public String integrate() {
        LOGGER.info("integrate");
        String user = new UserServiceDependency(userService).execute();
        String device = new DeviceServiceDependency(deviceService).execute();
        Boolean pinCheck = new PinCheckDependency(pinService).execute();
        return String.format("[User info: %s] \n[Device info: %s] \n[Pin check: %s] \n", user, device, pinCheck);
    }
}
Implementation with Hystrix
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

public class PinCheckDependency extends HystrixCommand<Boolean> {
    private HttpClient httpClient;

    public PinCheckDependency(HttpClient httpClient) {
        super(HystrixCommandGroupKey.Factory.asKey("PinCheckService"));
        this.httpClient = httpClient;
    }

    @Override
    protected Boolean run() throws Exception {
        HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
        HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
        int statusCode = pinCheckResponse.getStatusLine().getStatusCode();
        if (statusCode != 200) {
            throw new RuntimeException("Oh dear no pin check, status code " + statusCode);
        }
        String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
        return Boolean.valueOf(pinCheckInfo);
    }

    // Added on the second slide: the value returned when run() fails, times out, or is short-circuited
    @Override
    public Boolean getFallback() {
        return true;
    }
}
Triggering the fallback
• Error threshold percentage
• Bucket of time for the percentage
• Minimum number of requests to trigger
• Time before trying a request again
• Disable
• Per instance statistics
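To make the first four settings concrete, here is a deliberately simplified circuit breaker in plain Java. It is not Hystrix's implementation: the class SimpleCircuitBreaker, its parameter names, the single counting window (rather than rolling buckets), and the injectable millisecond clock are all assumptions made for the sake of a short, testable sketch.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker; synchronized for brevity, so calls are serialised. */
final class SimpleCircuitBreaker {
    private final int errorThresholdPercent; // open once errors reach this percentage
    private final int minimumRequests;       // ignore the percentage below this many calls
    private final long windowMillis;         // period over which calls are counted
    private final long sleepMillis;          // time to stay open before allowing a trial call
    private final LongSupplier clock;        // time source in milliseconds, injectable for tests

    private long windowStart;
    private int total;
    private int errors;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int errorThresholdPercent, int minimumRequests,
                         long windowMillis, long sleepMillis, LongSupplier clock) {
        this.errorThresholdPercent = errorThresholdPercent;
        this.minimumRequests = minimumRequests;
        this.windowMillis = windowMillis;
        this.sleepMillis = sleepMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    synchronized <T> T call(Supplier<T> command, Supplier<T> fallback) {
        long now = clock.getAsLong();
        if (open && now - openedAt < sleepMillis) {
            return fallback.get(); // short-circuit: the dependency is not called at all
        }
        if (now - windowStart >= windowMillis) {
            resetWindow(now);      // start counting afresh
        }
        try {
            T result = command.get();
            total++;
            if (open) {            // the trial call succeeded, so close the circuit
                open = false;
                resetWindow(now);
            }
            return result;
        } catch (RuntimeException e) {
            total++;
            errors++;
            boolean overThreshold = total >= minimumRequests
                    && errors * 100 >= errorThresholdPercent * total;
            if (open || overThreshold) { // a failed trial call re-opens the circuit
                open = true;
                openedAt = now;
            }
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }

    private void resetWindow(long now) {
        windowStart = now;
        total = 0;
        errors = 0;
    }
}
```

With new SimpleCircuitBreaker(50, 20, 10_000, 5_000, System::currentTimeMillis), the circuit would open once at least 20 calls in a 10 second window have a 50% or higher error rate, and a single trial call would be let through 5 seconds later. These numbers are illustrative only.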
6 - Turn off broken stuff
• The kill switch
To recap
1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off
Links
• Examples:
- https://github.com/chbatey/spring-cloud-example
- https://github.com/chbatey/dropwizard-hystrix
- https://github.com/chbatey/vagrant-wiremock-saboteur
• Tech:
- https://github.com/Netflix/Hystrix
- https://www.vagrantup.com/
- http://wiremock.org/
- https://github.com/tomakehurst/saboteur
Questions?
Thanks for listening!
Questions: @chbatey
http://christopher-batey.blogspot.co.uk/
http://www.eventbrite.com/e/cassandra-day-paris-france-2015-june-16th-2015-tickets-15053035033?aff=meetup1
http://www.eventbrite.com/e/cassandra-day-london-2015-april-22nd-2015-tickets-15053026006?aff=meetup1