Date post: | 29-Jan-2018 |
Upload: | sammarshallou |
Moodle at scale: why assigning a role can cause a catastrophe
sam marshall, The Open University
Contents
● Introduction to the Open University’s main Moodle infrastructure
●What went wrong with it
●How we found the problem
●Credits
●All the actual work was a team effort
● It wasn’t necessarily me who did the clever bits
●Particular mentions: Tim Hunt; Linux infrastructure team
●Audience participation: prepare your ominous noise
Open University VLE
● Moodle 3.1 (in October 2016, 3.0) plus many custom plugins
● Red Hat Enterprise Linux 7, PHP 5.6 (then 5.4) and Postgres 9.3
● Cluster of virtual servers
  ● 8 external (student) web servers
  ● 3 internal (staff) web servers
  ● 1 large database server (supposedly a failover pair)
System that provides websites for all our study modules
This simplified diagram misses out lots of things you probably don’t care about: file and memcache servers, front-end load balancers, etc.
21 March 2017, 20:13:47
Peak usage
As measured by counting entries in mdl_log
[Chart: median and 99th percentile log entries/s (axis 0 to 100) for each day from 1 Oct to 7 Oct, compared with Dec 1-7 (normal)]
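A minimal sketch of how figures like these could be derived, assuming the log timestamps have already been exported from mdl_log as Unix seconds. The function names and the nearest-rank percentile method are illustrative assumptions, not necessarily how the chart was produced.

```python
import math
from collections import Counter
from statistics import median


def per_second_rates(timestamps, start, end):
    """Count entries for every second in [start, end), including empty seconds."""
    counts = Counter(int(t) for t in timestamps if start <= t < end)
    return [counts.get(second, 0) for second in range(start, end)]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarise(timestamps, start, end):
    """Return (median, 99th percentile) of log entries per second."""
    rates = per_second_rates(timestamps, start, end)
    return median(rates), percentile(rates, 99)
```

Counting the empty seconds matters: leaving them out would overstate the median during quiet periods.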
Reliability
●Planned downtime for quarterly updates
●Other downtime rare
●Usually not our fault (network, authentication system, cosmic rays*)
●No problems noticed in spring and summer 2016
● [Insert ominous noise here]
* Not really – but we did have a memory failure on the virtual machine host at one point, which is pretty much the same thing
Normally very good
What happened on 2 October 2016?
Problem monitored
● Monitoring tools
  ● Splunk (commercial)
  ● TTM (internal)
  ● Nagios (open source)
● Graphs show median time for course view page
  ● When high it was typically returning an error message, not just slow
● Database server kept collapsing (very high load)
  ● Usually didn’t recover unless restarted
[Graphs: median course view page time on Mon Oct 2, Tue Oct 3 and Oct 6 (normal); annotation: normal usage volume]
Hypothesis: hardware
● Load not substantially larger than last year…
●…But we did move database servers from physical to virtual
●Database server: 8 cores, load average 300+
● Linux load average number conflates CPU and IO contention
●Not sure everyone understood this at the time (I didn’t)
●Doubled the number of cores to 16
●Made no difference
Was the hardware too slow to cope with the load?
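As a rough illustration of why the load average was hard to interpret, here is the arithmetic for the figures above. The helper name is hypothetical. On Linux the load average counts tasks waiting on uninterruptible IO as well as tasks waiting for CPU, so a high ratio does not by itself say that more cores will help.

```python
def load_per_core(load_average, cores):
    """Average number of runnable-or-blocked tasks per core."""
    return load_average / cores


# Figures from the slide: load average of 300 on 8 cores, then on 16 cores.
print(load_per_core(300, 8))   # 37.5
print(load_per_core(300, 16))  # 18.75
```

Even after doubling the cores, the ratio stays far above 1, which fits the observation that the extra cores made no difference. On a live Unix machine the same ratio can be read with `os.getloadavg()` and `os.cpu_count()`.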
Hypothesis: slow ForumNG query
● Looked at database query status (when not broken)
●Handful of long-running queries (>1 minute)
●Caused by ‘show usage’ feature (for tutors) in ForumNG
●Already identified as a problem, with fix in next release
●Changed Moodle role settings to prevent use of this feature
● [Insert ominous noise here]
●Made no difference
Was a long-running query in our forum to blame?
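One way to list long-running queries is to read Postgres's `pg_stat_activity` view. This is a sketch rather than the tooling actually used: it assumes Postgres 9.2 or later (where the view has `pid`, `state`, `query_start` and `query` columns) and a DB-API cursor that accepts pyformat parameters, such as one from psycopg2. The function name is hypothetical.

```python
# Assumption: PostgreSQL 9.2+ column names in pg_stat_activity.
LONG_RUNNING_SQL = """
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > %(threshold)s::interval
ORDER BY runtime DESC
"""


def long_running_queries(cursor, threshold="1 minute"):
    """Return (pid, runtime, query) rows for active queries older than threshold."""
    cursor.execute(LONG_RUNNING_SQL, {"threshold": threshold})
    return cursor.fetchall()
```

Having something like this ready in advance helps when the database is struggling and there is little time to look before a restart.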
Why does it only break in office hours?
Hypothesis: memcache flushing
●Moodle caches data, e.g. about courses
●Regenerating takes several seconds per course
●Our setup has MUC stored in ‘memcache’
●Each web server has its own cache
●Read locally (very fast), write to all the others
●Nagios shows memcache statistics
● cmd_flush increasing every few minutes
Was flushing of the Moodle MUC the cause?
●Bad effect if the cache is flushed:
● Likely 200+ different courses accessed in next minute
● If each one takes several seconds of database effort
●And this happens every few minutes…
Memcache flushing, continued
●Clearing any single area of MUC flushes the entire cache
●Memcache limitation (doesn’t apply using file cache, memcached, or redis)
●Patched in a hack to limit flushes to once per hour
●Also made it trace what was causing the (blocked) flushes
●Changes in Moodle 2.9 (unnoticed) meant some staff editing actions clear areas
●Preventing flush resulted in slightly incorrect behaviour but nothing serious
●Solution improved performance…
● The hourly flushes are visible spikes on our performance monitoring graphs
●…but it didn’t solve the problem
Why is the cache being flushed?
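The idea behind the flush-limiting hack can be sketched as follows. The real patch was made inside Moodle, so this Python version only illustrates the technique; the class name, method names and injectable clock are all assumptions.

```python
import time


class ThrottledFlusher:
    """Allow a full cache flush at most once per interval; record refused requests."""

    def __init__(self, flush, min_interval=3600.0, clock=time.monotonic):
        self._flush = flush                # callable that really flushes the cache
        self._min_interval = min_interval  # seconds between permitted flushes
        self._clock = clock
        self._last_flush = None
        self.blocked = []                  # (time, reason) for each refused flush

    def request_flush(self, reason):
        """Flush if allowed and return True; otherwise trace the request and return False."""
        now = self._clock()
        too_soon = (
            self._last_flush is not None
            and now - self._last_flush < self._min_interval
        )
        if too_soon:
            self.blocked.append((now, reason))
            return False
        self._flush()
        self._last_flush = now
        return True
```

Recording the reason for each blocked request is what makes it possible to see afterwards which actions were triggering flushes.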
Hypothesis: loading permissions
●We examined the database queries at the point where it died
●Difficult as database is obviously running at a crawl
● Trying to be quick - we had to restart it to get service back online
● Lots of queries loading user permissions
●Moodle loads permissions when you log in: get_user_access_sitewide.
● This is a necessarily slow query (> 300ms on our system)
● After investigation, we found that it reloads permissions if a relevant context is ‘dirty’ (using the mdl_cache_flags table, accesslib/dirtycontext)
● If the system context is dirty it will reload everyone’s permissions
● If there are 1,000 different users in a minute it will happen for all of them
Was Moodle loading user permissions too often?
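The behaviour described above can be modelled roughly like this. It is a simplified model, not Moodle's accesslib code: the path format, function names and the cost estimate are assumptions made for illustration.

```python
def is_affected(user_paths, dirty_path):
    """True if any of the user's context paths is the dirty context or lies below it."""
    return any(
        path == dirty_path or path.startswith(dirty_path + "/")
        for path in user_paths
    )


def needs_reload(loaded_at, user_paths, dirty_marks):
    """dirty_marks maps a context path to the time it was marked dirty."""
    return any(
        marked_at > loaded_at and is_affected(user_paths, path)
        for path, marked_at in dirty_marks.items()
    )


def reload_cost_seconds(users, query_ms):
    """Total database time if every one of these users reloads permissions once."""
    return users * query_ms / 1000


# Figures from the slide: 1,000 users in a minute, at least 300 ms per query.
print(reload_cost_seconds(1000, 300))  # 300.0
```

In this model every user's paths sit below the system context, so marking it dirty forces a reload for all of them. At 300 seconds of query time per minute of wall-clock time, that is about five of these queries running at once on average, on top of normal load.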
Loading permissions, continued
●System context is marked dirty for things like the following:
●Changing permissions for any role, as I did earlier to prevent the slow forum query
●Adding somebody to, or removing somebody from, a system-wide role
●Checked logs
●System-wide roles added several times a day (mainly various helpdesks)
● Found exact times
●Compared against exact times of system failures
●Almost a 1:1 match
●Patched in a hack to prevent it marking system context dirty
●Means changes at system context won’t take effect until logout
●Our system stayed up again!
Why is it loading permissions too often?
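The comparison of role-assignment times against failure times could be automated along these lines. This is a sketch: the five-minute window and the function name are assumptions, and the slides do not say how the matching was actually done.

```python
from datetime import timedelta


def match_events(causes, failures, window=timedelta(minutes=5)):
    """Map each failure time to the latest suspected cause within the window before it."""
    matches = {}
    for failure in failures:
        preceding = [
            cause for cause in causes
            if timedelta(0) <= failure - cause <= window
        ]
        matches[failure] = max(preceding) if preceding else None
    return matches
```

A failure mapped to `None` has no candidate cause in the window, so the share of failures with a match gives a quick measure of how close to 1:1 the correspondence is.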
Aftermath
Final thoughts
Effect on students
●System down, or slow, for several hours on two of the busiest days
●Students just starting courses – not a great impression
How bad was this?
Testing
● In a more relaxed atmosphere…
●Simulated high student use
●Assigned role at system level (1)
●Didn’t kill it but made a visible dent
●Cleared MUC (2, 7)
●Visible but minor performance impact
●Batch assigned 22 system level roles (8)
●Caused database to collapse, as seen on live
We reproduced the problem to fully verify the changes
The mysteriously unmentioned points 3-6 were similar actions but with the patches in place to prevent bad effects.
[Graph: requests handled and average time taken per request during the test]
Our actions
●Moodle enhancement for ‘dirty context’ problem / query performance
●MDL-49398 should improve performance of the query (thanks to Skylar Kelty from U of Kent)
● If that doesn’t fix it, we’ll consider an option to spread the refresh over several minutes
●Database connection pooling in our system
●Probably stop the database needing a restart
●Change cache configuration in our system
●Redis or memcached
How we can resolve these specific problems
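One possible shape for the "spread the refresh" option is to give each user a fixed delay derived from their ID, so that reloads are staggered rather than simultaneous. This is a hypothetical sketch, not a description of any Moodle change; the function names and the 300-second spread are assumptions.

```python
import hashlib


def refresh_delay(user_id, spread_seconds=300):
    """Deterministic per-user delay in [0, spread_seconds), derived from the user ID."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % spread_seconds


def should_refresh(user_id, marked_dirty_at, now, spread_seconds=300):
    """True once this user's share of the spread period has elapsed."""
    return now >= marked_dirty_at + refresh_delay(user_id, spread_seconds)
```

The trade-off is the same as with the patch described earlier, only milder: a permission change at system level would take up to the spread period to reach every user, instead of waiting until logout.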
Learning points
● False sense of security
●Normal operation (outside peak) unlikely to show performance problems
●We do automated load testing every release, but this only covers student usage
●We don’t know what might have changed in Moodle versions
●Should prepare for the peak
●Nobody really looks at the monitoring unless it goes wrong
●Should manually examine live system before peak period
My personal opinions about the incident
Conclusion
●Ensure you have, and use, monitoring tools
●Open source or commercial performance monitoring
●Make sure you can do query dumps on database
●Do performance testing (Apache JMeter, etc.)
●But don’t trust it exclusively
●Developers: Please consider performance carefully
●Avoid unnecessary performance surprises for administrators.
●Example: notifications change in Moodle 3.1 added an unnecessary AJAX request on every single page load (MDL-57968)
●Will this push OU systems over the edge next October?
General advice for anyone running a large Moodle system
Thanks for listening!