Moodle at scale: why assigning a role can cause a catastrophe

sam marshall, The Open University

Contents

● Introduction to the Open University’s main Moodle infrastructure

●What went wrong with it

●How we found the problem

●Credits

●All the actual work was a team effort

● It wasn’t necessarily me who did the clever bits

●Particular mentions: Tim Hunt; Linux infrastructure team

●Audience participation: prepare your ominous noise

Open University VLE

●Moodle 3.1 (in October 2016, 3.0) plus many custom plugins

●Red Hat Enterprise Linux 7, PHP 5.6 (then 5.4) and Postgres 9.3

●Cluster of virtual servers

● 8 external (student) web servers

● 3 internal (staff) web servers

● 1 large database server (supposedly a failover pair)

System that provides websites for all our study modules

This simplified diagram misses out lots of things you probably don’t care about: file and memcache servers, front-end load balancers, etc.

21 March 2017, 20:13:47

Peak usage

As measured by counting entries in mdl_log

[Chart: median and 99th percentile log entries/s for each day, 1 Oct to 7 Oct, compared with Dec 1-7 (normal); vertical axis 0 to 100]
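The figures behind a chart like this can be approximated with a short script. This is a minimal sketch, assuming the log timestamps have been exported as Unix seconds; the function names are illustrative and are not part of Moodle.

```python
import math
from collections import Counter
from statistics import median


def per_second_rates(timestamps):
    """Count log entries in each one-second bucket from first to last entry."""
    counts = Counter(int(t) for t in timestamps)
    start, end = min(counts), max(counts)
    # Include seconds with no entries so quiet periods pull the median down.
    return [counts.get(second, 0) for second in range(start, end + 1)]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def summarise(timestamps):
    """Return (median, 99th percentile) of log entries per second."""
    rates = per_second_rates(timestamps)
    return median(rates), percentile(rates, 99)
```

Calling `summarise()` once per day of exported timestamps gives one pair of values per day, which is the shape of data the chart plots.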

Reliability

●Planned downtime for quarterly updates

●Other downtime rare

●Usually not our fault (network, authentication system, cosmic rays*)

●No problems noticed in spring and summer 2016

● [Insert ominous noise here]

* Not really – but we did have a memory failure on the virtual machine host at one point, which is pretty much the same thing

Normally very good

What happened on 2 October 2016?

Problem monitored

●Monitoring tools

●Splunk (commercial)

● TTM (internal)

●Nagios (open source)

●Graphs show median time for course view page

●When high it was typically returning an error message, not just slow

●Database server kept collapsing (very high load)

●Usually didn’t recover unless restarted

[Graph panels: Mon Oct 2; Tue Oct 3; Oct 6 (normal). Label: Normal usage volume]

Hypothesis: hardware

● Load not substantially larger than last year…

●…But we did move database servers from physical to virtual

●Database server: 8 cores, load average 300+

● Linux load average number conflates CPU and IO contention

●Not sure everyone understood this at the time (I didn’t)

●Doubled the number of cores to 16

●Made no difference

Was the hardware too slow to cope with the load?
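To illustrate why a load average of 300+ does not by itself mean the CPUs are saturated, here is a minimal sketch that separates runnable tasks from tasks blocked on IO. It assumes text in the format of Linux `/proc/stat`, which reports `procs_running` and `procs_blocked`; the helper names and the sample numbers are invented for illustration and are not measurements from the OU system.

```python
def parse_proc_stat(text):
    """Extract procs_running and procs_blocked from /proc/stat-style text."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("procs_running", "procs_blocked"):
            fields[parts[0]] = int(parts[1])
    return fields


def describe_load(load_average, cores, stat_text):
    """Summarise whether the load looks CPU-bound or IO-bound."""
    stat = parse_proc_stat(stat_text)
    running = stat.get("procs_running", 0)
    blocked = stat.get("procs_blocked", 0)
    return {
        "load_per_core": load_average / cores,
        "runnable": running,
        "blocked_on_io": blocked,
        # More tasks waiting on IO than on CPU: adding cores is unlikely to help.
        "mostly_io_bound": blocked > running,
    }


# Invented sample: most of the waiting tasks are blocked on IO.
SAMPLE_STAT = "procs_running 12\nprocs_blocked 290\n"
```

With the sample above, `describe_load(300.0, 8, SAMPLE_STAT)` reports a load of 37.5 per core but flags it as mostly IO-bound, which is the kind of situation where doubling the cores makes no difference.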

Hypothesis: slow ForumNG query

● Looked at database query status (when not broken)

●Handful of long-running queries (>1 minute)

●Caused by ‘show usage’ feature (for tutors) in ForumNG

●Already identified as a problem, with fix in next release

●Changed Moodle role settings to prevent use of this feature

● [Insert ominous noise here]

●Made no difference

Was a long-running query in our forum to blame?
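Picking out the long-running queries from a query status dump can be sketched as follows. This assumes rows shaped like PostgreSQL's `pg_stat_activity` view (`pid`, `state`, `query_start`, `query`); the function name and any query text passed in are hypothetical.

```python
from datetime import datetime, timedelta


def long_running(rows, now, threshold=timedelta(minutes=1)):
    """Return (duration, pid, query) for active queries older than threshold."""
    slow = [
        (now - row["query_start"], row["pid"], row["query"])
        for row in rows
        if row["state"] == "active" and now - row["query_start"] > threshold
    ]
    # Longest-running first, so the worst offenders are at the top.
    return sorted(slow, reverse=True)
```

Idle connections are skipped deliberately: a connection can have an old `query_start` without doing any work.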

Why does it only break in office hours?

Hypothesis: memcache flushing

●Moodle caches data, e.g. about courses

●Regenerating takes several seconds per course

●Our setup has MUC stored in ‘memcache’

●Each web server has its own cache

●Read locally (very fast), write to all the others

●Nagios shows memcache statistics

● cmd_flush increasing every few minutes

Was flushing of the Moodle MUC the cause?
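The rising `cmd_flush` counter can also be checked by hand. The sketch below assumes two captures of the text returned by the memcached `stats` command (lines of the form `STAT name value`), taken a few minutes apart; the function names are illustrative.

```python
def parse_stats(text):
    """Turn 'STAT name value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def flushes_between(earlier_text, later_text):
    """Number of flush commands issued between two stats captures."""
    earlier = int(parse_stats(earlier_text)["cmd_flush"])
    later = int(parse_stats(later_text)["cmd_flush"])
    return later - earlier
```

A steadily non-zero result during office hours is the pattern described on this slide.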

●Bad effect if the cache is flushed:

● Likely 200+ different courses accessed in next minute

● If each one takes several seconds of database effort

●And this happens every few minutes…
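A rough estimate shows why this matters. The sketch below is back-of-envelope arithmetic only: it takes "several seconds" as 3 seconds and "every few minutes" as 5 minutes, both of which are assumptions rather than measured values.

```python
def rebuild_load(courses, seconds_per_course, flush_interval_seconds, burst_seconds=60):
    """Estimate database cores kept busy by cache rebuilds.

    Returns (average, burst): the average over the whole flush interval, and
    the level if all rebuilds land in the first burst_seconds after a flush.
    """
    total_work = courses * seconds_per_course
    return total_work / flush_interval_seconds, total_work / burst_seconds


# Assumed values: 200 courses, 3 s each, one flush every 300 s.
average, burst = rebuild_load(200, 3, 300)
```

Under these assumptions the rebuilds keep about 2 cores busy on average, and about 10 cores' worth of work arrives in the minute after each flush.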

Memcache flushing, continued

●Clearing any single area of MUC flushes the entire cache

●Memcache limitation (doesn’t apply when using the file cache, memcached, or Redis)

●Patched in a hack to limit flushes to once per hour

●Also made it trace what was causing the (blocked) flushes

●Changes in Moodle 2.9 (unnoticed) meant that some staff editing actions clear cache areas

●Preventing flush resulted in slightly incorrect behaviour but nothing serious

●Solution improved performance…

● The hourly flushes are visible spikes on our performance monitoring graphs

●…but it didn’t solve the problem

Why is the cache being flushed?
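The idea of limiting flushes and tracing whatever tried to cause them can be sketched as a small rate limiter. This is an illustration of the technique in Python, not the actual patch (which was a change to the PHP code); the class and method names are hypothetical.

```python
class FlushLimiter:
    """Allow at most one cache flush per interval and record suppressed ones."""

    def __init__(self, min_interval_seconds=3600):
        self.min_interval = min_interval_seconds
        self.last_flush = None
        self.blocked = []  # (time, reason) for every suppressed flush

    def request_flush(self, now, reason):
        """Return True if the flush may go ahead, False if it is suppressed."""
        if self.last_flush is None or now - self.last_flush >= self.min_interval:
            self.last_flush = now
            return True
        # Keep a trace so the source of the unwanted flushes can be found later.
        self.blocked.append((now, reason))
        return False
```

The `blocked` list plays the role of the trace: it shows which actions would have flushed the cache had the limit not been in place.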

Hypothesis: loading permissions

●We examined the database queries at the point where it died

●Difficult as database is obviously running at a crawl

● Trying to be quick - we had to restart it to get service back online

● Lots of queries loading user permissions

●Moodle loads permissions when you log in: get_user_access_sitewide.

● This is a necessarily slow query (> 300ms on our system)

●After investigation, we found that it reloads permissions if a relevant context is ‘dirty’ (using the mdl_cache_flags table, accesslib/dirtycontext)

● If the system context is dirty it will reload everyone’s permissions

● If there are 1,000 different users in a minute it will happen for all of them

Was Moodle loading user permissions too often?
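The effect of a dirty system context can be modelled in a few lines. This is a simplified sketch and not Moodle's implementation: it assumes context paths written like `/1/3/25` with the system context as `/1`, and it uses 0.3 seconds per reload as a lower bound taken from the "> 300ms" figure above. The function names are hypothetical.

```python
def needs_reload(loaded_at, user_context_paths, dirty_contexts):
    """True if any of the user's contexts, or an ancestor, was dirtied after load.

    dirty_contexts maps a context path to the time it was marked dirty.
    """
    for path in user_context_paths:
        for dirty_path, marked_at in dirty_contexts.items():
            covers = path == dirty_path or path.startswith(dirty_path + "/")
            if covers and marked_at > loaded_at:
                return True
    return False


def reload_cost_seconds(users_reloading, query_seconds=0.3):
    """Total database time spent reloading permissions."""
    return users_reloading * query_seconds
```

Because every path starts with the system context, marking `/1` dirty makes `needs_reload` true for every user who loaded permissions before that moment. With 1,000 users in a minute, `reload_cost_seconds(1000)` gives at least 300 seconds of database time inside that minute.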

Loading permissions, continued

●System context is marked dirty for things like the following:

●Changing permissions for any role, as I did earlier to prevent the slow forum query

●Adding somebody to, or removing somebody from, a system-wide role

●Checked logs

●System-wide roles added several times a day (mainly various helpdesks)

● Found exact times

●Compared against exact times of system failures

●Almost a 1:1 match

●Patched in a hack to prevent it marking system context dirty

●Means changes at system context won’t take effect until logout

●Our system stayed up again!

Why is it loading permissions too often?
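Comparing the times of role assignments against the times of failures can be sketched like this. The five-minute window is an assumption chosen for illustration, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def match_events(role_changes, failures, window=timedelta(minutes=5)):
    """For each failure, find the latest role change in the preceding window.

    Returns a dict mapping each failure time to the matching role-change
    time, or to None when no role change preceded it within the window.
    """
    matches = {}
    for failure in failures:
        preceding = [
            change for change in role_changes
            if timedelta(0) <= failure - change <= window
        ]
        matches[failure] = max(preceding) if preceding else None
    return matches
```

A near 1:1 match shows up as almost every failure mapping to a role change rather than to `None`.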

Aftermath

Final thoughts

Effect on students

●System down, or slow, for several hours on two of the busiest days

●Students just starting courses – not a great impression

How bad was this?

Testing

● In a more relaxed atmosphere…

●Simulated high student use

●Assigned role at system level (1)

●Didn’t kill it but made a visible dent

●Cleared MUC (2, 7)

●Visible but minor performance impact

●Batch assigned 22 system level roles (8)

●Caused database to collapse, as seen on live

We reproduced the problem to fully verify the changes

The mysteriously unmentioned points 3-6 were similar actions but with the patches in place to prevent bad effects.

[Graphs: requests handled; average time taken per request]

Our actions

●Moodle enhancement for ‘dirty context’ problem / query performance

●MDL-49398 should improve performance of the query (thanks to Skylar Kelty from the University of Kent)

● If that doesn’t fix it, we’ll consider an option to spread the refresh over some minutes

●Database connection pooling in our system

●Should probably stop the database needing a restart

●Change cache configuration in our system

●Redis or memcached

How we can resolve these specific problems
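One way the option of spreading the refresh over some minutes could work is to give each user a fixed, pseudo-random delay. This is a hypothetical sketch of that idea, not a description of any Moodle change; the function names and the 300-second spread are assumptions.

```python
import hashlib


def refresh_delay(user_id, spread_seconds=300):
    """Deterministic per-user delay in the range [0, spread_seconds)."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % spread_seconds


def should_reload(user_id, dirty_at, now, spread_seconds=300):
    """True once this user's share of the spread window has elapsed."""
    return now >= dirty_at + refresh_delay(user_id, spread_seconds)
```

Hashing the user id keeps the delay stable between requests, so a user reloads exactly once, while the reloads as a whole are spread across the window instead of arriving together.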

Learning points

● False sense of security

●Normal operation (outside peak) unlikely to show performance problems

●We do automated load testing every release, but this only covers student usage

●We don’t know what might have changed in Moodle versions

●Should prepare for the peak

●Nobody really looks at the monitoring unless it goes wrong

●Should manually examine live system before peak period

My personal opinions about the incident

Conclusion

●Ensure you have, and use, monitoring tools

●Open source or commercial performance monitoring

●Make sure you can do query dumps on database

●Do performance testing (Apache JMeter, etc.)

●But don’t trust it exclusively

●Developers: Please consider performance carefully

●Avoid unnecessary performance surprises for administrators.

●Example: notifications change in Moodle 3.1 added an unnecessary AJAX request on every single page load (MDL-57968)

●Will this push OU systems over the edge next October?

General advice for anyone running a large Moodle system

Thanks for listening!

Date post:	29-Jan-2018
Category:	Technology
Upload:	sammarshallou