
8th Sakai Conference

4-7 December 2007, Newport Beach

Sakai 2.4 Production Performance Experiences

Berkeley, Cape Town, Indiana, Michigan, Stanford

2

Session Overview

• Introductions

• Raison d’etre

• Questions posed

• Campus responses

• Discussion

3

Campus Representatives

• Indiana University – Lance Speelman

• Stanford University – Lydia Li

• University of California Berkeley – Kevin Chan

• University of Cape Town – Greg Doyle

• University of Michigan – Jeff Cousineau

4

Raison d’etre

• Unexpected performance problems deploying Sakai 2.4 on large campuses
– Despite lengthy pilot runs with previous versions

• Problems not consistent across all campuses deploying at large scale
– Some already at full-scale deployment
– Variations in tools, configurations, etc.?

• Sharing lessons learned with community

5

Questions Posed (1)

• Reference Materials
– Provide a very quick overview of your production infrastructure (hardware and software)
– What Sakai tools comprise your deployment?

6

Questions Posed (2)

• Deployment Decisions
– What was the production scale change, if any, that you made when moving to 2.4?
– What factors were involved in your decision to move to Sakai 2.4, along with your production scale change decision?
• In other words, how “safe” did you feel about making these changes, and why?

7

Questions Posed (3)

• Performance Problems Surface
– How did you know you had a “serious performance problem”?
• What did you observe in production?

– What steps did you take to resolve the problem(s)?

– If you did not experience a serious performance problem, why do you think that was?

8

Questions Posed (4)

• Lessons Learned
– What, if anything, would you do differently as a result of this experience?
– What, if anything, should the Sakai community take away from your experiences?

9

Campus Responses

• Indiana

• Stanford

• Berkeley

• Cape Town

• Michigan

8th Sakai Conference

4-7 December 2007, Newport Beach

Indiana University

Lance Speelman

11

Indiana University

• Eight campuses statewide

• Central delivery of common IT services

• 97,959 Students

• 7,032 Faculty

• 11,537 Staff

12

IU Intelligent Infrastructure

• Hardware Virtualization

• VMware ESX - Red Hat, Apache, Tomcat

• IBM LPAR - AIX, Oracle

• Hitachi TagmaStore SAN

• BlueArc NAS

13

IU Virtual Clouds

14

IU Oncourse CL Current Configuration

• 8 x Application V.Servers (32 CPU, 96GB RAM total, 10GB heaps)

• 1 x Database V.Server (13 CPU, 32GB RAM total) – Oracle database

• 35,000 web requests per minute peak (so far)

15

IU Adoption Rate

16

IU Tool Usage

17

IU Sakai 2.4 Deployment

• We were still trying to fill functionality gaps

• Needed functionality in 2.4

• Past performance is no guarantee of future results

• Subtle changes can have large impacts

18

How did we know we had a problem?

• Day One
– Phones begin to ring
– Extreme slowness throughout the day
– System down

• Day Two
– System down

• Days Three & Four
– Limping

19

What did we do?

• Replaced DBCP with c3p0 (a configuration sketch follows this list)

• Disabled User Presence

• Oracle tuning

• Doubled the JVM heap

• Profiling

• Moved to a 64-bit JVM (total heap 32GB → 80GB)
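
For context, a minimal sketch of what the DBCP-to-c3p0 swap amounts to in code. In Sakai this is normally configured through sakai.properties rather than wired by hand, and the driver, URL, credentials, and pool sizes below are illustrative assumptions, not IU's actual settings.

import com.mchange.v2.c3p0.ComboPooledDataSource;

// Hedged sketch: a c3p0 pool roughly equivalent to what Sakai builds from its
// properties once DBCP is swapped out. All values here are illustrative.
public class C3p0PoolSketch {
    public static ComboPooledDataSource createPool() throws Exception {
        ComboPooledDataSource ds = new ComboPooledDataSource();
        ds.setDriverClass("oracle.jdbc.driver.OracleDriver"); // assumes Oracle, as at IU
        ds.setJdbcUrl("jdbc:oracle:thin:@dbhost:1521:SAKAI");  // hypothetical host/SID
        ds.setUser("sakai");
        ds.setPassword("secret");
        ds.setMinPoolSize(10);                // illustrative sizes
        ds.setMaxPoolSize(50);
        ds.setIdleConnectionTestPeriod(300);  // test idle connections every 5 minutes
        ds.setPreferredTestQuery("select 1 from dual");
        return ds;
    }
}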

20

Lessons Learned

• More timely communication with customers

• Quality Assurance
– Load testing
– Unit testing
– Automated UI testing
– Profiling

• We must get better at producing enterprise quality software.

8th Sakai Conference

4-7 December 2007, Newport Beach

Stanford University

Lydia Li

22

Production Scale Change

• Fall 2007:
– Upgraded to Sakai 2.4.x
– ~980 sites
– ~14,800 unique users

• Summer 2007:
– Turned off legacy system
– Deployed Sakai 2.3.x with 217 sites (~5,100 users)

• Ran small pilots during the previous year.

23

Production Infrastructure

24

Production Infrastructure (cont’d)

• Storage:
– NetApp 3050AS redundant-head filer
– Fiber Channel & SATA storage

• Database:
– Oracle 10.2.0.3 running on Sun V490, 8GB RAM, Solaris 8

• Blob storage on SATA disk, database on Fiber channel.

• Database talks to Storage over NFS, private network.

25

Sakai Tools in Production

• Fall 2007:
– Home, Announcements, Samigo (a.k.a. Tests and Quizzes), Discussion (phpBB), Drop Box, Gradebook, Content Resources, Schedule, Section Info, Site Info, Syllabus, Web Content

26

Why We Moved to Sakai 2.4

• CM section support for large lecture courses
– Major decision factor

• Self-serve worksite setup
– Important for user support folks

• Other features in Sakai 2.4
– Nice to have, not a major decision factor

• As it turned out, CM section support had more problems than we thought (2 JIRAs still outstanding)

27

Working towards Deployment

• Everyone worked hard over the summer.

• No load testing done prior to deployment
– We had no resources to set up a load-test environment that mirrored our production environment.
– We had no in-house performance testing expertise.

• Realized the risk, but also felt somewhat ‘safe’ because other, bigger schools were going to 2.4.

• Deployed 2.4.x in our busiest term.

28

What We Saw

• Crisis during the first week
– Very slow, or hanging, on login
– Lots of ‘pool exhausted' errors
– Many users could not log in due to a corrupted My Workspace
– DB crashed abruptly twice (in the first 2 weeks)
– High volume of help tickets from users

29

What We Did

• Application level
– Changed refreshUser() code to speed up login
– Patched SAK-9860: excessive db queries generated from Site Info / user service
– Extended the CM API with term-specific queries for worksite setup
– Switched from dbcp to c3p0; later reduced the number of c3p0 connections per server from 50 to 25
– Fixed corrupted My Workspace by dropping all users’ My Workspace sites
– Archived the sakai_event and sakai_session tables (a sketch follows this list)
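
As a rough illustration of the archiving step, the sketch below copies old rows out of SAKAI_EVENT and deletes them. The SAKAI_EVENT_ARCHIVE table and the 90-day cutoff are assumptions for the example, not Stanford's actual procedure; SAKAI_SESSION would be handled the same way.

import java.sql.Connection;
import java.sql.PreparedStatement;

// Hedged sketch: move old event rows into an archive table so the live table
// stays small. Assumes SAKAI_EVENT_ARCHIVE already exists with the same columns;
// the 90-day cutoff is illustrative.
public class EventArchiveSketch {
    public static void archiveOldEvents(Connection conn) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement copy = conn.prepareStatement(
                 "INSERT INTO SAKAI_EVENT_ARCHIVE SELECT * FROM SAKAI_EVENT "
                 + "WHERE EVENT_DATE < SYSDATE - 90");
             PreparedStatement purge = conn.prepareStatement(
                 "DELETE FROM SAKAI_EVENT WHERE EVENT_DATE < SYSDATE - 90")) {
            copy.executeUpdate();
            purge.executeUpdate();
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}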

30

What We Did (cont’d)

• Database/OS level:
– Lowered the SGA from 5.5GB to 4GB to reduce Oracle’s memory footprint
– Increased Temp/Undo tablespaces, separate from the Sakai application tablespace
– Added an index on CM_MEMBER_CONTAINER_T.CLASS_DISCR (see the sketch after this list)
– Adjusted /etc/system shared memory parameters to address swap page file sizes (based on nothing more than a Google search ;)
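
The index change above would look roughly like the DDL below; in practice a DBA would run it directly in SQL*Plus, and the index name here is made up for the example.

import java.sql.Connection;
import java.sql.Statement;

// Hedged sketch of the index described above. The index name is illustrative;
// the table and column come from the slide (CM_MEMBER_CONTAINER_T.CLASS_DISCR).
public class CmIndexSketch {
    public static void addClassDiscrIndex(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE INDEX CM_MEMBER_CONTAINER_CD_IDX "
                       + "ON CM_MEMBER_CONTAINER_T (CLASS_DISCR)");
        }
    }
}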

31

What We Did (cont’d)

• User support spent days helping students who couldn't log in, e.g., sending them course materials through email and signing them up for discussion sections.

• Updated the MOTD to inform users of system status.

• Advised users to use CW during off-peak times (especially for large classes) and posted peak/off-peak times on the MOTD.

• A lot of apologizing/groveling to end users.

32

DB Load in Last 6 Months

33

What We Have Learned

• Mirror your production environment in your load-test environment as best you can.

• Load test, Load test, Load test!

• Write good scalable code! Stability/performance is an important ‘feature’.

• Sakai community is very helpful!

8th Sakai Conference

4-7 December 2007, Newport Beach

bSpace Performance Experience – Fall 2007

A Review of UC Berkeley’s Deployment of Sakai 2.4.x
Kevin Chan – bSpace Release Manager

35

bSpace Hardware Background

• App Servers – Sun Fire V210 (x10), 4GB RAM, 2 x 1GHz UltraSPARC III

• Oracle DB Server – IBM p570, 12GB RAM, 3 x 1900MHz PowerPC 5

• Oracle 9i (switched to 10g mid-semester – 10/21)

• Load Balancer – Foundry ServerIron XL; distribution via “least current number of connections”

36

Other Useful Background Info

• bSpace had been running on Sakai 2.1.2 (Fall 2005 – Summer 2007)

• Sakai tools deployed include
– Assignments (with Gradebook integration)
– Forums (from Message Center)
– Quiz and Survey (Samigo)
– Roster

• bSpace hardware managed by central IT

37

Plans for bSpace for Fall 2007

• Move from Sakai 2.1.2 to 2.4
– Many problems with 2.1.2 – DB connection pool issues
– Take advantage of the latest code (vs. 2.2 or 2.3)
– UCB-developed Gradebook – no plans to backport (development against trunk)

• Blackboard retirement for Fall 2007

• CourseWeb (homegrown CMS) retirement for Spring 2008

38

UC Berkeley Course Sites (as of 10/4)

39

Moving to Sakai 2.4

• Generally feeling confident – started work early, but lots to do

• Course Management integration began in March 2007 (2.4 released in May)

• New deployment script developed for Sakai 2.4 – externals to 2.4.x

• Customized Site Info for UC Berkeley

• Sakai 2.4 deployed in early August

40

First Signs of Performance Issues

• First report from user – system seems to be down (September 11, around 11:30pm)

• Subsequent reports (from the same user) on
– September 13, ~10:30pm
– September 14, ~2:45am

• Reports from other users began to trickle in: “slow response”, “cannot log in”, “cannot upload file”, “bSpace is down”

41

Server Side Symptoms

• bSpace developers/support team have access to view app server graphs

42

So What is Causing This?

• Unable to reproduce issue on QA – no load

• Professors and TAs complaining about performance issues with Assignments tool (particularly in large classes > 100 students)

• UC Berkeley began the Fall 2007 semester running 2.4.x branch of Assignments Tool

• First attempt to fix issue – move to post-2-4 branch (Sep. 21) – better, but still slow

43

Other Performance Fixes

• JAVA_OPTS was set improperly; GC performed very poorly – fixed on 9/28

• User Directory lookups improved – 10/10 (one possible approach is sketched after this list)

• Performance issue still not resolved; in fact, the frequency and severity of the issue increased – 10/16 and 10/17

• UC Berkeley in Crisis (email sent to community on 10/17)
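
The slides do not say how the user directory lookups were improved. One common approach is to put a small time-limited cache in front of the directory lookup, roughly as sketched below; the Lookup interface, TTL, and unbounded map are assumptions for illustration, not Berkeley's actual fix.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch: a tiny TTL cache in front of a user directory lookup, so
// repeated lookups of the same user id hit memory instead of the directory.
// The Lookup interface, the 5-minute TTL, and the unbounded map are all
// illustrative; this is not Berkeley's actual change.
public class UserLookupCacheSketch {
    public interface Lookup { String displayNameFor(String eid); }

    private static final long TTL_MS = 5 * 60 * 1000L;

    private static final class Entry {
        final String value;
        final long expiresAt;
        Entry(String value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<String, Entry>();
    private final Lookup directory;

    public UserLookupCacheSketch(Lookup directory) { this.directory = directory; }

    public String displayNameFor(String eid) {
        long now = System.currentTimeMillis();
        Entry cached = cache.get(eid);
        if (cached != null && cached.expiresAt > now) {
            return cached.value;                       // fresh cache hit
        }
        String name = directory.displayNameFor(eid);   // slow directory call
        cache.put(eid, new Entry(name, now + TTL_MS));
        return name;
    }
}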

44

Not Looking Good

• On 10/17, 4 out of 10 servers experienced this issue (graph shown on the slide)

45

Sakai Community in Action

• DB Query Improvements – UCB DBA initiated – 10/18

• Integration query – instructors and courses

• Full table scan in Forums

• Presence

• Turn off sakai.presence; include performance fix in 2.4.x branch of Forums – 10/20

• No problems…until 10/24 (yep, Wed.)

46

Fix Assignments

• Duplicate Assignment Submission problem (known since mid-September – switch to post-2-4)

• Patch (from trunk) applied on Nov. 2 – includes DB conversion script

• No performance issues since Nov. 2

• But we are seeing functional issues on post-2-4 Assignment Tool

47

UCB Lessons Learned

• Have the appropriate personnel in place
– Operations Manager
– Release Manager

• Fewer major changes at once (if possible)
– Moving from 2.1.2 to 2.4
– Deployment/SVN changes
– CM implementation and Site Info rewrite

• Load simulation and testing

48

For the Community

• Ask for help from the community early and often (even if you do not fully understand the problem)

• Have load creation capabilities for troubleshooting performance issues – looking at logs and the sakai.events table is not enough (a minimal load-generator sketch follows this list)

• Do not abandon the maintenance branch – assignment 2.4.x
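
In the spirit of the load-creation point above, a bare-bones load generator is sketched below. Real tests (JMeter, The Grinder, and similar tools) would script logins and tool workflows; the URL, thread count, and request count here are arbitrary placeholders.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch: N threads fetching a portal URL and reporting a mean response
// time. The target URL, thread count, and request count are illustrative only.
public class LoadSketch {
    public static void main(String[] args) throws Exception {
        final String target = "https://sakai.example.edu/portal"; // hypothetical URL
        final int threads = 50;
        final int requestsPerThread = 100;
        final AtomicLong totalMillis = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < requestsPerThread; i++) {
                    try {
                        long start = System.currentTimeMillis();
                        HttpURLConnection conn =
                            (HttpURLConnection) new URL(target).openConnection();
                        conn.getResponseCode();   // forces the request
                        conn.disconnect();
                        totalMillis.addAndGet(System.currentTimeMillis() - start);
                    } catch (Exception ignored) {
                        // a real test would count errors as well
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("mean ms/request: "
            + totalMillis.get() / (double) (threads * requestsPerThread));
    }
}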

8th Sakai Conference

4-7 December 2007, Newport Beach

Performance management of Sakai at UCT

Greg Doyle & Stephen Marquard

Centre for Educational Technology

University of Cape Town

50

Vula production platform

• Scaled up during 2007: added servers and RAM

          Feb 2007 (2.2)                  Jul 2007 (2.3)         Oct 2007 (2.4)
Server 1  Apache + mod_jk, MySQL, Tomcat  Apache + mod_jk (4G)   Apache + mod_jk (4G)
Server 2  2 x Tomcat (4G)                 2 x Tomcat (4G)        Tomcat (8G)
Server 3  -                               2 x Tomcat (4G)        Tomcat (8G)
Server 4  -                               MySQL (4G)             Tomcat (8G)
Server 5  -                               -                      Tomcat (8G)
Server 6  -                               -                      MySQL (16G, 4 x CPU)

51

Tools deployed in Vula (2.4)

• Sakai 2-4-x maintenance branch

• Post-2.4 versions of:
– Assignments, Gradebook, Roster

• Provisional tools
– Search, T&Q, Post’Em, Podcasts

• Not using OSP yet (except for Glossary)

• Contrib and third-party tools

– Sitestats, Melete, Sakai Maps, Turnitin integration, LAMS2 integration

52

Deployment of Sakai 2.4

• Deployed in Jun 07 for the 2nd term starting Jul 07. Didn’t expect a significant change in extent of use from Sakai 2.3 to Sakai 2.4 (about a 10% increase).

• Relatively complex upgrade: new course management, many new versions of tools with new features. However, we weren’t expecting performance issues.

53

Performance problems (Feb 07)

• First performance issues in Feb 07:

– Insufficient hardware capacity for volume of use

– Excessively high db queries from Forums

• Implemented a set of performance metrics as a result (Vula Dashboard). Key measurement is “page request time” – time taken to fetch a page from content hosting.
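
A toy version of such a probe is sketched below: fetch one known file from content hosting on a fixed interval and log the elapsed time. The probe URL is hypothetical, and UCT's actual dashboard aggregates far more than this.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Timer;
import java.util.TimerTask;

// Hedged sketch of a "page request time" probe: fetch one known content-hosting
// page every minute and print how long it took. The URL is a placeholder.
public class PageTimeProbeSketch {
    public static void main(String[] args) {
        final String probeUrl =
            "https://vula.uct.ac.za/access/content/public/probe.html"; // hypothetical
        new Timer().scheduleAtFixedRate(new TimerTask() {
            @Override public void run() {
                try {
                    long start = System.currentTimeMillis();
                    HttpURLConnection conn =
                        (HttpURLConnection) new URL(probeUrl).openConnection();
                    conn.getResponseCode();
                    conn.disconnect();
                    System.out.println("page request time ms: "
                        + (System.currentTimeMillis() - start));
                } catch (Exception e) {
                    System.out.println("probe failed: " + e);
                }
            }
        }, 0, 60000);   // every 60 seconds; the non-daemon Timer keeps the JVM alive
    }
}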

54

Key performance metrics

• https://vula.uct.ac.za/web/dev/Dashboard/

• Users and throughput:
– sessions and active users
– https requests/second
– mysql queries (cached / non-cached): read / update
– events

• Performance:
– Page request time (should be flat, < 120ms)
– Tool time-to-first-byte for GETs (averages 200-300ms)

• “Nominal” / “in trouble”:
– Pending load balancer requests and httpd processes (typically < 100)
– mysql connections (10 per app server = 40 total, should be flat)
– mysql slow queries / minute (none except for daily backup)

55

Performance problems again (Jul 07)

• App servers unable to cope with peak loads (especially during tutorial signup)

• Observed problems:
– High response times
– High load average / CPU use on app servers
– Out-of-memory issues
– App servers became unresponsive
– Db pool exhausted
– High number of slow mysql queries (>1s)

56

Investigation of performance issues

• Looked at stack traces on particular app servers (kill -3) when showing high load:
– Lots of threads busy with assignments
– Post-2.4 Assignments issue with non-electronic assignments (n² loading of assignment submissions)

• Added a query to show event counts in the last 5 min, hour and 24 hours (a sketch follows this list):
– Excessively high volume of presence events
– Caused by a bug in the courier code generating a presence event every 30s for every logged-in user: the high volume of db writes hurt mysql performance

• Slow response times from Forums in sites with (a) many groups, (b) lots of messages
– Looked at tool time-to-first-byte from apache logs to create a histogram of average response times and find problematic sites
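
A rough sketch of such an event-count query is shown below, assuming the standard SAKAI_EVENT table on MySQL; the exact grouping and intervals used at UCT are not in the slides.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

// Hedged sketch: count events by type over the last N minutes from SAKAI_EVENT
// (standard Sakai schema). UCT's actual query may differ.
public class EventCountSketch {
    public static void printEventCounts(Connection conn, int minutes) throws Exception {
        String sql = "SELECT EVENT, COUNT(*) AS CNT FROM SAKAI_EVENT "
                   + "WHERE EVENT_DATE > ? GROUP BY EVENT ORDER BY CNT DESC";
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setTimestamp(1, new Timestamp(
                System.currentTimeMillis() - minutes * 60000L)); // cutoff computed in Java
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("EVENT") + ": " + rs.getLong("CNT"));
                }
            }
        }
    }
}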

57

Resolving performance issues (software)

• Assignments: worked with UM (Zhen Qian) to resolve the issues and deployed several iterations of the post-2.4 branch

• Excessive events from courier: fixed bug ourselves (simple fix)

• Forums response time: reported results to Indiana (Michelle Wagner) and deployed updated 2-4-x branch with performance fixes

58

Resolving performance issues (hardware)

• Moved to 64-bit app servers to allow using more RAM (avoid OutOfMemory errors)
– Moved from 4 x 1.2G heap (4.8G total) to 4 x 5.2G heap (20.8G total): also headroom for future growth

• Moved mysql to a 4xCPU 16G RAM server

• With a larger heap, JVM tuning was critical: incorrect settings led to either long pause times (from 4s to 20s) or low pause times (<100ms) but with very poor overall performance.

• Currently using:

-server -d64 -Xms5200m -Xmx5200m -Xmn1g -XX:MaxPermSize=512m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:GCTimeRatio=19 -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31

59

Lessons and recommendations

• Set up performance metrics for all aspects of the system (front-end / app servers / database), so that you can see when performance characteristics change significantly (before/after comparison), and where they are changing

• Do performance testing of new features with realistic data sets

• Provision hardware with enough headroom to survive unexpected performance impacts from software changes and bugs (in the short term)

• File JIRAs as soon as possible with as much detail as possible.

8th Sakai Conference

4-7 December 2007, Newport Beach

CTools with Sakai 2.4

Jeff Cousineau

University of Michigan

61

CTools Hardware Infrastructure

• Load balancers (2)
– Citrix NetScaler RS9800 HA Application Switch, 10/100/1000Mbps copper, 1GB memory

• Application servers (8)
– Dell PowerEdge 2650, dual 3.2GHz CPU, 4GB RAM

• Database server (1)
– Sun T2000, 8-core, 1.0GHz UltraSPARC T1 processor, 16GB RAM
– Have hardware for RAC but have not deployed it

• Database storage
– Network Appliance FAS3020

• File storage
– 1.5+TB AFS (moving to NetApp winter 2008)

62

CTools Software Infrastructure

• Application servers
– RHEL 3
– Apache 1.3.x
– Tomcat 5.5.x, Java 1.5.0_10
– CTools 2.4.x

• Database server
– Solaris 10
– Oracle 10.2

63

Sakai 2.3 Tools
(the slide groups these tools into Production, Stealth, and Pilot columns)

OCW, ePortfolio, Postem, iTunes U, Podcasts, Search

Modules, Resources, Schedule, Site Info, Syllabus, Web Content, News, Wiki, Synoptic tools, Help

Worksite Setup, Home, Announcements, Assignments, Chat, Discussion, Dropbox, Email, Gradebook, Message Center, Library Reserves, Gradtools

64

Sakai 2.4 Tools
(the slide groups these tools into Production, Stealth, and Pilot columns)

OCW, Checklist, CourseEval

ePortfolio, Postem, iTunes U, Podcasts, Search, Goal Management, Test & Quiz, Test Center, Blogger, Polls, PageOrder (project sites)

CourseEval (midterm)

Message Center, Messages, Modules, Podcasts (project sites only)

Resources, Schedule, Site Info, Syllabus, Web Content, News, Wiki, Synoptic tools, Help

Worksite Setup, Home, Announcements, Assignments, Citations, Chat, Discussion, Dropbox, Email, Forums, Gradebook, Gradtools, Library Reserves

65

No Production Scale Change

• Michigan turned off legacy system April 2005

• Suffered scale problems throughout Fall 2005 and Winter 2006

• Invested in load testing tools Spring 2005

66

2.4 Go/No Go Process

• CTools Advisory Committee (CTAC) makes the final decision based on
– Perceived value of version changes for campus
– Results of functional tests
– Results of performance tests
– Analysis of associated risks

67

Tool Value & Functional Tests

• Perceived tool value
– A large number of tools improved, replaced, and/or optimized for usability and efficiency

• Functional test team reported
– 4 low-usage tools needed code cleanup; expected to be complete by the target roll date
– 3 tools needed minor bug fixes; patches expected before Fall 07 classes started

• No blockers

68

Performance Testing Goals

• Run targeted 2.4 Samigo and Mneme tests to assist developers using 2.3 baseline activity rates

• Increase baseline activity by 25% for 2.4

• Test new tools
– Individually
– With baseline load

• Identify potential performance issues & minimize risk (if possible)

69

Configuration Changes

• Java config
– -Xms1500m -Xmx1500m -XX:NewSize=400m -XX:MaxNewSize=400m -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC

• Database connection pool settings (dbcp)
– Improve app server interaction with the database
– [email protected]=1
– [email protected]=14
– [email protected]=15
– [email protected]=14
– [email protected]=-1

70

Load Testing Results & Decision

• Tests consistently showed very high CPU utilization on the database server

• Little testing completed on new tools

• Decided this was insufficient to block the roll; moved to 2.4 on July 14, 2007

71

Production Performance

• No significant problems related to move to 2.4.x

• Continuous DB monitoring, tuning throughout semester

• Issue with CTools-specific provider

• Presence bug

• App server restarts

72

Additional Observations

• User/Session disparity in application server graphs

73

What Would We Do Differently?

• More in-depth testing
– Time and resource constraints

• Work more closely with tool developers to test earlier

74

Lessons Learned

• While there's tremendous desire to make Sakai easy to deploy and run out of the box, it's simply not there yet... At least not for large institutions

• The time spent tuning database connection pool settings was time well invested

• We’re blessed with highly skilled technical staff

• We’re not trading in our performance testing environment!

• Long experience pays off?

8th Sakai Conference

4-7 December 2007, Newport Beach

Discussion

Panel responds to questions & comments from the audience & each other

