Page 1:

Federated Learning with Proximal Stochastic Variance Reduced Gradient Algorithms

Part of the Faculty of Engineering Summer Research Program 2019/20

Canh T. Dinh, The University of Sydney

Nguyen H. Tran, The University of Sydney

Tuan Dung Nguyen, The University of Melbourne

Wei Bao, The University of Sydney

Albert Y. Zomaya, The University of Sydney

Bing B. Zhou, The University of Sydney

Page 2:

Outline

• Federated Learning

• System model

• Algorithm design

• Convergence analysis

• Experimental findings

Page 3:

Federated Learning* (FL)

• A fast-developing decentralized ML technique

• One global model, many local models in a network

• Pros: no need to send local data to the server, which preserves privacy

Federated Learning scheme (user devices UE 1, UE 2, UE 3 and a central server):

1. Local computation
2. Transmit learning parameters
3. Update global model
4. Update learning parameters
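As a small illustration of step 3, the sketch below shows one way a server could aggregate client updates by a data-size-weighted average (NumPy only; the helper name and the weighting choice are illustrative and not taken from the paper's code):

    import numpy as np

    def aggregate(client_weights, client_sizes):
        """Weighted average of client parameter vectors (step 3 of the FL scheme)."""
        total = sum(client_sizes)
        stacked = np.stack(client_weights)          # shape: (N, d)
        coeffs = np.array(client_sizes) / total     # D_n / D
        return coeffs @ stacked                     # new global model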

* H. B. McMahan, E. Moore, D. Ramage, and S. Hampson, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54, Fort Lauderdale, FL, USA, 2017

Page 4:

Challenges of FL

- Systems heterogeneity: differences in hardware (storage, computational power, connection) among users

- Statistical heterogeneity: devices’ local data are non-identically distributed

Both complicate algorithm design and convergence analysis.

Our contributions

- An FL algorithm using proximal stochastic variance reduced gradient (SVRG) methods, FedProxVR: each device updates its local model until a local accuracy threshold is achieved

- Convergence analysis: how to set the learning rate to achieve convergence

- Characterization of tradeoff between global and local convergence

- Method of minimizing the total training time

Page 5:

System Model

Individual loss function on each device: $F_n(w) := \frac{1}{D_n}\sum_{i\in\mathcal{D}_n} f_i(w)$

Global minimization problem: $\min_{w\in\mathbb{R}^d} \bar F(w) := \sum_{n=1}^{N}\frac{D_n}{D}\,F_n(w)$

Assumptions:

$\|\nabla f_i(w) - \nabla f_i(w')\| \le L\,\|w - w'\|$  (1)

$F_n(w) + \langle \nabla F_n(w),\, w' - w\rangle \le F_n(w') + \frac{L}{2}\,\|w - w'\|^2$  (2)

$\|\nabla F_n(w) - \nabla \bar F(w)\| \le \sigma_n\,\|\nabla \bar F(w)\|$  (3)

There are $N$ users. Each user $n$'s dataset $\mathcal{D}_n$ has size $D_n$. Total data size: $D = \sum_{n=1}^{N} D_n$.
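To make the two objectives above concrete, here is a minimal NumPy sketch of $F_n$ and $\bar F$; the least-squares loss is only an example, since the paper does not fix a particular $f_i$:

    import numpy as np

    def F_n(w, X_n, y_n):
        # local objective: average loss over device n's D_n samples
        return 0.5 * np.mean((X_n @ w - y_n) ** 2)

    def F_bar(w, devices):
        # global objective: sum_n (D_n / D) * F_n(w), where devices = [(X_n, y_n), ...]
        sizes = np.array([len(y) for _, y in devices], dtype=float)
        weights = sizes / sizes.sum()
        return sum(p * F_n(w, X, y) for p, (X, y) in zip(weights, devices))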

Page 6:

Algorithm Design

Local model update: $\min_{w\in\mathbb{R}^d} \big\{ J_n(w) := F_n(w) + h_s(w) \big\},$  (1)

where $h_s(w) := \frac{\mu}{2}\,\big\|w - \bar w^{(s-1)}\big\|^2.$  (2)

(2) is a "soft" consensus constraint that penalizes deviation from the current global model.
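A short sketch of the regularized local objective $J_n$ and its gradient; F_n and grad_F_n are hypothetical callables standing in for whatever local loss a device uses:

    import numpy as np

    def J_n(w, w_bar, mu, F_n):
        # local objective: F_n(w) + (mu / 2) * ||w - w_bar||^2
        return F_n(w) + 0.5 * mu * np.sum((w - w_bar) ** 2)

    def grad_J_n(w, w_bar, mu, grad_F_n):
        # gradient of the local objective: grad F_n(w) + mu * (w - w_bar)
        return grad_F_n(w) + mu * (w - w_bar)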

In each local iteration:

• Find a VR stochastic gradient estimator,

• Update the parameters using the proximal operator

• Send the updated parameters to the server

VR stochastic gradient estimator (SARAH form): $v_{n,s}^{(t)} = \nabla f_{i_t}(w_{n,s}^{(t)}) - \nabla f_{i_t}(w_{n,s}^{(t-1)}) + v_{n,s}^{(t-1)}$

$\epsilon$-accurate solution: $\frac{1}{T}\sum_{s=1}^{T}\mathbb{E}\big\|\nabla \bar F(\bar w^{(s)})\big\|^2 \le \epsilon$
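A tiny NumPy sketch of this convergence indicator (the stationary gap), assuming the per-round global gradients have already been collected in a list:

    import numpy as np

    def stationary_gap(global_grads):
        # (1/T) * sum_s ||grad F_bar(w_bar^(s))||^2
        return np.mean([np.linalg.norm(g) ** 2 for g in global_grads])

    # the run is epsilon-accurate once stationary_gap(global_grads) <= eps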

Page 7:

Algorithm Design

Stochastic gradient estimators (plain SGD vs. the variance-reduced SARAH and SVRG):

SGD: $v_{n,s}^{(t)} = \nabla f_{i_t}(w_{n,s}^{(t)})$

SARAH: $v_{n,s}^{(t)} = \nabla f_{i_t}(w_{n,s}^{(t)}) - \nabla f_{i_t}(w_{n,s}^{(t-1)}) + v_{n,s}^{(t-1)}$

SVRG: $v_{n,s}^{(t)} = \nabla f_{i_t}(w_{n,s}^{(t)}) - \nabla f_{i_t}(w_{n,s}^{(0)}) + v_{n,s}^{(0)}$
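A sketch of the three estimators evaluated on one sampled index i_t; grad_f(i, w) is a hypothetical per-sample gradient helper, not part of the paper's code:

    def sgd_estimator(grad_f, i_t, w_t):
        return grad_f(i_t, w_t)

    def sarah_estimator(grad_f, i_t, w_t, w_prev, v_prev):
        # recursive: compares the current and previous iterates
        return grad_f(i_t, w_t) - grad_f(i_t, w_prev) + v_prev

    def svrg_estimator(grad_f, i_t, w_t, w_0, v_0):
        # anchored: compares the current iterate with the snapshot w_0
        return grad_f(i_t, w_t) - grad_f(i_t, w_0) + v_0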

Proximal operator: $\mathrm{prox}_{\eta h_s}(x) := \underset{w\in\mathbb{R}^d}{\arg\min}\Big( h_s(w) + \frac{1}{2\eta}\|w - x\|^2 \Big) = \frac{\eta}{1+\eta\mu}\Big(\mu\,\bar w^{(s-1)} + \frac{1}{\eta}x\Big)$

Find the proximal of the descent step: $w_{n,s}^{(t+1)} = \mathrm{prox}_{\eta h_s}\big(w_{n,s}^{(t)} - \eta\, v_{n,s}^{(t)}\big)$
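Because $h_s$ is quadratic, the proximal operator has the closed form above. A short NumPy sketch, with a brute-force 1-D check included purely for illustration:

    import numpy as np

    def prox_eta_hs(x, w_bar, eta, mu):
        # closed form: (eta / (1 + eta * mu)) * (mu * w_bar + x / eta)
        return (eta / (1.0 + eta * mu)) * (mu * w_bar + x / eta)

    # quick sanity check against the argmin definition on a 1-D example
    x, w_bar, eta, mu = 2.0, 0.5, 0.1, 1.0
    grid = np.linspace(-5, 5, 200001)
    objective = 0.5 * mu * (grid - w_bar) ** 2 + (grid - x) ** 2 / (2 * eta)
    assert abs(grid[np.argmin(objective)] - prox_eta_hs(x, w_bar, eta, mu)) < 1e-3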

Page 8:

Algorithm Design

Global iterations $s = 1,\dots,T$; for all devices $n$ in parallel:

Initial parameters: $w_{n,s}^{(0)} = \bar w^{(s-1)}$, $\;v_{n,s}^{(0)} = \nabla F_n(w_{n,s}^{(0)})$, $\;w_{n,s}^{(1)} = \mathrm{prox}_{\eta h_s}\big(w_{n,s}^{(0)} - \eta\, v_{n,s}^{(0)}\big)$

For $t = 1,\dots,\tau$:
Step 1. Randomly pick a batch.
Step 2. Find $v_{n,s}^{(t)} = \nabla f_{i_t}(w_{n,s}^{(t)}) - \nabla f_{i_t}(w_{n,s}^{(t-1)}) + v_{n,s}^{(t-1)}$.
Step 3. Update $w_{n,s}^{(t+1)} = \mathrm{prox}_{\eta h_s}\big(w_{n,s}^{(t)} - \eta\, v_{n,s}^{(t)}\big)$.

Set $w_n^{(s)} = w_{n,s}^{(t')}$, where $t'$ is chosen randomly in $\{1,\dots,\tau\}$.
Send $w_n^{(s)}$ to the server for aggregation.
The server averages all $w_n^{(s)}$ to obtain the updated global model $\bar w^{(s)}$.
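Putting the pieces together, here is a compact NumPy sketch of one possible implementation of this loop. It is a simplification under stated assumptions: single-sample batches for i_t, a least-squares f_i, and an unweighted server average; it is not the authors' released code.

    import numpy as np

    rng = np.random.default_rng(0)

    def grad_fi(w, x_i, y_i):            # per-sample gradient for f_i(w) = 0.5*(x_i.w - y_i)^2
        return (x_i @ w - y_i) * x_i

    def grad_Fn(w, X, y):                # full local gradient
        return X.T @ (X @ w - y) / len(y)

    def prox(x, w_bar, eta, mu):         # prox of eta*h_s, closed form from the previous page
        return (eta / (1 + eta * mu)) * (mu * w_bar + x / eta)

    def fedproxvr(devices, d, T=50, tau=20, eta=0.05, mu=1.0):
        w_bar = np.zeros(d)
        for s in range(T):                                     # global iterations
            local = []
            for X, y in devices:                               # in practice: in parallel on devices
                w_prev = w_bar.copy()                          # w^(0) = w_bar^(s-1)
                v = grad_Fn(w_prev, X, y)                      # v^(0) = grad F_n(w^(0))
                w = prox(w_prev - eta * v, w_bar, eta, mu)     # w^(1)
                iterates = []
                for t in range(1, tau + 1):
                    i = rng.integers(len(y))                   # Step 1: pick a batch (here one sample)
                    v = grad_fi(w, X[i], y[i]) - grad_fi(w_prev, X[i], y[i]) + v   # Step 2: SARAH
                    w_prev, w = w, prox(w - eta * v, w_bar, eta, mu)               # Step 3: proximal update
                    iterates.append(w_prev)                    # store w^(t) for the random pick below
                local.append(iterates[rng.integers(tau)])      # w_n^(s) = w^(t'), t' chosen at random
            w_bar = np.mean(local, axis=0)                     # server averages all w_n^(s)
        return w_bar

    # toy run: 5 devices with heterogeneous linear-regression data
    devices = [(rng.normal(size=(40, 3)), rng.normal(size=40)) for _ in range(5)]
    w_star = fedproxvr(devices, d=3)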

Page 9:

Convergence Analysis

Excerpt from the paper (ICPP '20, Edmonton, AB, Canada; Dinh, Tran, Nguyen, Bao, Zomaya, and Zhou):

The convergence criterion of the local problem (6) (at a given $s$) is defined as follows:

$\mathbb{E}\big[\,\|\nabla J_n(w_n^{(s)})\| \,\big|\, \bar w^{(s-1)}\big] \le \theta\, \big\|\nabla F_n(\bar w^{(s-1)})\big\|,$  (11)

which is parametrized by a local accuracy $\theta \in (0, 1)$, and thus by the total number of local iterations $\tau$. This local accuracy concept resembles the approximation and inexact factors in [28] and [16, 24], respectively. Here $\theta = 0$ means the local problem (6) must be solved optimally, and $\theta = 1$ means no progress on the local problem, i.e., setting $\tau = 0$. Since all devices have the same $\tau$, local model updates are synchronous. $\mathbb{E}[\cdot]$ denotes the expectation with respect to all randomness in FedProxVR.

Global model update: After receiving all local models sent by the devices, the server updates the global model according to line 12, which is then fed back to all devices for the next global iteration. We also use the expected squared norm of the gradient as the convergence indicator (i.e., stationary gap) for non-convex problems [3]; the global problem (2) achieves an $\epsilon$-accurate solution if

$\frac{1}{T}\sum_{s=1}^{T}\mathbb{E}\big\|\nabla \bar F(\bar w^{(s)})\big\|^2 \le \epsilon.$  (12)

4.2 FedProxVR's Convergence Analysis

In FedProxVR, we choose a fixed step size $\eta$ (using a fixed step size is more practical than a diminishing one [3]), parametrized by $\alpha$ such that $\eta = \frac{1}{\alpha L}$. The convergence of the local model update is provided as follows.

Lemma 1. Device $n$ achieves a $\theta$-accurate solution (11) if $\alpha$ and $\tau$ satisfy the following conditions:

a) when the SARAH update (8a) is used:

$0 \le \frac{3(\alpha^2 L^2 + \mu^2)}{\theta^2\,\tilde\mu L(\tau - 3)} \le \frac{5\alpha^2 - 4\alpha}{8}$  (13)

b) when the SVRG update (8b) is used:

$0 \le \frac{3(\alpha^2 L^2 + \mu^2)}{\theta^2\,\tilde\mu L(\tau - 3)} \le \frac{5\alpha^2 - 4\alpha}{8a\tau^2}$  (14)

where there exists $a > 0$ such that $a \ge 4 + 4\sqrt{a(\tau + 1)}$.

The following remarks concern the relations between the local accuracy $\theta$, the number of local iterations $\tau$, and the step size parameter $\alpha$.

Remark 1.
(1) For an arbitrary $\theta \in (0, 1]$, we can always choose a sufficiently large $\alpha$ to satisfy both (13) and (14), where the lower and upper bounds of $\tau$ are $\Omega(\alpha)$ and $O(\alpha^2)$, respectively. This means that with a sufficiently small (fixed) step size $\eta$, local convergence is guaranteed.
(2) We see that $\tau = \Omega(\frac{1}{\theta^2})$. Thus, if $\theta$ is smaller, $\tau$ must be larger to satisfy the lower-bound conditions. It is straightforward that with a smaller $\theta$ the solution to (6) is closer to optimal, which requires running more local iterations.
(3) In practice, since a large step size $\eta$ and thus fast convergence (small $\tau$) are preferred, we choose the smallest $\alpha_{\min}$ satisfying the conditions of Lemma 1 by solving (e.g., in the case of SARAH)

$\frac{3(\alpha^2 L^2 + \mu^2)}{\theta^2\,\tilde\mu L(\tau - 3)} = \frac{5\alpha^2 - 4\alpha}{8}, \quad \tau > 3,$  (15)

and correspondingly obtaining (the smallest) $\tau$:

$\tau = \frac{5\alpha_{\min}^2 - 4\alpha_{\min}}{8}.$  (16)

(4) Observing that the lower bound of $\tau$ is $\Omega(\mu)$, increasing $\mu$ (e.g., to keep $\tilde\mu \ge 0$ when $L$ is large) will increase $\tau$. This is because a larger $\mu$ enforces the local update to stay more proximal to the "anchor" point $\bar w^{(s-1)}$ in each $s$, thus making the convergence to the $\theta$-accurate solution slower.
(5) Compared to SARAH, SVRG has a stricter condition for the upper bound (due to $a \ge 4$). Thus, SVRG requires a larger $\alpha_{\min}$ to satisfy condition (14), and thus a larger $\tau$ (due to the lower bound). This can be explained by the fact that SARAH uses stochastic gradient estimates that are more stable than those of SVRG, which was also validated on a sample dataset in [22]. We note that a concrete theoretical comparison between SARAH and SVRG has not been explored before.

Defining the cost gap of an arbitrary point $\bar w^{(0)}$ by $\Delta(\bar w^{(0)}) := \mathbb{E}\big[\bar F(\bar w^{(0)}) - \bar F(\bar w^{*})\big]$, we next provide the convergence condition for the global model update of FedProxVR.

Theorem 1. Consider FedProxVR with all devices satisfying the conditions of Lemma 1. Then

$\frac{1}{T}\sum_{s=1}^{T}\mathbb{E}\big\|\nabla\bar F(\bar w^{(s)})\big\|^2 \le \frac{\Delta(\bar w^{(0)})}{\Theta\, T},$  (17)

where

$\Theta = \frac{1}{\mu}\Big(1 - \theta\sqrt{2(1+\bar\sigma^2)} - \frac{2L}{\tilde\mu}\sqrt{(1+\theta^2)(1+\bar\sigma^2)} - \frac{2L\mu}{\tilde\mu^2}(1+\theta^2)(1+\bar\sigma^2)\Big) > 0.$

Corollary 1. The number of global iterations required to achieve an $\epsilon$-accurate solution to (2) is

$T \ge \frac{\Delta(\bar w^{(0)})}{\Theta\,\epsilon}.$  (18)

Remark 2.
(1) We see that $\theta$ and $\mu$ are vital control "knobs" for the convergence of FedProxVR. Specifically, to enable $\Theta > 0$, we have to choose a sufficiently large $\mu$ and $\theta < \big(2(1+\bar\sigma^2)\big)^{-1/2}$, which shows how data heterogeneity impacts both local and global convergence. In particular, a larger $\bar\sigma^2$, and thus a smaller $\theta$, means devices will run more local iterations.
(2) These two parameters also characterize the trade-off between local and global convergence. While global convergence requires $\theta$ to be sufficiently small, devices prefer a larger $\theta$ for faster local convergence (see Remark 1). On the other hand, while $\mu$ must be sufficiently large to ensure global convergence, it should not be too large, which would negatively impact local convergence (i.e., making $\tau$ large) and global convergence (i.e., making $\Theta$ small, and thus $T$ large).
(3) Large $L$ and $\bar\sigma$ require a large $\mu$ in order to have $\Theta > 0$.
(4) Compared to the $O(\frac{1}{\epsilon})$ iterations of conventional proximal SVRG [17] or SARAH [32] (with non-convex objectives and a fixed step size, but without the FL setting), we see that FedProxVR with $T = O(\frac{1}{\Theta\epsilon})$ is scaled by a federated factor $\Theta$. Next, we will optimize FedProxVR's parameters, including the federated factor.


$\epsilon$-accurate solution: $\frac{1}{T}\sum_{s=1}^{T}\mathbb{E}\big\|\nabla \bar F(\bar w^{(s)})\big\|^2 \le \epsilon$
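A small sketch of how a device could monitor the local accuracy criterion (11) in code, reusing the grad_J_n and grad_F_n helpers sketched earlier; averaging over a handful of candidate iterates is only an indicative stand-in for the conditional expectation:

    import numpy as np

    def local_accuracy_reached(w_candidates, w_bar, theta, grad_J_n, grad_F_n):
        # criterion (11): average local gradient norm vs. theta * ||grad F_n(w_bar)||
        lhs = np.mean([np.linalg.norm(grad_J_n(w, w_bar)) for w in w_candidates])
        rhs = theta * np.linalg.norm(grad_F_n(w_bar))
        return lhs <= rhs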

Page 10:

Convergence Analysis


$\theta$ and $\mu$ are vital control "knobs" for convergence.

To ensure $\Theta > 0$:

• $\mu$ should be large

• $\theta < \big(2(1+\bar\sigma^2)\big)^{-1/2}$

Comparison:

• Proximal SVRG/SARAH: $O(1/\epsilon)$ iterations

• FedProxVR: $O\big(1/(\Theta\epsilon)\big)$ iterations

$\Theta$ is called the federated factor, which determines the number of global iterations of FedProxVR.
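The sketch below evaluates the federated factor from Theorem 1 and the resulting iteration bound of Corollary 1 for illustrative hyperparameter values; since the definition of $\tilde\mu$ is not shown on the slides, it is passed in as a plain number here (an assumption, not the paper's definition):

    import numpy as np

    def federated_factor(theta, mu, mu_tilde, L, sigma_bar_sq):
        # Theta from Theorem 1
        a = 1 + sigma_bar_sq
        return (1.0 / mu) * (1 - theta * np.sqrt(2 * a)
                             - (2 * L / mu_tilde) * np.sqrt((1 + theta ** 2) * a)
                             - (2 * L * mu / mu_tilde ** 2) * (1 + theta ** 2) * a)

    def global_iterations(delta_0, theta_fac, eps):
        # Corollary 1: T >= Delta(w_bar^(0)) / (Theta * eps)
        return int(np.ceil(delta_0 / (theta_fac * eps)))

    Theta = federated_factor(theta=0.1, mu=20.0, mu_tilde=19.0, L=1.0, sigma_bar_sq=0.5)
    if Theta > 0:
        print(global_iterations(delta_0=10.0, theta_fac=Theta, eps=1e-2))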

Page 11:

Parameter Optimization

Figure 1: The effect of the weight factor $\psi$ on the solution to problem (23), with $L = 1$ and $\bar\sigma = 0.5$ (these two values can be estimated by sampling a real-world dataset).

optimize the FedProxVR’s parameters including the federatedfactor.

4.3 Optimizing FedProxVR's Parameters

Denoting by d_cmp and d_com the device's computation delay (i.e., steps 7 and 8 in Algorithm 1) and the communication delay for sending local model updates to the server, respectively, the total training time of FedProxVR is

$$\mathcal{T} := T\,(d_{\mathrm{com}} + d_{\mathrm{cmp}}\,\tau). \qquad (19)$$

Defining a weight factor $\gamma := d_{\mathrm{cmp}}/d_{\mathrm{com}}$ and using $T = \Delta(\bar{w}^{(0)})/(\epsilon\Theta)$, we minimize $\mathcal{T}$ with the convergence conditions as constraints:

$$\min_{\mu,\,\theta,\,\beta,\,\tau}\ \frac{1}{\Theta}\big(1 + \gamma\tau\big) \qquad (20)$$
$$\text{subject to } (15),\ (16),\ \text{and } \Theta > 0. \qquad (21)$$

By removing constraints (15) and (16) and substituting (with SARAH)

$$\theta^2 = \frac{24(\bar{\sigma}^2 L^2 + \mu^2)}{\tilde{\mu} L\,(5\beta^2 - 4\beta)(\beta - 3)} \qquad (22)$$

into Θ, we further simplify this optimization problem as

$$\min_{\mu,\,\beta}\ \frac{1}{\Theta}\Big(1 + \gamma\,\frac{5\beta^2 - 4\beta}{8}\Big) \qquad (23)$$
$$\text{subject to } \beta > 3 \ \text{and}\ \Theta > 0, \qquad (24)$$

which has fewer variables and constraints than the original form (20). Problem (23) is unfortunately non-convex. However, since there are only two variables to optimize, we can employ numerical methods to find the globally optimal solution. We numerically illustrate how the weight factor γ affects the optimal parameters in Fig. 1. When γ is very small, meaning that the communication delay is much more expensive than the local computation delay, we see that the optimal β (and thus τ) is very large, i.e., devices are better off performing more local computation than communication rounds. When γ increases, the optimal β decreases so that the local model update can be solved approximately with a smaller τ, while the optimal µ increases to ensure Θ > 0 due to the correspondingly increasing value of θ. We also observe that a large σ̄² increases the optimal µ and β, but decreases θ and Θ. All of the numerical observations in Fig. 1 exactly match the theoretical remarks of Lemma 1 and Theorem 1.
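One way to solve problem (23) numerically is a simple grid search over (µ, β), as sketched below. The closed forms of Θ and θ follow the reconstructed equations above, and the constants (L, σ̄², γ) and the assumption µ̃ = µ − L are illustrative placeholders rather than values from the paper.

```python
import numpy as np

# Assumed problem constants (illustrative only): smoothness L, heterogeneity
# bound sigma_bar^2, and weight factor gamma = d_cmp / d_com.
L, sigma2, gamma = 1.0, 0.25, 0.01

def theta_sq(mu, beta):
    """theta^2 from the reconstructed Eq. (22), SARAH case."""
    mu_t = mu - L                      # assumed definition of mu_tilde
    return 24 * (sigma2 * L**2 + mu**2) / (mu_t * L * (5*beta**2 - 4*beta) * (beta - 3))

def Theta(mu, beta):
    """Federated factor from Theorem 1 (reconstructed form)."""
    mu_t = mu - L
    t2 = theta_sq(mu, beta)
    a = (1 + t2) * (1 + sigma2)
    return (1 - np.sqrt(2 * t2 * (1 + sigma2))
              - 2*L/mu_t * np.sqrt(a)
              - 2*L*mu/mu_t**2 * a) / mu

def objective(mu, beta):
    """Normalized training time (1/Theta)(1 + gamma*(5*beta^2 - 4*beta)/8), Eq. (23)."""
    th = Theta(mu, beta)
    if th <= 0:                        # infeasible: constraint Theta > 0 violated
        return np.inf
    return (1 + gamma * (5*beta**2 - 4*beta) / 8) / th

# Grid search over the two decision variables (beta > 3, and mu > L so mu_tilde > 0).
grid_mu = np.linspace(L + 0.5, 50, 400)
grid_beta = np.linspace(3.1, 60, 400)
best = min(((objective(m, b), m, b) for m in grid_mu for b in grid_beta), key=lambda x: x[0])
print(f"optimal mu ~ {best[1]:.2f}, beta ~ {best[2]:.2f}, objective ~ {best[0]:.2f}")
```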

5 EXPERIMENTS

In this section, we examine the efficacy of FedProxVR compared to the SGD-based FedAvg [20] through real-world experiments. We also show how FedProxVR's empirical convergence relates to its theoretical results by varying its control hyperparameters. All code and data are ready to be published on GitHub [7].

Experimental settings: To evaluate the performance of FedProxVR on various tasks and learning models, we use different types of datasets in our experiments. Besides a "Synthetic" dataset that captures statistical heterogeneity as in [16, 26], we also consider real datasets such as "MNIST" [15] and "FASHION-MNIST" [33] for image classification tasks using both convex and non-convex models. All datasets are split randomly, with 75% for training and 25% for testing.

To generate datasets for devices that mimic the heterogeneous nature of FL, we simulate 100 devices for the convex model (multinomial logistic regression) and 10 devices for a non-convex convolutional neural network (CNN) model (since it would take drastically longer to run a CNN with 100 devices); each device has a different sample size, generated according to a power law as in [16]. Furthermore, each device contains only two of the 10 labels. The number of data samples per device lies in the ranges [37, 3277], [454, 3939], and [37, 1350] for "Synthetic", "MNIST", and "FASHION-MNIST", respectively. We implement FedProxVR and SGD-based FedAvg using the TensorFlow framework.
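A minimal sketch of this kind of heterogeneous partition is shown below. The power-law exponent, the per-class draw, and the clipping range (taken from the Synthetic range above) are illustrative assumptions, not the exact generator used for the experiments.

```python
import numpy as np

def partition_non_iid(labels, n_devices=100, labels_per_device=2,
                      lo=37, hi=3277, seed=0):
    """Assign each device two labels and a power-law sample count (illustrative)."""
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    # Power-law sample sizes, clipped to an assumed per-device range.
    sizes = np.clip((rng.pareto(a=1.5, size=n_devices) + 1) * lo, lo, hi).astype(int)
    # Pool of indices per class, so each device draws from its two labels only.
    by_class = {c: np.where(labels == c)[0] for c in range(n_classes)}
    devices = []
    for i in range(n_devices):
        classes = rng.choice(n_classes, size=labels_per_device, replace=False)
        idx = []
        for c in classes:
            take = sizes[i] // labels_per_device
            idx.extend(rng.choice(by_class[c], size=take, replace=True))
        devices.append(np.array(idx))
    return devices  # list of index arrays, one per simulated device

# Example usage: devices = partition_non_iid(y_train) with y_train the MNIST labels.
```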

To allow a fair comparison, all algorithms use the same parameters β, τ, N, T during the experiments. In the final experiment, the optimal hyperparameters of each algorithm are used for the performance comparison. We deploy image classification with a multinomial logistic regression model for the convex task and a two-layer CNN model for the non-convex task. Regarding the CNN model, we follow a structure similar to that in [20], with two 5x5 convolution layers (32 and 64 channels for the first and second layers, respectively, each followed by 2x2 max pooling), ReLU activation, and a softmax layer at the end. Although mini-batching is not mentioned in Alg. 1, the experiments use mini-batches to tackle the challenge of finding the optimal local point when the number of data points is large.
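A Keras sketch of this CNN is given below; the input shape, "same" padding, and the 10-class output are assumptions chosen for 28x28 grayscale images, since only the convolutional structure is specified above.

```python
import tensorflow as tf

def build_cnn(input_shape=(28, 28, 1), n_classes=10):
    """Two 5x5 conv layers (32 and 64 channels), each followed by 2x2 max pooling,
    ReLU activations, and a softmax output layer, as described above."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu",
                               input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```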

Effects of step-size parameter β and local iterations τ: We first compare the convergence of FedProxVR and FedAvg in Figs. 2 and 3 under different hyperparameter settings. In both figures, we first choose the value of τ, and then determine β based on its upper bound in Lemma 1 such that the algorithms empirically converge. While the upper bound of β depends only on τ, its lower bound is determined by parameters such as L and µ̃, which are more difficult to estimate from the datasets and learning tasks. We start with small values of β and τ and then increase them to observe the convergence behavior of FedProxVR and the effect of the weight factor γ on the optimal parameters β and τ.


Solution using numerical methods

Page 12:


Parameter Optimization

When 𝛾 is small:
• Communication is more expensive than local computation
• Optimal 𝛽 (thus 𝜏) is large
• Devices are better off having more local computation than communication rounds

Large σ̄² leads to:
• higher 𝛽 and 𝜇
• lower 𝜃 and Θ
Devices will have to run more local iterations

Page 13:


Experiments

Datasets:
• Synthetic: logistic regression
• Real: MNIST and FASHION-MNIST
• 75% training, 25% test

Models: convex and non-convex

Federated setting:
• 100 users for the convex task
• 10 users for the non-convex task
• Data distributed by the power law*

*T. Li et al., “Federated Optimization in Heterogeneous Networks,” in Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020.

Page 14:


Experiments

Effects of step size parameter 𝛽 and local iterations 𝜏

Page 15:


Experiments

(Figures: test accuracy)

Effects of step size parameter 𝛽 and local iterations 𝜏

Page 16:


Experiments

Effects of proximal penalty 𝜇

Page 17:


Experiments

Comparisons with FedAvg

Table 1: Comparing the models' test accuracies using their best hyperparameters on the convex task.

Algorithm            β    τ    µ     B    T    Accuracy
FedAvg               10   10   0     16   983  84.02%
FedProxVR (SVRG)     20   10   0.1   32   895  84.12%
FedProxVR (SARAH)    20   5    0.1   32   965  84.21%

Table 2: Comparing the models' test accuracies using their best hyperparameters on the non-convex task.

Algorithm            β    τ    µ      B    T    Accuracy
FedAvg               20   10   0      16   995  93.52%
FedProxVR (SVRG)     20   10   0.01   16   970  94.06%
FedProxVR (SARAH)    20   9    0.01   32   958  93.75%

SVRG and SARAH) and a (for FedProxVR using SVRG), thus violating Lemma 1, the learning curves of FedProxVR fluctuate much more noticeably, although the performances of FedProxVR and FedAvg are still improved and distinguishable. Therefore, with a choice of β such that its lower- and upper-bound conditions are satisfied, FedProxVR is expected to converge better than FedAvg.

The performances of FedProxVR and FedAvg on the non-convex task are highlighted in Fig. 3. Here, we observe a similar outcome to our experiment in the convex setting, and the performance gap between FedProxVR and FedAvg is slightly larger.

Effects of proximal penalty µ on global iterations T: We evaluate the effect of the proximal penalty µ on the convergence of FedProxVR in Fig. 4. Using FedProxVR on the Synthetic dataset, we observe that the training loss diverges when µ = 0, and that increasing µ > 0 stabilizes the loss, allowing it to converge. However, it is also noticeable that larger values of µ make the convergence of FedProxVR slower. Therefore, µ also reflects the trade-off between the smoothness of the learning curve and the convergence speed of FedProxVR.

Performance comparison using optimized parameters: As the algorithms behave differently under the same hyperparameters (e.g., µ, β, and τ in our experiments), we conduct a random search over carefully chosen ranges of hyperparameters to determine which combination yields the highest test accuracy for each algorithm. The results are captured in Tables 1 and 2. When using their optimized hyperparameters, FedProxVR improves upon the accuracy of FedAvg on both the convex and non-convex tasks. Also, while FedAvg performs better with smaller batch sizes on the convex task, FedProxVR benefits from larger batch sizes. Finally, on both tasks, FedProxVR starts to converge earlier than FedAvg.
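The random search described above could look like the sketch below. The search ranges, the number of trials, and the train_and_evaluate interface are illustrative assumptions, not the exact procedure or value grids used in the experiments.

```python
import random

# Assumed search ranges for the hyperparameters mentioned above.
SEARCH_SPACE = {
    "mu":    [0.0, 0.01, 0.1, 1.0],
    "beta":  [5, 10, 20, 40],
    "tau":   [5, 10, 20],
    "batch": [16, 32, 64],
}

def random_search(train_and_evaluate, n_trials=30, seed=0):
    """Sample hyperparameter combinations and keep the one with the best test accuracy.

    `train_and_evaluate(config) -> float` is a placeholder callback that trains the
    chosen algorithm with `config` and returns its test accuracy.
    """
    rng = random.Random(seed)
    best_cfg, best_acc = None, -1.0
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        acc = train_and_evaluate(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```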

6 CONCLUSIONS

In this paper, we propose an algorithm for FL using proximal stochastic variance reduced gradient methods, which addresses the heterogeneity challenges of FL caused by massive numbers of participating devices with non-identically distributed data sources. In the proposed algorithm, each user device independently and approximately solves its learning problem for a number of local iterations to produce a local model update, which is then sent to the server for the global model update. We characterize the convergence of both the local and global model updates, which provides several fruitful insights for algorithm design. We also propose how to find the optimal algorithm parameters so as to minimize the FL training time. Using TensorFlow, we validate the theoretical findings by presenting the empirical convergence of the proposed algorithm on various real and synthetic datasets, showing that our algorithm can boost the convergence speed compared to SGD-based approaches for FL.

REFERENCES
[1] 2017. We Are Making On-Device AI Ubiquitous. https://www.qualcomm.com/news/onq/2017/08/16/we-are-making-device-ai-ubiquitous.
[2] Mohammad Mohammadi Amiri and Deniz Gunduz. 2019. Machine Learning at the Wireless Edge: Distributed Stochastic Gradient Descent Over-the-Air. arXiv:1901.00844 [cs, math] (Jan. 2019).
[3] L. Bottou, F. Curtis, and J. Nocedal. 2018. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 60, 2 (Jan. 2018), 223–311. https://doi.org/10.1137/16M1080173
[4] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. 2018. Accelerated Methods for NonConvex Optimization. SIAM Journal on Optimization 28, 2 (Jan. 2018), 1751–1772. https://doi.org/10.1137/17M1114296
[5] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large Scale Distributed Deep Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'12). Curran Associates Inc., Lake Tahoe, Nevada, 1223–1231.
[6] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. 2012. Optimal Distributed Online Prediction Using Mini-Batches. J. Mach. Learn. Res. 13 (Jan. 2012), 165–202.
[7] Charlie Dinh. 2020. CharlieDinh/FederatedLearningWithSVRG. https://github.com/CharlieDinh/FederatedLearningWithSVRG
[8] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. 2016. Mini-Batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Mathematical Programming 155, 1-2 (Jan. 2016), 267–305. https://doi.org/10.1007/s10107-014-0846-1
[9] Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J. Smola. 2016. Proximal Stochastic Methods for Nonsmooth Nonconvex Finite-Sum Optimization. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 1145–1153.
[10] Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. 2014. Communication-Efficient Distributed Dual Coordinate Ascent. arXiv:1409.1458 [cs, math, stat] (Sept. 2014).
[11] Rie Johnson and Tong Zhang. 2013. Accelerating Stochastic Gradient Descent Using Predictive Variance Reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'13). Curran Associates Inc., Lake Tahoe, Nevada, 315–323.
[12] Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv:1610.02527 [cs] (Oct. 2016).
[13] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated Learning: Strategies for Improving Communication Efficiency. arXiv:1610.05492 (Oct. 2016).
[14] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep Learning. Nature 521, 7553 (May 2015), 436–444. https://doi.org/10.1038/nature14539
[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 86, 11 (Nov. 1998), 2278–2324. https://doi.org/10.1109/5.726791
[16] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2019. Federated Optimization for Heterogeneous Networks. In Proceedings of the 1st Adaptive & Multitask Learning ICML Workshop, 2019. Long Beach, CA, 16.
[17] Zhize Li and Jian Li. 2018. A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18). Curran Associates Inc., Montréal, Canada, 5569–5579.
[18] Zhize Li and Jian Li. 2018. A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization. arXiv:1802.04477 [cs, math, stat] (Feb. 2018).
[19] Chenxin Ma, Jakub Konečný, Martin Jaggi, Virginia Smith, Michael I. Jordan, Peter Richtárik, and Martin Takáč. 2017. Distributed Optimization with Arbitrary

