Improving Hadoop Cluster Performance via Linux Configuration

Date post: 12-Jul-2015
Improving Hadoop Cluster Performance via Linux Configuration DevIgnition 2014 – Dulles, Virginia Alex Moundalexis // @technmsg

Tips from a former system administrator  

CC BY 2.0 / Richard Bumgardner

Been there, done that.  

Tips from a former system administrator field guy  

CC BY 2.0 / Alex Moundalexis

Home sweet home.  

Tips Easy steps to take… that most people don't.  

What this talk isn't about  

• Deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• Sizing & Tuning
  • Depends heavily on data and workload
• Coding
  • Unless you count STDOUT redirection
• Algorithms
  • I suck at math, but we'll try some multiplication later

"The answer to most Hadoop questions is…  

"The answer to most Hadoop questions is… it depends." (helpful, right?)

© Cloudera, Inc. All rights reserved.

So what ARE we talking about?  

• Seven simple things
  • Quick
  • Safe
  • Viable for most environments and use cases
• Identify issue, then offer solution
  • Note: Commands run as root or sudo

1. Swapping Bad news, best not to.  

Swapping  


• A form of memory management
• When OS runs low on memory…
  • write blocks to disk
  • use now-free memory for other things
  • read blocks back into memory from disk when needed
• Also known as paging

Swapping  


• Problem: Disks are slow, especially to seek
• Hadoop is about maximizing IO
  • spend less time acquiring data
  • operate on data in place
  • large streaming reads/writes from disk
• Memory usage is somewhat limited within JVM
  • we should be able to manage our memory
  • account for JVM overhead

Limit swapping in kernel  

• Well, as much as possible.
• Immediate:
  # echo 1 > /proc/sys/vm/swappiness
• Persist after reboot:
  # echo "vm.swappiness = 1" >> /etc/sysctl.conf
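Appending with >> adds another line every time the command is rerun, and an older vm.swappiness entry in the same file would still be there. A minimal sketch of an idempotent alternative follows; persist_sysctl is a hypothetical helper name, not a standard tool, and the file is passed as an argument so it can be tried on a scratch copy first.

```sh
# persist_sysctl KEY VALUE FILE  (hypothetical helper, not a standard tool)
# Rewrites FILE so that it ends up with exactly one "KEY = VALUE" line.
persist_sysctl() {
    key=$1; value=$2; file=$3
    tmp=$(mktemp) || return 1
    if [ -f "$file" ]; then
        # Copy every line except existing assignments of KEY
        # (with or without spaces around "=").
        awk -v k="$key" '{
            line = $0
            sub(/^[ \t]+/, "", line)
            split(line, parts, /[ \t]*=/)
            if (parts[1] != k) print
        }' "$file" > "$tmp"
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so the target keeps its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (as root): persist the setting, then load it into the running kernel.
#   persist_sysctl vm.swappiness 1 /etc/sysctl.conf
#   sysctl -p
```

Running it twice leaves a single vm.swappiness line, which makes it safer to call from a provisioning script.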

Swapping peculiarities  

• A setting of 0 behaves differently depending on the Linux kernel
  • CentOS 6.4+ / Ubuntu 10.10+
  • For you kernel gurus, that's Linux 2.6.32-303+
• Prior
  • We don't swap, except to avoid OOM condition.
• After
  • We don't swap, ever.
• Details: http://tiny.cloudera.com/noswap

2. File Access Time Disable this too.  

File access time  

• Linux tracks access time
  • writes to disk even if all you did was read
• Problem
  • more disk seeks
  • HDFS is write-once, read-many
  • NameNode tracks access information for HDFS

Don't track access time  

• Mount volumes with noatime option
  • In /etc/fstab:
    /dev/sdc /data01 ext3 defaults,noatime 0
• Note: noatime implies nodiratime as well
• What about relatime?
  • Faster than atime but slower than noatime
• No reboot required
  • # mount -o remount /data01
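After remounting, it is worth confirming that the option actually took effect. The sketch below reads the kernel's mount table; has_noatime is a hypothetical helper name, and /data02 in the example is a placeholder for a second data mount.

```sh
# has_noatime MOUNTPOINT [MOUNTS_FILE]  (hypothetical helper)
# Succeeds if MOUNTPOINT appears in MOUNTS_FILE with the noatime option.
# MOUNTS_FILE defaults to /proc/mounts, whose fields are:
#   device mountpoint fstype options dump pass
has_noatime() {
    awk -v mp="$1" '
        $2 == mp {
            found = 1; ok = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "noatime") ok = 1
        }
        END { exit !(found && ok) }
    ' "${2:-/proc/mounts}"
}

# Example: report every data mount that still tracks access time.
#   for d in /data01 /data02; do
#       has_noatime "$d" || echo "$d: access time still tracked"
#   done
```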

3. Root Reserved Space Reclaim it, impress your bosses!  

Root reserved space  

• EXT3/4 reserve 5% of disk for root-owned files
  • On an OS disk, sure
  • System logs, kernel panics, etc

CC BY 2.0 / Alex Moundalexis

Disks used to be much smaller, right?  

Do the math  

• Conservative
  • 5% of 1 TB disk = 46 GB
  • 5 data disks per server = 230 GB
  • 5 servers per rack = 1.15 TB
• Quasi-Aggressive
  • 5% of 4 TB disk = 186 GB
  • 12 data disks per server = 2.23 TB
  • 18 servers per rack = 40.1 TB
• That's a LOT of unused storage!
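The figures above can be reproduced with a few lines of shell. This is a sketch under one assumption: that the slide treats a "1 TB" disk as 10^12 bytes and reports the result in binary gigabytes, which is what yields 46 rather than 50. reserved_gb is a hypothetical helper name.

```sh
# reserved_gb DISK_TB [PERCENT]  (hypothetical helper)
# Prints the reserved space in binary gigabytes, rounded down, for a disk
# sold as DISK_TB decimal terabytes. PERCENT defaults to 5.
reserved_gb() {
    awk -v tb="$1" -v pct="${2:-5}" \
        'BEGIN { printf "%d\n", tb * 1e12 * pct / 100 / (1024 * 1024 * 1024) }'
}

per_disk=$(reserved_gb 4)          # 4 TB disk
per_server=$((per_disk * 12))      # 12 data disks per server
per_rack=$((per_server * 18))      # 18 servers per rack
echo "$per_disk GB per disk, $per_server GB per server, $per_rack GB per rack"
```

Substituting your own disk size, disk count, and rack density gives a quick estimate of how much capacity the default reserve is holding back.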

Root reserved space  

• On a Hadoop data disk, no root-owned files
• When creating a partition
  # mkfs.ext3 -m 0 /dev/sdc
• On existing partitions
  # tune2fs -m 0 /dev/sdc
  • 0 is safe, 1 is for the ultra-paranoid
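To confirm the change, the reserve can be read back from the filesystem superblock. A small sketch, assuming the usual "Reserved block count:" line in the output of tune2fs -l; reserved_blocks is a hypothetical helper name.

```sh
# reserved_blocks  (hypothetical helper)
# Reads `tune2fs -l` output on stdin and prints the reserved block count.
reserved_blocks() {
    awk -F: '/^Reserved block count/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Example (as root); a result of 0 means no space is held back for root.
#   tune2fs -l /dev/sdc | reserved_blocks
```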

4. Name Service Cache Turn it on, already!  

Name Service Cache Daemon  

• Daemon that caches name service requests
  • Passwords
  • Groups
  • Hosts
• Helps weather network hiccups
• Helps more with high latency LDAP, NIS, NIS+
• Small footprint
• Zero configuration required

Name Service Cache Daemon  

• Hadoop nodes
  • largely a network-based application
  • on the network constantly
  • issue lots of name lookups, especially HBase & distcp
  • can thrash name servers
• Reducing latency of service requests? Smart.
• Reducing impact on shared infrastructure? Smart.

Name Service Cache Daemon  

• Turn it on, let it work, leave it alone:
  # chkconfig --level 345 nscd on
  # service nscd start
• Check on it later:
  # nscd -g
• Unless using Red Hat SSSD; modify nscd config first!
  • Don't use nscd to cache passwd, group, or netgroup
  • Red Hat, Using NSCD with SSSD. http://goo.gl/68HTMQ

5. File Handle Limits Not a problem, until they are.  

File handle limits  

• Kernel refers to files via a handle
  • Also called descriptors
• Linux is a multi-user system
• File handles protect the system from
  • Poor coding
  • Malicious users
  • Poor coding of malicious users
  • Pictures of cats on the Internet

Microsoft Office EULA. Really.

java.io.FileNotFoundException: (Too many open files)  

File handle limits  

• Linux defaults usually not enough
• Increase maximum open files (default 1024)
  # echo hdfs - nofile 32768 >> /etc/security/limits.conf
  # echo mapred - nofile 32768 >> /etc/security/limits.conf
  # echo hbase - nofile 32768 >> /etc/security/limits.conf
• Bonus: Increase maximum processes too
  # echo hdfs - nproc 32768 >> /etc/security/limits.conf
  # echo mapred - nproc 32768 >> /etc/security/limits.conf
  # echo hbase - nproc 32768 >> /etc/security/limits.conf
• Note: Cloudera Manager will do this for you.
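Entries in limits.conf are applied by PAM when a session starts, so they do not change shells or daemons that are already running. A minimal sketch for checking the effective limit; nofile_ok is a hypothetical helper name, and the su example assumes an hdfs account exists on the node.

```sh
# nofile_ok MINIMUM  (hypothetical helper)
# Succeeds if the current shell's open-file limit is at least MINIMUM.
nofile_ok() {
    limit=$(ulimit -n)
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$1" ]
}

if nofile_ok 32768; then
    echo "open-file limit is at least 32768"
else
    echo "open-file limit is only $(ulimit -n)"
fi

# Example: check from a fresh login session of the service account.
#   su - hdfs -c 'ulimit -n'
```

For a daemon that is already running, /proc/<pid>/limits shows the limits it actually started with.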

6. Dedicated Disks Don't be tempted to share, even with monster disks.

The Situation  

1. Your new server has a dozen 1 TB disks
2. Eleven disks are used to store data
3. One disk is used for the OS
  • 20 GB for the OS
  • 980 GB sits unused
4. Someone asks "can we store data there too?"
5. Seems reasonable, lots of space… "OK, why not."

Sound familiar?

Microsoft Office EULA. Really.

"I don't understand it, there's no consistency to these run times!"  

No love for shared disk  

• Our quest for data gets interrupted a lot:
  • OS operations
  • OS logs
  • Hadoop logging, quite chatty
  • Hadoop execution
  • userspace execution
• Disk seeks are slow, remember?

Dedicated disk for OS and logs  

• At install time
  • Disk 0, OS & logs
  • Disk 1-n, Hadoop data
• After install, more complicated effort, requires manual HDFS block rebalancing:
  1. Take down HDFS
    • If you can do it in under 10 minutes, just the DataNode
  2. Move or distribute blocks from disk0/dir to disk[1-n]/dir
  3. Remove dir from HDFS config (dfs.data.dir)
  4. Start HDFS

7. Name Resolution Sane, both forward and reverse.  

Name resolution options  

1. Hosts file, if you must
2. DNS, much preferred

Name resolution with hosts file  

• Set canonical names properly
• Right
  r01m01.cluster.org r01m01 master1
  r01w01.cluster.org r01w01 worker1
• Wrong
  r01m01 r01m01.cluster.org master1
  r01w01 r01w01.cluster.org worker1
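The rule above (fully qualified name first, short aliases after) is easy to check mechanically. The sketch below flags hosts entries whose first name contains no dot; check_hosts is a hypothetical helper name, and skipping localhost and ip6-* entries is an assumption about what a typical distribution ships in /etc/hosts.

```sh
# check_hosts [HOSTS_FILE]  (hypothetical helper)
# Prints every entry whose canonical name (the first name after the address)
# is a short name rather than an FQDN. Succeeds only if none are found.
check_hosts() {
    awk '
        { sub(/#.*/, "") }                           # strip comments
        NF < 2 { next }                              # blank or address-only line
        $2 == "localhost" || $2 ~ /^ip6-/ { next }   # loopback and IPv6 defaults
        $2 !~ /\./ { print "short canonical name: " $0; bad = 1 }
        END { exit bad + 0 }
    ' "${1:-/etc/hosts}"
}

# Example:
#   check_hosts && echo "canonical names look sane"
```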

Name resolution with hosts file  

• Set loopback address properly
  • Ensure resolves to "localhost," NOT hostname
• Right
  localhost
• Wrong
  r01m01

Name resolution with DNS  

• Forward
• Reverse
• Hostname should match the FQDN in DNS

This is what you ought to see  

Name resolution errata  

• Mismatches? Expect odd results.
  • Problems starting DataNodes
  • Non-FQDN in Web UI links
  • Security features are extra sensitive to FQDN
• Errors so common that link to FAQ is included in logs!
  • http://wiki.apache.org/hadoop/UnknownHost
• Get name resolution working BEFORE enabling nscd!

Summary Now is the appropriate time to take out your camera phone.  

A white background is supposedly better for printing. (who prints things anymore?)

A white background is supposedly better for printing. (but makes for very pale slides)


1. limit swapping: set vm.swappiness to 1
2. data disks: mount with noatime option
3. data disks: disable root reserved space
4. enable nscd
5. increase file handle limits
6. use dedicated OS/logging disk
7. sane name resolution


Recommended reading  

• Hadoop Operations http://amzn.to/1ydMrLf

Questions? Preferably related to the talk…  

Thanks! Alex Moundalexis | @technmsg  

8. Bonus Round Because we have enough time (or I talked really fast)…  

Other things to check  

• Disk IO
  • hdparm
    • # hdparm -Tt /dev/sdc
    • Looking for at least 70 MB/s from 7200 RPM disks
    • Slower could indicate a failing drive, disk controller, array, etc.
  • dd
    • http://romanrm.ru/en/dd-benchmark

Other things to check  

• Disable Red Hat Transparent Huge Pages (RH6+ until 6.5)
  • Can reduce elevated CPU usage
  • In rc.local:
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
• Reference: Linux 6 Transparent Huge Pages and Hadoop Workloads, http://goo.gl/WSF2qC
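The kernel reports the active Transparent Huge Pages choice by wrapping it in square brackets, for example "always madvise [never]". A small sketch for reading it back after a reboot; thp_setting is a hypothetical helper name, and the paths are the Red Hat 6 ones from the rc.local lines above (other kernels typically use /sys/kernel/mm/transparent_hugepage instead).

```sh
# thp_setting FILE  (hypothetical helper)
# Prints the bracketed (active) word from a transparent hugepage sysfs file.
thp_setting() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$1"
}

# Report both knobs if they exist on this kernel.
for knob in enabled defrag; do
    path=/sys/kernel/mm/redhat_transparent_hugepage/$knob
    if [ -r "$path" ]; then
        echo "$knob: $(thp_setting "$path")"
    fi
done
```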

Other things to check  

• Enable Jumbo Frames
  • Only if your network infrastructure supports it!
  • Can easily (and arguably) boost throughput by 10-20%
• Monitor and Chart Everything
  • How else will you know what's happening?
  • Nagios
  • Ganglia
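On the jumbo frames point, a quick way to see what a node is currently using is to read the interface MTU from sysfs. This is only a sketch: mtu_of is a hypothetical helper name, eth0 and worker1 are placeholders, and 9000 is assumed as the jumbo frame size.

```sh
# mtu_of IFACE [SYSFS_NET_DIR]  (hypothetical helper)
# Prints the MTU the kernel reports for IFACE.
mtu_of() {
    cat "${2:-/sys/class/net}/$1/mtu"
}

# Example:
#   mtu_of eth0
# Then verify the whole path carries full-size frames without fragmenting;
# 8972 = 9000 minus 20 bytes of IP header and 8 bytes of ICMP header.
#   ping -M do -s 8972 -c 3 worker1
```

If the ping fails while a normal ping succeeds, some switch or host along the path is not passing jumbo frames.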

Questions? Preferably related to the talk…  

Thanks! Alex Moundalexis | @technmsg  
