When we upgraded our Ceph cluster from FileStore to BlueStore, our iowait went through the roof. Users complained that their web pages kept hanging, among all sorts of other issues.
This year in iowait:
- Between May and June we upgraded to BlueStore. The iowait seems to have slowly crept upwards after that.
- In September we added a bunch of SSDs and moved all our running VMs to them. This resulted in a big drop in iowait.
Although moving to SSDs drastically reduced our iowait, it was costly and hasn’t completely solved our problem. We still have brief moments, at various points every day, where the entire cluster locks up (all VMs stop responding at the same time).
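For anyone wanting to track iowait on their own nodes, here is a minimal sketch of how the figure can be derived on Linux from two samples of the aggregate `cpu` line in `/proc/stat`. This is not the tooling behind our numbers; the function names (`parse_cpu_times`, `iowait_percent`, `read_cpu_line`, `sample_iowait`) are made up for illustration, and the sketch assumes a kernel that exposes at least the first five counters on that line.

```python
import time


def parse_cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so only the
    # first eight counters are summed for the total.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as stat_file:
        return stat_file.readline()


def sample_iowait(interval=1.0):
    """Take two samples `interval` seconds apart and return the iowait percentage."""
    before = parse_cpu_times(read_cpu_line())
    time.sleep(interval)
    after = parse_cpu_times(read_cpu_line())
    return iowait_percent(before, after)
```

Calling `sample_iowait(1.0)` in a loop on each host and logging the result with a timestamp gives a rough time series. A short interval matters here, because lock-ups that last only a moment can disappear in longer averages.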