Almost two months ago we started experiencing a strange problem. We are hosted on AWS.
Basically, the CPU spikes to 100% whatever instance type we use and stays there until we reboot the RDS instance.
This used to happen rarely, but since we upgraded Magento from 1.7.2 to 1.8.1 it happens often.
We have tried optimising the store and have checked all core files; Magento is clean (no core files are modified).
We have now reached the point where every time we save the configuration or clear the Magento cache in the backend, the CPU spikes to 100% and RDS freezes.
There are no server or PHP errors. Sometimes it happens out of nowhere. Once the CPU reaches 100%, the
only option is to reboot the RDS instance. During that time, many "php-fpm: pool **DB NAME**" processes appear in the process list.
Because nothing is logged, it is almost impossible to track down. We have not added any new extensions, and everything was fine before.
We have the slow query log enabled, but there are not many slow queries during the day, and I don't think they would make the store freeze
so quickly. Also, most of them come from third-party extensions, and we cannot fix all of them.
We use Varnish for caching. Could you please take a look at this?
Could the issue be a misconfigured server or RDS instance?
Thanks in advance.
Which cache backend do you use, and where are you storing sessions?
In the past we have experienced high load when the cache grew large. Colin Mollenhour's Cm_Cache_Backend_File handles large caches much better than the default Zend_Cache_Backend_File. Alternatively, you could use Redis with Cm_Cache_Backend_Redis; the advantage of Redis is that it handles cache evictions itself.
If you store sessions in the database, you may find that deleting expired sessions is slow. This slows the site down, because updates to the session table are blocked whilst the delete is in progress.
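To see whether expired sessions have piled up, a quick count may help. This is a minimal sketch, assuming Magento 1's database session table is `core_session` with a `session_expires` UNIX timestamp column (set SESSION_TABLE if you use a table prefix), and that DB_HOST and DB_USER are hypothetical environment variables holding your RDS endpoint and user, with the password supplied through MYSQL_PWD.

```shell
#!/bin/sh
# Sketch: count database-backed sessions and how many have already expired.
# Assumptions: sessions live in core_session (override SESSION_TABLE for a prefix);
# DB_HOST and DB_USER are hypothetical env vars; the mysql client reads MYSQL_PWD.

SESSION_TABLE="${SESSION_TABLE:-core_session}"

# Print a query counting all sessions and those whose expiry time has passed.
session_count_sql() {
  printf 'SELECT COUNT(*) AS total, SUM(session_expires < UNIX_TIMESTAMP()) AS expired FROM %s;\n' "$1"
}

# Only query when a database name is given, e.g. ./check-sessions.sh magento
if [ "$#" -gt 0 ]; then
  mysql -h "$DB_HOST" -u "$DB_USER" "$1" -e "$(session_count_sql "$SESSION_TABLE")"
fi
```

A large "expired" figure suggests the cleanup delete has a lot of work to do each time it runs.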
Sorry, re-reading your question, I realise it's a problem with a cold cache. You could try using innotop
innotop -h rds-host-endpoint -u username -p password
to look at the activity on the database during these spikes.
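If the spike makes interactive tools hard to use, you could also record snapshots of the process list while it happens. This is a minimal sketch, assuming the mysql client is installed and that DB_HOST and DB_USER are hypothetical environment variables for your RDS endpoint and user, with the password in MYSQL_PWD (deprecated in newer MySQL clients; an option file is safer).

```shell
#!/bin/sh
# Sketch: save SHOW FULL PROCESSLIST snapshots every 5 seconds during a spike.
# Assumptions: DB_HOST and DB_USER are hypothetical env vars; mysql reads MYSQL_PWD.

# Print the file name used for a snapshot taken at the given UNIX timestamp.
snapshot_file() {
  printf '/tmp/rds-processlist.%s.txt\n' "$1"
}

take_snapshots() {
  count="$1"
  i=0
  while [ "$i" -lt "$count" ]; do
    out=$(snapshot_file "$(date +%s)")
    # SHOW FULL PROCESSLIST lists each connection with its state and full query text.
    mysql --connect-timeout=5 -h "$DB_HOST" -u "$DB_USER" -e 'SHOW FULL PROCESSLIST' > "$out"
    i=$((i + 1))
    sleep 5
  done
}

# Only run when a snapshot count is given, e.g. ./snapshots.sh 12
if [ "$#" -gt 0 ]; then
  take_snapshots "$1"
fi
```

Comparing consecutive snapshots shows which queries stay in the same state while the CPU is pinned.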
You could also use strace to look at the PHP-FPM processes.
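A possible way to do that is to attach strace to every worker of the pool at once. This is a minimal sketch, assuming strace and pgrep are installed, that you run it as root (or as a user allowed to ptrace the workers), and that you pass the pool name shown in the "php-fpm: pool ..." process titles as the first argument.

```shell
#!/bin/sh
# Sketch: attach strace to every php-fpm worker of one pool.
# Assumptions: strace and pgrep exist; caller may ptrace the workers;
# the pool name is the first argument.

# Turn a list of PIDs into "-p PID" options for strace.
strace_pid_args() {
  for pid in "$@"; do
    printf '%s %s ' -p "$pid"
  done
}

trace_pool() {
  pool="$1"
  pids=$(pgrep -f "php-fpm: pool ${pool}")
  if [ -z "$pids" ]; then
    echo "no php-fpm workers found for pool ${pool}" >&2
    return 1
  fi
  # -ff with -o writes one trace file per process (/tmp/fpm-trace.<pid>);
  # -tt adds microsecond timestamps and -T prints the time spent in each
  # syscall, so long waits on the database connection stand out.
  # The option list is left unquoted on purpose so it splits into words.
  strace -ff -tt -T -o /tmp/fpm-trace $(strace_pid_args $pids)
}

# Only attach when a pool name is given, e.g. ./trace-fpm.sh www
if [ "$#" -gt 0 ]; then
  trace_pool "$1"
fi
```

Stop it with Ctrl-C once the spike has been captured; strace then detaches and the workers keep running.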