I'm writing in the hope that someone can give me a fresh perspective on a performance issue that our agency has not been able to resolve despite two weeks of intense research. Please excuse my limited understanding of the technical background, and feel free to correct any of my assumptions.
Approx. two weeks ago something changed in our Magento 1 setup: since then we have been experiencing site outages on working days, multiple times throughout the day, lasting 1-15 minutes each.
General server performance before the issues started on July 9th:
After July 9th we see CPU peaks; during those times the site also goes down:
It all started when a hotfix for price calculation rounding issues was deployed, although the devs are pretty sure the fix is not the root cause and deny any correlation. Since then I have heard a lot of hypotheses, including:
1) Google bot traffic (which was proven not to be unusually high)
2) A Redis issue (restarting Redis helped a few times to stop the outages)
3) External calls against Magento's API (stock, product import) are piling up and blocking other processes -> nothing was changed here compared to the time before the issues started
4) Staging is adding load to the production environment since it runs on the same server <- this was never an issue before; staging is not used that much, there are no cronjobs running on it, etc.
5) Stock import via XML, which imports stock in batches of 10k SKUs over a period of 5-6h -> this was not an issue before
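One way to put numbers on a hypothesis like 3) instead of guessing is to count API requests per minute from the web server access log and compare the shape before and after July 9th. A minimal sketch, assuming a combined access-log format and that the API lives under /api/ (both assumptions; in practice, read the real log file instead of the fabricated sample below):

```python
import re
from collections import Counter

# Fabricated sample access-log lines (combined log format); in practice
# read these from the real web server access log.
sample_lines = [
    '10.0.0.1 - - [09/Jul/2018:10:15:01 +0000] "POST /api/rest/stockitems HTTP/1.1" 200 512',
    '10.0.0.1 - - [09/Jul/2018:10:15:07 +0000] "POST /api/rest/stockitems HTTP/1.1" 200 480',
    '10.0.0.2 - - [09/Jul/2018:10:16:30 +0000] "GET /category.html HTTP/1.1" 200 20480',
    '10.0.0.1 - - [09/Jul/2018:10:16:41 +0000] "POST /api/rest/products HTTP/1.1" 200 733',
]

# Capture the timestamp down to the minute, and the request path.
pattern = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})[^\]]*\] "\w+ (\S+)')

api_hits_per_minute = Counter()
for line in sample_lines:
    m = pattern.search(line)
    if m and m.group(2).startswith("/api/"):
        api_hits_per_minute[m.group(1)] += 1

for minute, hits in sorted(api_hits_per_minute.items()):
    print(minute, hits)
# -> 09/Jul/2018:10:15 2
# -> 09/Jul/2018:10:16 1
```

If the per-minute curve for /api/ looks the same before and after July 9th, that is fairly strong evidence against hypothesis 3.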
What I see in New Relic, our performance monitoring tool, as a weird pattern is an increase in the following calls against the MySQL DB (MySQL salesrule select) starting when the issue first occurred (July 9th); see the following screenshots. Unfortunately I don't understand exactly what is happening here; I only see that the peaks happen at the same time the site becomes unstable and goes down. To me that looks like a correlation, but the devs explain the increase in calls as:
"the cases shown here are those extremes when the site is not working properly; those calls are related to accessing pages that take 400s+ (7-8 min), and they happen only during the server problems"
Before that time, this specific call looked fine:
Does anyone have a rough idea, or other hypotheses about what might be going wrong here?
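One way to test whether the salesrule SELECTs are a cause or just a symptom is to check when they land in the MySQL slow query log relative to the start of each outage: if they pile up before the CPU peaks, they look more like a cause; if only during, more like the symptom the devs describe. A rough sketch, assuming the slow query log is enabled; the excerpt below is fabricated, in the older MySQL 5.x log format:

```python
from collections import Counter

# Fabricated slow-query-log excerpt; real entries come from the file
# configured via MySQL's slow_query_log_file setting.
slow_log = """\
# Time: 180709 10:15:02
# Query_time: 412.3  Lock_time: 0.1 Rows_sent: 12  Rows_examined: 500000
SELECT * FROM salesrule WHERE is_active = 1;
# Time: 180709 10:15:40
# Query_time: 398.7  Lock_time: 0.0 Rows_sent: 12  Rows_examined: 500000
SELECT * FROM salesrule WHERE is_active = 1;
# Time: 180709 10:21:05
# Query_time: 5.2  Lock_time: 0.0 Rows_sent: 1  Rows_examined: 100
SELECT entity_id FROM catalog_product_entity;
"""

# Count salesrule queries per minute ("10:15:02" -> "10:15").
per_minute = Counter()
current_time = None
for line in slow_log.splitlines():
    if line.startswith("# Time:"):
        current_time = line.split()[-1][:5]
    elif "salesrule" in line.lower() and current_time:
        per_minute[current_time] += 1

print(dict(per_minute))  # -> {'10:15': 2}
```

Lining these timestamps up against the CPU graphs from the monitoring would show which one moves first.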
Hey, I also faced this kind of issue in Magento 1 where the site went down. When I checked, it was related to a sales rule coupon code that was enabled; the coupon used some products as a freebie, but the freebie didn't have any stock, so it was going into a loop and consuming 100% CPU. You can try disabling your sales rules.
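For what it's worth, that failure mode can be sketched as a toy model (illustrative names only, not Magento's actual quote/totals code): with stock, the totals recalculation converges; with a zero-stock freebie, the rule keeps firing without ever changing the cart, so it never settles and burns CPU:

```python
# Toy model of a cart-price rule that grants a free item: totals are
# recalculated until nothing changes, or we give up after max_passes.
def collect_totals(rule, stock, max_passes=50):
    cart = ["SKU-A"]  # the paid item that qualifies for the freebie
    for passes in range(1, max_passes + 1):
        changed = False
        if rule["freebie"] not in cart:
            if stock.get(rule["freebie"], 0) > 0:
                cart.append(rule["freebie"])
            # In this toy model the rule fired either way, so the totals
            # are marked dirty and another pass is forced -- even when
            # the out-of-stock freebie could not actually be added.
            changed = True
        if not changed:
            return passes   # converged after this many passes
    return 0                # 0 = never converged (the 100% CPU case)

print(collect_totals({"freebie": "FREE-SKU"}, {"FREE-SKU": 5}))  # -> 2
print(collect_totals({"freebie": "FREE-SKU"}, {"FREE-SKU": 0}))  # -> 0
```

The fix in this model is exactly what the answer suggests: disable the rule (or make it stop re-firing when the freebie has no stock).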
Apart from that, please check your server and Magento logs; there you will find some errors for sure. Thanks
Thank you Manish!
I checked all the rules and we don't have any active ones with freebies.
Can an inactive catalog rule also mess things up?
No, an inactive rule should not. Can you check the logs (server logs and Magento logs) for more information?
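As a concrete way to act on the log advice: Magento 1 writes to var/log/system.log and var/log/exception.log by default. A small sketch that groups log entries by message, so the most frequent errors during an outage window stand out (the excerpt below is fabricated; read the real files from the Magento root instead):

```python
import re
from collections import Counter

# Fabricated excerpt in Magento 1's timestamped log format; in practice
# read var/log/system.log and var/log/exception.log.
log_text = """\
2018-07-09T10:15:03+00:00 ERR (3): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded
2018-07-09T10:15:41+00:00 ERR (3): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded
2018-07-09T10:16:02+00:00 WARN (4): Memcache connection failed
"""

counts = Counter()
for line in log_text.splitlines():
    # Strip the leading timestamp so identical messages group together.
    msg = re.sub(r"^\S+\s+", "", line)
    counts[msg] += 1

for msg, n in counts.most_common():
    print(n, msg)
```

A message that suddenly dominates the counts during an outage window (lock wait timeouts, connection failures, etc.) is usually a much better lead than the raw CPU graph.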