Wednesday, May 19, 2010

This week I found a new way to mess up a CLICK site upgrade.  I have a server running the latest release, 5.6.0.5, and needed to upgrade a copy of our production 5.5.3 site to the new version for testing.  Just after I started the restore, using the CLICK program "restore553storeto56", a developer started the CLICK "Entity Manager".  After running for 5 1/2 days at 100% CPU usage, the migration process timed out.  The EntEntityEditor process had used 71 hours of CPU time on the dual-core server while performing a total of 8 I/O operations: 4 reads and 4 writes.  I started the process again from the beginning, with the developers logged out, and the restore and migration took 6 hours.

CLICK has a bunch of timeouts, which kill certain transactions when exceeded.  When a task failed, or an error showed up in the WOMlog, and I traced the issue to a timeout, I raised the value, often with the consent of CLICK support.  But without really understanding the trade-off implicit in each timeout, I never set it back after the issue was resolved.  This migration would have failed far sooner, and been fixed sooner, if the timeouts had been lower.  Even with the "Command" and "Connection" timeouts set too high, though, I still saw timeout errors.  The best solution is a good description of each parameter's trade-offs and impact, though that is often tough to put together.
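
One low-tech way to keep that trade-off visible is to write down every raised timeout together with its original value and the reason it was changed.  Here is a minimal Python sketch of such a log.  It is not part of CLICK: the class and function names are made up for illustration, and the numbers are placeholders, not real CLICK defaults.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TimeoutOverride:
    """One raised timeout, with the context needed to revert it later."""
    name: str             # parameter name, e.g. "Command" (from the post)
    default_seconds: int  # placeholder, NOT a real CLICK default
    current_seconds: int  # placeholder for the value currently configured
    reason: str           # why the value was raised
    raised_on: date       # when it was raised
    issue_resolved: bool  # True once the original problem is fixed


def overdue_reverts(overrides):
    """Return overrides still raised even though their issue is resolved."""
    return [
        o for o in overrides
        if o.issue_resolved and o.current_seconds > o.default_seconds
    ]


# Hypothetical entries; every value here is illustrative only.
OVERRIDES = [
    TimeoutOverride("Command", 600, 86400,
                    "long-running task timed out", date(2010, 1, 12), True),
    TimeoutOverride("Connection", 30, 30,
                    "never changed", date(2010, 1, 12), True),
]

for o in overdue_reverts(OVERRIDES):
    print(f"{o.name}: {o.current_seconds}s, default {o.default_seconds}s ({o.reason})")
```

Running a check like this before a big job such as a migration would flag any timeout that was raised for an old problem and never put back.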

Learning: lock out the developers when starting a migration job.