I recently completed a hardware migration for a 24/5 trading system. After a successful go-live and one week without any major incidents, I think it is a good topic to share. Besides the standard things you have to do (e.g. project management topics), I would like to sum up some points that might be important or interesting to consider.
First, a few words on the existing environment I upgraded. It is a 24/5 trading system running on a Solaris 10 machine with an underlying SPARC architecture. The system has a lot of Java interfaces for back-office processing and additional proprietary interfaces for providing the rates. Besides the core machine, there are additional machines acting as secure gateways for internal and external access, as well as web servers. As part of the upgrade we switched to an x86 architecture and to newer Java, Apache and Tomcat versions. Based on this setup we have a development, a UAT and a production environment in place.
As mentioned, there are some prerequisites that are company dependent, mainly project documentation and project management methodology. Furthermore, ensure proper resource planning and a certain buffer, as there are always unexpected issues. Also involve the business from the beginning, even if you are doing just a “simple” upgrade.
A major point among the prerequisites is to always order the hardware in time. Major providers take some time to deliver the hardware. Nevertheless, you can order in phases: starting with development, followed by UAT and finally production. This might save some operational costs.
Make yourself a plan of the software and versions you would like to run. Request proprietary software in time, especially when you are switching the underlying architecture (from SPARC to x86). Furthermore, check the dependencies: certain software requires certain versions of underlying tools or runtime environments (e.g. Java).
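The dependency check can be written down as data and verified automatically. The following is a minimal sketch, not the tooling I actually used: the component names (`web-frontend`, `rates-adapter`) and all version numbers are made-up placeholders, and it assumes that "minimum required version" is the only kind of constraint you care about.

```python
# Planned versions on the new platform (illustrative placeholders, not real data).
planned = {"java": (11, 0), "web-frontend": (2, 3), "rates-adapter": (4, 2)}

# Minimum versions each component needs from its dependencies (hypothetical).
requirements = {
    "web-frontend": {"java": (8, 0)},
    "rates-adapter": {"java": (17, 0)},
}


def find_conflicts(planned, requirements):
    """Return (component, dependency, required, planned) for every unmet minimum."""
    conflicts = []
    for component, deps in requirements.items():
        for dep, minimum in deps.items():
            actual = planned.get(dep)
            # A dependency that is missing from the plan counts as a conflict too.
            if actual is None or actual < minimum:
                conflicts.append((component, dep, minimum, actual))
    return conflicts


conflicts = find_conflicts(planned, requirements)
for component, dep, minimum, actual in conflicts:
    print(f"{component} needs {dep} >= {minimum}, but the plan has {actual}")
```

Versions are kept as tuples so that Python compares them numerically rather than as strings.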
Track your activities
Once the prerequisites are in place, you can start with the hands-on work. In my case the operating system and the cluster software were set up by another team. Every single task or activity you perform on the setup should be documented. This helps in three ways:
- It makes it easier to recover from issues/problems, or at least to find the possible root cause.
- Certain activities (e.g. firewall changes and configurations) will have to be repeated on the other environments; having them documented makes this straightforward and involves less trial and error.
- Having an activity list makes it easier to create a go-live plan.
In my activity list I tracked the following points: activity, date of activity, dependency, responsible person, status (new, in progress, done), environment, as well as comments/remarks.
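A spreadsheet is perfectly fine for such a list, but the same fields can also be kept as a small data structure. Below is a rough sketch under that assumption. The class and function names, the sample activities and the dates are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    NEW = "new"
    IN_PROGRESS = "in progress"
    DONE = "done"


@dataclass
class Activity:
    name: str
    performed_on: date
    dependency: str      # name of the activity this one depends on, "" if none
    responsible: str
    status: Status
    environment: str     # e.g. "DEV", "UAT", "PROD"
    remarks: str = ""


def to_repeat(log, source_env):
    """Completed activities of one environment, in logged order.

    This is the list of steps to replay when setting up the next environment.
    """
    return [a for a in log if a.environment == source_env and a.status is Status.DONE]


# Sample entries (placeholders only).
log = [
    Activity("Open firewall port for rates feed", date(2024, 1, 8), "", "network team", Status.DONE, "DEV"),
    Activity("Install Tomcat", date(2024, 1, 9), "Install Java", "me", Status.DONE, "DEV"),
    Activity("Configure load balancer", date(2024, 1, 10), "Install Tomcat", "network team", Status.IN_PROGRESS, "DEV"),
]
replay = to_repeat(log, "DEV")
```

Filtering the development log this way gives a first draft of the task list for UAT, and later for the go-live plan.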
Each project requires proper testing. But testing should not be the task of IT alone; it requires a full front-to-back test. From my point of view the following tests should be included:
- Technical system tests: Test the safe start and stop of the application, a full failover, and the cluster and load balancer. Test the memory management and the performance of the application. You should also test the compatibility of software versions, e.g. if you are planning to upgrade Java, some parts of the system might be affected.
- Connectivity tests: With a lot of interfaces, each one has to transfer data from one system to another. To ensure that this will work, each connection (incoming as well as outgoing) has to be tested.
- Functional business tests: Even if the system version remains unchanged, it is important that the business does a complete test of the system to check that everything behaves as it should. This should include (as far as possible) a full end-to-end test, meaning that the whole business process should be covered.
Only with an OK from the business should you move on to the next phase. That is, after getting a business OK for the development environment, we went on to install the UAT environment. And only after successful end-to-end testing on UAT did we start setting up the new production environment.
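For the connectivity tests, a first pass over the outgoing connections can be scripted. The sketch below only checks that a TCP connection can be opened from the new machine. It says nothing about whether data actually flows, and incoming connections have to be tested from the other side. The function names and the interface names and hosts in the usage note are hypothetical.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_all(interfaces, timeout=3.0):
    """Map each interface name to True/False for a dict of name -> (host, port)."""
    return {
        name: is_reachable(host, port, timeout)
        for name, (host, port) in interfaces.items()
    }
```

It could be called as `check_all({"back-office": ("backoffice.example.internal", 8443)})`, where the host and port are placeholders for your own interface list. The result is a simple table of which connections still need a firewall change.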
For the go-live it is of utmost importance to have a plan and to repeat everything that was required for the systems already set up and running. I usually note every single activity, its dependencies on other activities, the responsible person, comments/remarks for special operations, as well as a completion box where the activity is marked as “done” after completion.
Also plan for additional resources in areas that are not under your responsibility. This could be support from the database team, operating system support or network administration. It is always good to be able to reach them if needed.
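Because every activity records what it depends on, a valid execution order can be derived rather than maintained by hand. This is a small sketch using `graphlib` from the Python standard library (Python 3.9 or newer). The step names are illustrative and not my actual go-live plan.

```python
from graphlib import TopologicalSorter

# Each activity maps to the set of activities that must be finished first.
dependencies = {
    "final checks on new system": set(),
    "take old service offline on load balancer": {"final checks on new system"},
    "switch on new service on load balancer": {"take old service offline on load balancer"},
    "inform colleagues and business": {"switch on new service on load balancer"},
}

# static_order() yields the activities so that every dependency comes first.
# It raises graphlib.CycleError if two activities depend on each other.
plan = list(TopologicalSorter(dependencies).static_order())

for number, activity in enumerate(plan, start=1):
    print(f"[ ] {number}. {activity}")
```

The printed lines are the checklist with its completion boxes. A circular dependency shows up as an error while planning, not in the middle of the night.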
For the go-live planning, also think about the migration itself and a fallback scenario. If you have a web service: are you going to switch the DNS name or the IP addresses? Are you going to need new certificates? In my case we simply attached the new servers to the existing load balancers. During the go-live there were two major activities for the actual migration. The first was taking the old service offline on the load balancer; the second was switching on the new service on the load balancer.
If you are able to do a lot of the setup in advance (or even have a parallel phase), go for it. The go-live/migration is usually one of the busiest and most tense phases of the whole upgrade project anyway. Ensure that you strictly follow your go-live plan and work through it step by step.
After having set up everything that is required for startup, check everything again, even if you had an extensive testing phase. From my point of view the following points should be checked:
- System status: Operating system, cluster software, memory and hard-disk status.
- Connectivity: Once again – check if all connections are working – you don’t want to have a trading system without rates.
- Accessibility: If you are able to log in and check the application, do it!
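Part of the system status check is easy to script. As one example, here is a minimal sketch of a free disk space check using only the Python standard library. The function name, the path and the 20 % threshold are assumptions for illustration, and cluster or memory status would need platform-specific tools that are not shown here.

```python
import shutil


def disk_ok(path, min_free_fraction=0.2):
    """Return True if at least min_free_fraction of the filesystem at path is free."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (in bytes)
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_fraction


# "." is a placeholder; in practice you would list the application's own mount points.
print("disk check passed" if disk_ok(".") else "disk check FAILED")
```

A handful of such yes/no checks, run right before startup, gives a quick final confirmation on top of the manual review.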
Communication is key! If you upgraded/migrated successfully, inform your colleagues and the business. Follow the principle “do good and let everybody know”. This is not only about self-marketing, but also ensures that everybody is aware and can react if issues occur in the first hours.
A side note: If you are doing the migration during the night, also think about having the right amount of coffee and sweets!
Beyond the Go-Live
After the go-live, plan for some extended support. Unexpected issues might come up, and you should be able to deal with them. Also (though this might be part of the go-live planning) think about a fallback scenario and make yourself a plan for how to switch back.
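One way to think about the switch-back plan is to pair every migration step with the action that reverts it, and to revert the completed steps in reverse order. The sketch below illustrates the idea only. The step actions just write to a list, the failure is simulated, and real actions would have to call whatever administration interface your load balancer offers.

```python
log = []


def run_with_fallback(steps):
    """Run (name, do, undo) steps in order.

    If a step fails, undo the steps completed so far in reverse order
    and return False. Return True if all steps succeed.
    """
    completed = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            for _done_name, done_undo in reversed(completed):
                done_undo()
            return False
        completed.append((name, undo))
    return True


def simulated_failure():
    # Stands in for a step that goes wrong during the go-live.
    raise RuntimeError("health check failed")


steps = [
    ("old service offline", lambda: log.append("old off"), lambda: log.append("old back on")),
    ("new service online", simulated_failure, lambda: log.append("new off")),
]
ok = run_with_fallback(steps)
```

Here the second step fails, so the first step is reverted and the old service is back on the load balancer.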