High Availability (HA) for Cloud-based Mobile Engagement Services
“What do you mean when you say High Availability?” he asked. “I have spoken with some SaaS providers who have some interesting definitions.”
This question was posed during a recent conversation with a senior IT representative from a major US bank. At first I thought his question was somewhat basic – I mean, don’t we all know what it means to be highly available? As it turns out, the answer is an emphatic “No!” Even for basic platform availability tenets, there are startling gaps in knowledge and ability across cloud-based players. The problem was even more pronounced in our discussion about providing mobile engagement services, as this space involves all of the challenges of mission-critical systems, compounded by a dependency on a complex, global network ecosystem.
So how does one make an informed evaluation? How do you separate the wheat from the chaff? To help answer this question, I decompose High Availability (HA) for Mobile Engagement into three core dimensions: Platform and Processes, Network, and Support.
1. Platform and Processes
The Platform is the software powering the service, and the Processes are the practices used to manage that software. I treat these as one category because a weakness in either one will undermine the other. HA disciplines for this dimension have been honed over decades: systems need to be geographically distributed, redundant with no single points of failure, designed to ensure uptime and minimize the risk of data loss…the list goes on and on. It’s a discipline that requires experience, and many companies either lack that experience or have not learned how to apply it effectively in the cloud. For example, I have seen companies claim they are HA because they run in a public cloud, as if simply deploying software in such an environment makes it resilient – it doesn’t. Cloud hosting providers suffer outages, all of them. Those hosting providers even make a point of listing best practices for HA that specify what your software needs to compensate for – running across separate availability zones, avoiding state awareness wherever possible, designing for failure of dependencies…to name a few.
As for the relevance of Processes, I have seen companies run across availability zones, but then deploy faulty software to all zones at once and bring them all down together. Another example is having a disaster recovery data center, but no evidence of regular testing to prove services will recover in a variety of failure scenarios. To assess this dimension, you need to dig into these questions. Generally, providers that run truly active-active platforms across availability zones, where the redundancy is constantly proven, are going to be superior to ones that depend on “recovery processes” in case of faults. Dig into their processes anyway, because even an active-active platform can fail through poor processes.
“Can’t I avoid all of this just by having a solid Service Level Agreement (SLA) in place?” No. A commitment on paper does not give a provider the ability to meet the SLA; it only means there may be repercussions if they fail. The burden is on the buyer to gain confidence that their provider can meet the service level.
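To make the deployment point concrete, here is a minimal sketch in Python of a zone-by-zone rollout, the opposite of pushing a faulty release to every availability zone at once. It is an illustration under assumptions, not a description of any particular provider's tooling: the function name `staged_rollout` and the `deploy`, `is_healthy`, and `rollback` callables are hypothetical placeholders for whatever a real platform uses.

```python
from typing import Callable, List


def staged_rollout(
    zones: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy a release one zone at a time, halting at the first failure.

    All four parameters are hypothetical hooks supplied by the caller.
    Returns the zones that were updated and passed their health check.
    """
    updated: List[str] = []
    for zone in zones:
        deploy(zone)
        if not is_healthy(zone):
            # Revert only the zone that failed and stop the rollout.
            # Zones not yet touched keep serving traffic on the old release.
            rollback(zone)
            break
        updated.append(zone)
    return updated
```

If the second of three zones fails its health check, the third is never touched, so a bad release can take down at most one zone at a time. When assessing a provider, the question to ask is whether something equivalent to this gate exists in their release process and how the health check is defined.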
2. Network
The Network is the set of interconnections used by the Platform to communicate with the mobile networks of end users across the globe. This is one of the challenges in the mobile space: mobile networks are complex, with varying behaviors, features, requirements, and costs to deal with. Achieving HA in this dimension requires more than redundant telecommunications: a network needs as many truly independent paths of higher order communications as possible. For example, at first glance you might assume that redundant SMS connections to diverse carrier data centers are bulletproof, but those carriers often have special platforms for A2P (application-to-person) messaging, common routing logic across data centers, and other platforms or processes that can cause all traffic on a short code to fail at once – making the redundant connections useless. To be truly resilient, a Network needs diversity, which includes a combination of direct connections, inter-carrier forwarding paths, short code paths, long code paths, virtual mobile number paths…again, the list goes on. For a proper assessment of Network HA, you should expect your mobile engagement provider to describe the types of challenges your traffic could face and how they will overcome these obstacles to keep it flowing. Keep asking questions until you understand – their experience will (or won’t) shine through.
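The idea of path diversity can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not how any real messaging platform is built: `send_with_failover` and the route names in the usage note are hypothetical, and each route is modeled as a plain function that returns True when the message is accepted.

```python
from typing import Callable, Optional, Sequence, Tuple

# A route is a (name, send function) pair. The send function takes a
# recipient and a message text and returns True if the message was accepted.
Route = Tuple[str, Callable[[str, str], bool]]


def send_with_failover(
    routes: Sequence[Route], recipient: str, text: str
) -> Optional[str]:
    """Try each independent route in order; return the name of the first
    route that accepts the message, or None if every route fails."""
    for name, send in routes:
        try:
            if send(recipient, text):
                return name
        except Exception:
            # Treat any error on this path as a route failure and move on.
            continue
    return None
```

A caller might pass routes ordered by preference, such as a direct short code connection, then an inter-carrier forwarding path, then a long code path. The fallback only helps if the routes are truly independent: two connections that share the same carrier platform or routing logic will fail together, and the loop will simply run out of options.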
3. Support
Support is your ultimate fallback. With the complexity of the global mobile channel, you need to partner with a company that has the domain expertise to serve you properly – avoiding issues where possible and dealing effectively with the inevitable times when things go awry. This goes beyond “follow-the-sun” technical support, beyond contacts who natively speak your language, and beyond how many people are in call centers. You should ask about the depth of knowledge of the Support staff. You should ask how they manage relationships with the global mobile networks. You should also ask how they stay on top of rules and regulations around the world. In addition, you should ask how they deal with the human elements of providing global, 24×7 mobile services. Experienced providers will excel in these areas and be capable of explaining the complexities they are dealing with on your behalf – those that cannot explain them probably lack the ability to manage them.
As a mobile industry expert for over 15 years, OpenMarket is helping global enterprises use mobile engagement services company-wide. Contact us today for more information.