Cutting down on downtime

Back on April 4th, Virgin Media confirmed widespread internet outages, with users across the UK disconnected and the company’s own website out of action for a time. This is likely to have caused great disruption for many companies trying to get on top of work priorities in the lead-up to the long Easter weekend. It may also have damaged not only Virgin’s reputation but also that of its business customers, as well as resulting in significant financial loss. But the Virgin Media outage is far from an isolated case.

A 2022 survey by network resilience company, Opengear, showed the proportion of outages costing over $100,000 has soared in recent times. Over 60% of failures result in at least $100,000 in total losses, according to the report, up from 39% in 2019. The share of outages costing upwards of $1 million increased from 11% to 15% over that same period.

The results of the study underline just how far-reaching the downtime challenge is. In the survey, which polled the views of 500 network engineers and 500 CIOs, separately, 50% of CIOs ranked financial loss among the main impacts on their business due to network outages over the past two years. But the monetary impact is far from the only cost to businesses.

CIOs also referenced customer satisfaction (47%), data loss (45%), loss of reputation (41%), loss of business opportunities/market competitiveness (35%) and SLA payouts (24%). Network engineers, in contrast, ranked customer satisfaction as the biggest impact (51%), with financial loss second on 29% and data loss third (28%).

These topline survey findings didn’t take into account the less widely measured but nevertheless undeniable fact that outages can also have a significant impact on every organisation’s most valuable asset: its staff. The stress of coping with an outage and its aftermath can be all but unbearable for service staff having to deal with unhappy or angry customers. More specifically, downtime can really take its toll on engineers facing long journeys to investigate outages, followed by a battle against time to get systems up and running again.

Outages themselves can often be difficult to avoid. After all, they have a wide range of root causes. Cable interconnects, power supplies, switches, dense compute chassis, storage arrays, and even air conditioning are potential sources of problems. And network devices are only increasing in complexity, with software stacks that are frequently updated and susceptible to bugs, exploits, and cyber-attacks.

The more often software stacks have to be updated, the more exposed they are to bugs and cyber-attacks. On the one hand, there is a risk of external attacks by cyber-criminals intent on exploiting weaknesses in the corporate network, or by external bots constantly looking for vulnerabilities that enable them to penetrate corporate networks. On the other, there is a growing threat from employees themselves. The causes are just as diverse as the risks, from disgruntled employees who deliberately open the doors to cyber-criminals to good-faith users who are victims of phishing attacks.

Added to all this, the ongoing expansion of networks to encompass edge computing has led to increased compute being pushed to the edge and more complex equipment being put in place in remote locations, where there are no IT staff, and where redundancy is not feasible. In such scenarios, it is no longer sufficient simply to design a robust data centre.

Finally, one of the most common causes of outages is the vulnerability of the primary network’s last mile. While ISP connectivity has improved over the past few years, one weakness these services can’t overcome is the last mile problem.

What this refers to is the final segment of the production network that connects a company network to its ISP. This is the weakest link in a business’s connectivity. All of the network traffic for a single office, store, branch, or distribution centre is funnelled through single links.

The bandwidth of these links effectively limits the amount of data that can be transmitted to your ISP. This bottleneck leaves you exposed to DDoS attacks and basic human error leading to outages. And this last mile can fall victim to physical failure. An accidental fibre cut can knock out an entire network and leave the company disconnected from its internet services for significant periods of time.

In the Opengear survey, more than a third (36%) of network engineers said ‘higher levels of downtime’ were among the biggest risks to organisations from not putting networks at the heart of their digital transformation. Moreover, 37% of engineers ranked ‘avoiding downtime’ among their organisation’s biggest networking challenges post digital transformation.

It was second only to security in the list. Some 35% of CIOs concurred, although among this group five other challenges, including skills shortages, network agility and performance, ranked higher. The low position given to avoiding downtime in the priority list among CIOs is a concern given the shortcomings of many approaches to addressing outages after they have occurred.

It is clear that network outages and the resulting downtime remain a serious issue for many businesses operating today. Yet the approaches taken by organisations to rectify these problems are often full of shortcomings. Too many businesses still rely on manual ways of working, sending engineers out to site and relying on manual methods of documentation.

So, what’s the way forward? Preparation is key. It is vital that when disruption occurs, companies have an IT business continuity plan that enables them to recover quickly. They need to ensure their network is resilient. Every CIO needs to know without question that when trouble strikes, for whatever reason, whether it’s a hurricane or a cyber-attack, a local power outage or a global pandemic, their network will be ready to deal with it.

One priority must be ensuring businesses have visibility and the agility to pivot as problems occur. Many are not proactively notified if something goes offline. Even when they are aware, it may be difficult to understand which piece of equipment at which location has a problem.
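As a rough illustration of what proactive notification could look like, the Python sketch below compares two status snapshots and reports which device at which site has just gone offline. The function name, the site and device labels, and the snapshot format are all hypothetical, chosen for this example only; a real deployment would take its status data from a monitoring system.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (site name, device name). Value: True if the device is up.
DeviceKey = Tuple[str, str]


def newly_offline(
    previous: Dict[DeviceKey, bool],
    current: Dict[DeviceKey, bool],
) -> List[str]:
    """Return one alert per device that was up before and is down now."""
    alerts = []
    for (site, device), is_up in sorted(current.items()):
        # Assumption: a device not seen in the previous snapshot counts as
        # previously up, so a brand-new device that is down still alerts.
        was_up = previous.get((site, device), True)
        if was_up and not is_up:
            alerts.append(f"{device} at {site} went offline")
    return alerts


if __name__ == "__main__":
    before = {("Leeds", "sw1"): True, ("Leeds", "rtr1"): True}
    after = {("Leeds", "sw1"): False, ("Leeds", "rtr1"): True}
    for alert in newly_offline(before, after):
        print(alert)
```

Naming both the site and the device in each alert is the point of the exercise: it addresses the problem of knowing that something is wrong without knowing where.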

To solve errors, an organisation might need to perform a quick system reboot remotely. If this does not work, there may be a problem with a software update or other significant issue. That’s where Smart Out of Band Management using an alternative path into the network really comes into its own.

Relying on the main production network to access a corporate network in the event of a network outage is dangerous because when an issue occurs, an engineer may not have access to that production network. Having access to a separate, secure management plane, in the form of an Out of Band (OOB) management network, ensures remote access to remediate even during an outage, whether caused by a cyber-attack, a misconfiguration, or a network cable being cut in error, for example.

OOB gives organisations an alternate way to connect to their remote equipment such as routers, switches, and servers through the management plane, without directly accessing the device’s production IP address in the data plane and independent of the primary ISP connection the company uses.

This Out of Band path is completely separate from the production network and allows administrators to securely monitor, access, and manage all devices without interfering with normal operations, and even more importantly, without having to allow data plane level access to the management plane.

Since the Out of Band network separates management and user traffic, businesses can lock down, restrict access, and fully secure the management plane. Also, they can configure, manage, and troubleshoot their devices even when the data plane is down. An OOB solution offers organisations a secondary connection, often through 4G LTE, that lets network technicians solve problems from anywhere, and most importantly, saves the company time and money.
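The idea of an alternative path can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not a description of any vendor’s product: it tries the in-band management address first and falls back to an Out of Band address if that fails. The path labels, the IP addresses (taken from the ranges reserved for documentation) and the function names are hypothetical.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, TCP port)


def tcp_reachable(endpoint: Endpoint, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections and DNS failures
        return False


def choose_path(
    paths: Sequence[Tuple[str, Endpoint]],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> Optional[str]:
    """Return the label of the first reachable path, or None if all fail."""
    for label, endpoint in paths:
        if probe(endpoint):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical addresses: SSH on the production path, then on the OOB path.
    paths = [
        ("in-band", ("192.0.2.10", 22)),
        ("out-of-band", ("198.51.100.10", 22)),
    ]
    print(choose_path(paths))
```

Passing the probe in as a parameter keeps the selection logic separate from the reachability check, so the same ordering rule could be reused with whatever check suits a given environment.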

While taking account of all the above considerations will be key in raising levels of resilience across business networks, bringing in more automation will also be critically important. Again, this often starts with an independent management plane, which has a vital role to play in automating common network operations (NetOps) processes.

One of the biggest benefits of NetOps is its versatility. It can be there on Day One, enabling the deployment process to be managed via centralised management software and ensuring network equipment can effectively self-configure.

It is there for the standard day-to-day process of keeping the network running, but it can also provide an alternative route to remediate the network when it has gone down. NetOps supports rapid resolution of network outages, cutting the time it takes to get the network back up.

In the past, if a particular event had happened on the network, most companies would expect an engineer to log in, run through five or six routines to work out what was happening and then remediate the problem. The role of NetOps is to automate that entire process so that when that event happens, the system automatically runs through those five or six steps. If that resolves the problem, fine. If not, the issue is escalated to the network engineer to manage the next level of troubleshooting.
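The run-then-escalate pattern described above can be sketched in Python as follows. This is a minimal illustration rather than any particular NetOps tool: the `Step` class, the `run_runbook` function and the convention that each step returns True once the problem is resolved are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One remediation routine (hypothetical structure for this sketch)."""
    name: str
    action: Callable[[], bool]  # returns True if the issue is now resolved


def run_runbook(steps: List[Step], escalate: Callable[[List[str]], None]) -> bool:
    """Run steps in order; stop at the first that resolves the issue.

    If no step resolves it, hand the list of attempted steps to `escalate`
    so the engineer can pick up the next level of troubleshooting.
    """
    attempted: List[str] = []
    for step in steps:
        attempted.append(step.name)
        try:
            if step.action():
                return True
        except Exception as exc:
            # Record the failure and carry on with the next routine.
            attempted[-1] = f"{step.name} (failed: {exc})"
    escalate(attempted)
    return False


if __name__ == "__main__":
    steps = [
        Step("check interface status", lambda: False),
        Step("restart routing process", lambda: False),
    ]
    resolved = run_runbook(steps, lambda tried: print("Escalating after:", tried))
    print("Resolved automatically:", resolved)
```

Passing the list of attempted steps to the escalation hook means the engineer starts from what has already been tried rather than from scratch.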

All this simplifies the process. But it also removes human error because so many downtime incidents are simply caused by someone pushing a wrong configuration or typing in the wrong letters when they are sending commands. By using a NetOps approach to correctly program an automation routine, an organisation can effectively remove these challenges.

In the most recent Opengear survey, 57% of CIOs highlighted a reduction in downtime among the benefits of network automation. Companies around the world recognise that the ability to operate independently from the production network and detect and remediate network issues automatically can dramatically improve security, save time and reduce costs. At a time when most businesses are focused on doing more with less, that’s absolutely critical.

It is worth highlighting that time is critical whenever downtime happens. When network outages occur, the damage is cumulative so businesses need to pre-plan and ensure that they are putting in place network resilience as a preventative rather than a reactive approach. Often today the issue is not fully considered upfront.

Organisations often defer discussions around network resilience based on the optimistic hope that a network outage never happens to them. In fact, network resilience should be built into the network from the outset. It should be a tick box exercise but typically it is not. Organisations generally either think that their network is somewhat resilient through the in-band path or they are not thinking about their branches or remote sites as much as they should.

Of course, anyone who has just suffered a network outage will understand the benefits of Out of Band (OOB) as a way of keeping their business running in what is effectively an emergency. But, as referenced above, it is likely to be much better to plan for resilience from the word go. After all, networks are the ‘backbone’ of most businesses today, and many will benefit from bringing network resilience into the heart of their approach from the outset.

With outages still on the up both in terms of prevalence and the average pecuniary loss incurred, organisations need to ensure that their networks are resilient. A combination of out of band, automation, and NetOps will enable them to do just that.