Downtime, failures and breakdowns – Understand real costs
- April 11, 2019
- Written by: Gad Cohen
This content is brought to you by Evolven. Evolven Change Analytics is a unique AIOps solution that tracks and analyzes all actual changes made to the enterprise cloud environment. Evolven helps leading companies reduce the number of incidents, improve problem resolution time and eliminate unauthorized changes.
When it comes to mission-critical applications or data center performance quality, companies are willing to make big investments. Unfortunately, these investments don't always deliver full performance.
Dealing with system downtime
Despite the efforts that have been made in infrastructure resiliency, many IT organizations continue to deal with database, hardware and software outages that last from just a few minutes to several days, completely crippling the business and causing huge losses.
Expected downtime
Sometimes the world of IT outages can feel uncomfortable.
Despite the array of advanced solutions and the growing amount of data collected by leading enterprise software vendors and IT departments (from ERP to CRM and beyond), outages are still a valid and daunting threat to the industry.
On the other hand, IT outages have somehow become an inherently accepted, even expected, part of corporate life.
This is counterintuitive...
IT downtime revisited
While IT professionals experience downtime from time to time and then focus fully on managing it, the business as a whole bears the financial damage, which is often very significant.
In the past, we've delved deep into the various ways that IT downtime can impact organizations' bottom lines (you can read more about it here: Cost and scope of unplanned outages). We looked at a range of aspects, from direct lost sales and reputational damage to indirect effects such as lost productivity.
Now, I want to revisit the topic and examine how organizations should approach and assess threats to their IT operations, including systems, applications and data, looking at robust (and established) benchmarks that represent the potential costs of downtime and outages.
System failures:
Measuring the failures of big brands
When should the industry start measuring the financial impact of major brand failures like the ones that occurred recently: the Facebook outage, the one that reached hundreds of thousands of Lloyds Bank customers, or the Jetstar failure that caused hundreds of flight delays?
In other words, when is an outage "significant enough" that a cost analysis becomes valuable for the industry to learn and predict the impact of future outage events?
Apparently, at some point an outage has an impact that cannot be ignored in terms of public relations. That is the point of no return, after which estimates of the financial impact follow.
The cost of downtime varies significantly across industries. The size of the company involved is obviously a critical factor, but not the only big one. The role of IT systems in the company is also crucial.
Putting a numerical value behind an IT outage means pre-defining its impact on multiple business and organizational aspects so the entire industry can learn and optimize accordingly.
A failure of a critical application can result in two different types of losses:
- Application Service Outage – The impact of downtime varies by application and organization;
- Data Loss – Possible data loss due to a system failure can have significant legal and financial implications.
I'm sure you would agree that today's data centers should never go down. Applications need to be available 24/7, and internal (let alone external) end users around the world need to be able to rely on data center availability (for critical data and application availability) at all times.
Well, reality bites. This is not the case in the back office (i.e., inside the data center). No organization enjoys 100% uptime. Do you want to aim for 100%? Sure. But you must also develop a deep understanding of the impact of downtime and ways to minimize it.
Worst nightmare ever? Probably what happened to you...
Some earlier outages turned into public relations disasters, like the legendary Virgin Blue disaster of 2010 or the more recent one that affected Facebook.
Why? The mass impact probably had something to do with it.
As a reminder, the Virgin Blue outage prevented passengers from boarding flights for 11 days (!!), resulting in negative press, damaged reputations and millions in losses.
More specifically, Virgin Blue's reservation management company, Navitaire, ended up compensating Virgin Blue for over $20 million (Navitaire booking error brings Virgin $20 million in make up).
There are many other incidents that still attract media attention. Here is just one recent USA Today article on the Wells Fargo outage that prevented customers from accessing their accounts for many hours.
I can safely say that anyone in IT would agree that failures or downtime are VERY bad for business. They are undesirable, very harmful financially and must be fought with all available means.
Configuration errors are the key
The IT Process Institute's Visible Ops Handbook has reported in the past that "80% of unplanned outages are due to poorly planned changes made by administrators ('ops') or developers" (visible operations).
Enterprise Management Associates reported that 60% of availability and performance failures are due to misconfigurations.
How much does it cost?
Downtime can cost organizations $5,600 per minute and up to $300,000 per hour in web application downtime (per a 2014 Gartner analysis).
Figure: Average cost per hour of corporate server downtime worldwide, 2017-2018.
Application maintenance costs are increasing by 20% per year. But that can't solve all your problems. Previous industry research found that at least a quarter of surveyed downtime was caused by configuration errors. (How much will you spend on app downtime this year?).
How common are downtime and outages?
Okay, downtime can be a financial nightmare. That part is clear. But if you want to properly assess the potential risk of failure for your business, the immediate question must be, "How likely is this to happen?"
Source: Data Center Knowledge
Okay, outages are too common to dismiss with the thought "I probably won't see a major one". The question now is how you can calculate the specific risk for your company.
Production and application downtime costs made transparent
Unplanned outages must be resolved by IT. However, and as already mentioned, ultimately these failures affect the entire organization.
An important part of a thorough downtime risk assessment is estimating how much money you will lose per hour (or minute, or whatever time interval you choose) in the event of downtime.
For companies that rely solely on the ability of data centers to deliver network and IT services to customers, such as telecom service providers or e-commerce companies, downtime can be particularly costly, with the highest cost of a single event being more than $1 million (over $11,000 per minute), according to expert estimates.
In a USA Today survey of 200 data center managers, more than 80% said their downtime costs exceeded $50,000 per hour. More than 25% reported downtime costs greater than $500,000 per hour (!!).
According to another survey, while companies cannot achieve zero downtime, one in ten say their availability needs to be greater than 99.999%.
Source: SearchCIO, TechTarget
To get a complete understanding of the impact of production and application downtime, let's take a look at how the consequences of downtime manifest.
Downtime costs - per year or per incident?
A 2017 study found that 46% of 400 IT decision makers experienced more than four hours of IT-related downtime in a 12-month period; 23% said they incur costs between $12,000 and more than $1 million per hour.
More than 35% admitted they are unsure of the cost of an outage to their business.
If you ask Delta Airlines, which had to cancel 280 flights due to a glitch in 2017, the losses from a single glitch can reach more than $150 million.
A few years ago, Dun & Bradstreet reported that 59% of Fortune 500 companies experience at least 1.6 hours of downtime per week.
If you take an average Fortune 500 company (or a company with at least 10,000 employees) and assume that it pays staff an average of $56 an hour, then the labor cost of that downtime alone (assuming every employee is idled for the full 1.6 hours) would amount to $896,000 per week for a company of this size, which is over $46 million per year (Assessing the financial impact of downtime).
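The labor figure above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes every one of the 10,000 employees is fully idle for the 1.6 weekly hours quoted from Dun & Bradstreet.

```python
def weekly_labor_cost(employees: int, hourly_rate: float, downtime_hours: float) -> float:
    """Labor cost of downtime, assuming all employees are idle for the whole period."""
    return employees * hourly_rate * downtime_hours


# Figures quoted in the text: 10,000 employees, $56 per hour, 1.6 hours per week.
weekly = weekly_labor_cost(10_000, 56.0, 1.6)
annual = weekly * 52  # 52 weeks per year

print(f"${weekly:,.0f} per week, ${annual:,.0f} per year")
# $896,000 per week, $46,592,000 per year
```

In practice only part of the workforce is affected by any given outage, so a real estimate would scale the headcount down accordingly.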
Of course, reality is more complicated, as many parameters must be taken into account, such as the time of the event (midweek or weekend? day or night?) and much more. However, understanding the cost of downtime goes a long way in estimating your risk exposure and the ROI of tools that can help minimize the impact of downtime.
Has the industry been able to learn from the past and minimize collateral damage in the event of a disruption?
How have things changed from the past?
So, we already know that there are still downtime and disruptions that the industry has yet to successfully eliminate. But how have their costs changed over time? Are these incidents less harmful today?
In 2010, a study by Coleman Parkes found that IT downtime costs companies a total of more than 127 million hours per year (an average of 545 hours per company) in employee productivity.
In 2009, it was reported that the average cost of downtime varies significantly by industry, from around $90,000 per hour in the media industry to around $6.48 million per hour for large online brokers (How to quantify downtime).
According to a survey of IT managers conducted during these years, companies are becoming more aware of the direct financial cost of computer failures. Research found that one in five businesses lose $12,000 per hour due to system downtime (How to quantify downtime).
As noted above, further analysis conducted by Gartner in 2014 found average costs of $5,600 per minute and over $300,000 per hour.
As early as 2004, a conservative estimate by Gartner put the cost per hour of computer network downtime at $42,000. As a result, a company that suffers an above-average 175 hours of downtime per year can lose more than $7 million annually. However, the cost of each outage affects each company differently, so it's important to know how to calculate the exact financial impact (How to quantify downtime).
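As a quick check of the 2004 Gartner figures, annual exposure is simply hours of downtime multiplied by the hourly cost. The helper name below is hypothetical; the inputs are the numbers quoted in the paragraph above.

```python
def annual_downtime_cost(hours_per_year: float, cost_per_hour: float) -> float:
    """Annual cost of downtime as hours of downtime times cost per hour."""
    return hours_per_year * cost_per_hour


# 175 hours of downtime per year at $42,000 per hour.
cost = annual_downtime_cost(175, 42_000)
print(f"${cost:,.0f}")  # $7,350,000
```

The result, $7.35 million, matches the "more than $7 million annually" stated in the text.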
It makes sense to think that the cost of downtime will only increase over time (since we are more reliant on data systems today). You can therefore understand why past figures should be multiplied by a significant factor to reflect today's reality...
Every minute counts
More than a decade ago, the average cost of data center downtime across all industries was estimated at approximately $5,600 per minute (Unplanned IT outages cost more than $5,000 per minute), a figure that, according to Gartner, remained the same until 2014. An earlier Ponemon Institute study calculated the minimum, median, mean, and maximum cost per minute of unplanned outages, based on inputs from 41 data centers. It found that the highest cost of an unplanned outage exceeds $11,000 per minute.
On average, the cost of an unplanned outage is likely to exceed $5,000 per minute.
It only gets more significant
A 2013 study saw an increase of more than 41% over the previous averages described above, to an average of more than $7,900 per minute.
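The size of that jump can be verified directly. The sketch below compares the earlier $5,600 per minute average with the $7,900 per minute figure; the function name is an illustrative assumption.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, expressed as a percentage."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100


increase = percent_increase(5_600, 7_900)
print(f"{increase:.1f}%")  # 41.1%
```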
ITIC research from 2015 clearly showed that hourly costs have increased by 25% to 30% (compared to 2008 data).
Impact of downtime per year
A previous Gartner analysis calculated that downtime can average 87 hours per year. Obviously, this is the sum of many failures - from a few minutes to several hours (An average large enterprise experiences 87 hours of network downtime per year).
How have things changed?
A later 2011 survey revealed that while the industry has been successful in combating the downtime epidemic and reducing its frequency, we are still seeing significant downtime and huge revenue losses (one outage resulted in over 3 million users, apparently WhatsApp users, switching to Telegram).
The impact on reputation and loyalty
How much is your company's reputation worth? This can be extremely difficult to assess, as can the long-term impact of a damaged reputation and its impact on sales and profitability.
In this case, the cost of downtime includes lost customers (short and long term) and other tangible items that reflect the cost of reputational damage, such as a decline in the stock price, marketing time (crisis management and brand recovery) and the media budget needed to relaunch and revitalize an organization's profile.
Which parameters should influence your calculation?
When trying to estimate the cost of downtime, there are the obvious direct costs (eg lost business during downtime). However, many indirect costs such as employee overheads or reputational issues mentioned above must also be taken into account.
Staffing costs come from the cost of running “war room” efforts focused on getting IT systems up and running again, the cost of delays in all other scheduled tasks, the cost of staff overtime (if applicable) and a lot more. Add to this the value of data loss, emergency maintenance fees (especially if the outage occurs outside of business hours), and additional repair costs that can persist long after service is restored.
Obviously, you need to calculate these costs when estimating the impact of downtime, as they are often very significant. But even a rough estimate can be extremely helpful in understanding the risks and deciding what level of technology to rely on to combat them.
There is also the impact of lost sales. To get an accurate estimate of total lost sales, the impact percentage needs to be increased to reflect the true lifetime value of customers permanently switching to a competitor. Take, for example, the Facebook (and WhatsApp) outage I mentioned earlier (Cost Conscious: Denying the true cost of network downtime). What revenue loss results from these users seeing fewer billable ad impressions?
Stock dropped 25%
Although it is difficult to quantify so many parameters, they are real and significant. For example, when Amazon.com was offline for several hours in its early days, its stock dropped 25% in a single day (Cost Conscious: Denying the true cost of network downtime)!
In the Amazon Cloud outage, for example, the company continued to struggle to get its cloud services back online. As a result, many customers questioned the reliability of its cloud and Amazon's communications related to the outage. Other customers felt that they should be compensated for downtime as part of their SLA.
I know you're curious: As for the SLA, despite the nearly four-day outage, Amazon's EC2 SLA was not violated (Seven Lessons Learned from Amazon's Failure).
Downtime costs: calculate for yourself
How much do you stand to lose with an unexpected server or business application failure?
According to various sources, the easiest way to calculate the potential for lost revenue during an outage is to use this equation:
LOST REVENUE | = | (GR/TH) x I x H |
GR | = | gross annual revenue |
TH | = | total annual business hours |
I | = | percentage impact |
H | = | number of hours of outage |
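The equation translates directly into code. The following is a minimal sketch: the function and parameter names are illustrative, the impact is passed as a fraction (0.5 for 50%), and the example inputs are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def lost_revenue(gross_annual_revenue: float,
                 total_annual_hours: float,
                 impact_fraction: float,
                 outage_hours: float) -> float:
    """Lost revenue = (GR / TH) x I x H, with I given as a fraction between 0 and 1."""
    if total_annual_hours <= 0:
        raise ValueError("total_annual_hours must be positive")
    if not 0 <= impact_fraction <= 1:
        raise ValueError("impact_fraction must be between 0 and 1")
    revenue_per_hour = gross_annual_revenue / total_annual_hours
    return revenue_per_hour * impact_fraction * outage_hours


# Hypothetical company: $50M annual revenue, 2,000 business hours per year,
# an outage that affects 50% of revenue-generating activity for 4 hours.
loss = lost_revenue(50_000_000, 2_000, 0.5, 4)
print(f"${loss:,.0f}")  # $50,000
```

A business that sells around the clock would use 8,760 hours (24 x 365) for TH instead, which lowers the revenue attributed to each hour.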
How to minimize the risk of breakdowns and stoppages?
Downtime and failures are catastrophic, but they don't have to be so severe. By using solutions that focus on getting to the root of the problem, failures can be prevented before they even happen.
Evolven Change Analytics has developed a unique AIOps solution focusing on change, the real cause of performance incidents. Evolven helps enterprise IT and cloud operations teams prevent and resolve incidents before problems occur.
Contact us to see how we're helping leading companies reduce incidents and MTTR.
FAQs
What are the costs associated with downtime? ›
Relatively small businesses' cost of downtime falls into a range of $137 to $427 per minute, whereas for larger businesses, the downtime can cost over $16,000 per minute ($1 million per hour) for just a short outage.
What is an example of downtime cost? ›To get a quick estimate of your company's probable downtime costs, use the following formula, based on the size of your business and the number of minutes your most recent incident lasted: Downtime cost = minutes of downtime x cost-per-minute. For small business, use $427 as cost-per-minute.
What is the difference between downtime and outage? ›Downtime occurs when a system can't complete its primary function. It can be broken up into two types: IT outages and brownouts. IT brownouts occur when a system is slowed or partially available. This might mean customers can access your site, but pages load slowly or dynamic features like "add to cart" don't function.
What is downtime failure? ›In industrial environments, downtime may refer to failures in production equipment. This type of downtime is often measured as downtime per work shift or downtime per a 12- or 24-hour period. Downtime duration is the period of time when a system fails to perform its primary function.
What are the three types of downtime? ›Common categories of downtime include excessive tool changeover, excessive job changeover, lack of operator, and unplanned machine maintenance.
What is the most common cause downtime? ›Human Error: Regardless of whether accidental or due to negligence, human error is one of the most common causes of unplanned downtime. An employee unintentionally deleting data or accidentally unplugging a cable or not following standard protocols can lead to costly downtime.
What are the two types of downtime? ›Downtime falls into two categories: planned and unplanned. Planned downtime is notable because it offers advanced warning and gives users a chance to prepare. Planned downtime is usually done for upgrades or maintenance to the network infrastructure.
How do you explain downtime? ›a time during a regular working period when an employee is not actively productive. an interval during which a machine is not productive, as during repair, malfunction, maintenance.
What is the industry standard for downtime? ›World Class Standards For Downtime
Aim for unscheduled downtime to be 10% or less.
Downtime is also known as idle time.
What is a major outage? ›
Major Outage means any Power Outage that lasts for at least ten (10) consecutive minutes and/or any Temperature Irregularity, in each case causing inoperability of Customer's Equipment.
In lean manufacturing, DOWNTIME is also used as an acronym for the eight wastes:
- Defects.
- Overproduction.
- Waiting.
- Not-Utilizing Talent.
- Transporting.
- Inventory.
- Motion Waste.
- Excess Processing.
Downtime is the period during which equipment is not operational. This situation is caused by such factors as maintenance, setup for a job, broken equipment, or missing inputs, such as raw materials or qualified operators.
What are the consequences of downtime? ›Consequences of unplanned downtime
Lost productivity and revenue: Every minute of downtime can result in lost productivity and revenue, affecting a business's bottom line. Decreased customer satisfaction: Unplanned downtime can lead to delayed deliveries, canceled orders, and frustrated customers.
Human Error and Accidental EPO
Often the most cost-effective root causes to address, human error and accidental “emergency power-off” (EPO) remain leading causes of data center outages.
Server downtime is defined as any time your server is unavailable or unresponsive. Various factors, including hardware failures, power outages, network problems, and software issues, can cause this.
What causes a major IT outage? ›Servers can crash for a variety of reasons: faulty RAM, power supply glitches, damage to a hard disk platter, bugs, firmware upgrades and more. To avoid IT outages from unstable servers, servers need 24/7 network monitoring. Monitoring can ensure timely replacement and a healthy environment to avoid downtime.
What is a downtime plan? ›Planned downtime is scheduled time when production equipment is limited or shut down to allow for planned maintenance, repairs, upgrades or testing.
What is downtime behavior? ›Downtime behavior determines how events related to a CI are handled when received while that CI was in downtime. To access. Administration > Event Processing > Automation > Downtime Behavior. Alternatively, click Downtime Behavior.
How can we minimize the risk of system downtime? ›
- Test Server Backups On A Regular Basis. When a server goes down, you can mitigate damage by restoring it quickly. ...
- Utilize Cloud Solutions. ...
- Keep Everything Up To Date. ...
- Invest In Reliable Equipment.
Availability is normally expressed in 9's. For example, “5 nines uptime” means that a system is fully operational 99.999% of the time, an average of less than 6 minutes of downtime per year.
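The relationship between an availability percentage and the downtime it allows can be sketched as follows. The function name is illustrative, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year, in minutes, for a given availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for level in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(level)
    print(f"{level}% availability -> {minutes:,.2f} minutes of downtime per year")

# The last line printed is for "5 nines": about 5.26 minutes per year.
```

Each additional nine cuts the allowed downtime by a factor of ten.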
How do you handle downtime at work? ›
- Offer to help a colleague or manager. ...
- Organize and clean your workspace. ...
- Go for a walk. ...
- Clean your email inbox. ...
- Read industry news. ...
- Compile a list of contacts. ...
- Record your voicemail greeting. ...
- Write a note of appreciation.
The most well-known downtime metric is Mean Time to Repair (MTTR). The MTTR metric reflects the average time it takes to troubleshoot and repair a failed piece of equipment.
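MTTR is the total repair time divided by the number of repairs. A minimal sketch, using hypothetical incident durations (the list below is invented for illustration, not taken from any study):

```python
from statistics import mean


def mttr_minutes(repair_durations_minutes: list[float]) -> float:
    """Mean Time to Repair: the average duration of the recorded repairs."""
    if not repair_durations_minutes:
        raise ValueError("at least one repair duration is required")
    return mean(repair_durations_minutes)


# Hypothetical repair times, in minutes, for four incidents.
repairs = [30, 45, 120, 15]
print(mttr_minutes(repairs))  # 52.5
```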
Why is it important to reduce downtime? ›Importance of Reducing Unplanned Downtime
Waiting on parts or the necessary personnel to fix an issue takes time and could mean the machine is going to stay down for longer. Longer downtime is less time making product, directly affecting the bottom line.
Is downtime a KPI? ›Revenue is directly impacted by downtime because the less equipment is running, the fewer products are made and sold. Therefore, one of your maintenance KPIs is downtime. All sorts of quantifiable actions can influence downtime, such as the mean time to repair (MTTR) or planned maintenance percentage.
What was the biggest outage in the US? ›What Happened? August 14 and 15, 2003 - The northeastern U.S. and southern Canada suffered the worst power blackout in history. Areas affected extended from New York, Massachusetts, and New Jersey west to Michigan, and from Ohio north to Toronto and Ottawa, Ontario. Approximately 50 million customers were impacted.
What is an example of a planned outage? ›Planned outages are deliberate and are scheduled at a convenient time, for example, for the following purposes: Database administration, such as offline backup or offline reorganization. Software maintenance of the operating system or database server. Software upgrades of the operating system or database server.
How long is a forced outage? ›An outage (whether full, partial, or attributable to a failed start) is considered "forced" if it could not reasonably be delayed beyond 48 hours from identification of the problem, if there had been a strong commercial desire to do so.
What are the financial impacts of downtime? ›
The cost of downtime = downtime duration x per-minute cost.
You can use around $400 as a cost-per-minute figure for small enterprises. In the case of large and medium businesses, use $10,000. Many people only associate downtime costs with lost revenue.
Production downtime report measures the duration and cost of downtime for a production resource, operation, and work center. Both planned and unplanned downtime are measured, and cost is calculated based on the cost per unit of time and total downtime.
Why is IT important to record downtime? ›Tracking downtime allows maintenance managers to identify areas where processes could be causing unnecessary delays and bring clarity to questions on an asset's availability.
What is downtime importance? ›Downtime restores attention and motivation, fosters creativity, improves work efficiency and is essential for peak performance. Think about the word recreation for a second and break it apart.
What is the true cost of downtime report? ›This new report suggests that unplanned downtime now costs Fortune Global 500 companies 11% of their yearly turnover – almost $1.5tn. This is up from $864bn (8% of turnover) two years ago.
What is downtime and how can IT affect a business? ›Network downtime means that your customers can't access your online services. They can't find or buy your services and products. If your potential customers can't access your website, then it will affect your revenue. Also, your existing customers can't access your products and services.
How do you overcome downtime? ›
- Plan for Recovery. The best way to ensure a fast recovery is to plan ahead. ...
- Keep Everything Up to Date. ...
- Educate Your Workforce. ...
- Install a Backup Power System. ...
- Test Your Infrastructure. ...
- Consider Disaster Recovery as a Service.
A little downtime is important for your brain health. Research has found that taking breaks can improve your mood, boost your performance and increase your ability to concentrate and pay attention. When you don't give your mind a chance to pause and refresh, it doesn't work as efficiently.
How does downtime affect production? ›All manufacturing downtime reduces overall output by stopping production. Unplanned downtime can cost 15 times more than planned downtime. The loss of revenue during any type of asset maintenance can be as high as $3 million per incident.