The care and feeding of a data center
By Richard F. Van Loo
Managing and operating a data center comprises a wide variety of activities, including the maintenance of all the equipment and systems in the data center, housekeeping, training, and capacity management for space power and cooling. These functions have one requirement in common: the need for trained personnel. As a result, an ineffective staffing model can impair overall availability.
The Tier Standard: Operational Sustainability outlines behaviors and risks that reduce the ability of a data center to meet its business objectives over the long term. According to the Standard, the three elements of Operational Sustainability are Management and Operations, Building Characteristics, and Site Location (see Figure 1).
Management and Operations comprises behaviors associated with:
• Staffing and organization
• Planning, coordination, and management
• Operating conditions
Building Characteristics examines behaviors associated with:
• Building features
Site Location addresses site risks due to:
• Natural disasters
• Human disasters
Management and Operations includes the behaviors that are most easily changed and have the greatest effect on the day-to-day operations of data centers. All the Management and Operations behaviors are important to the successful and reliable operation of a data center, but staffing provides the foundation for all the others.
Data center staffing encompasses the three main groups that support the data center, Facility, IT, and Security Operations. Facility operations staff addresses management, building operations, and engineering and administrative support. Shift presence, maintenance, and vendor support are the areas that support the daily activities that can affect data center availability.
The Tier Standard: Operational Sustainability breaks Staffing into three categories:
• Staffing. The number of personnel needed to meet the workload requirements for specific maintenance
activities and shift presence.
• Qualifications. The licenses, experience, and technical training required to properly maintain and
operate the installed infrastructure.
• Organization. The reporting chain for escalating issues or concerns, with roles and responsibilities
defined for each group.
In order to be fully effective, an enterprise must have the proper number of qualified personnel, organized correctly. Uptime Institute Tier Certification of Operation Sustainability and Management & Operations Stamp of Approval assessments repeatedly show that many data centers are less than fully effective because their staffing plan does not address all three categories.
The first step in developing a staffing plan is to determine the overall headcount. Figure 2 can assist in determining the number of personnel required.
The initial steps address how to determine the total number of hours required for maintenance activities and shift presence. Maintenance hours include activities such as:
• Preventive maintenance
• Corrective maintenance
• Vendor support
• Project support
• Tenant work orders
The number of hours for all these activities must be determined for the year and attributed to each trade.
For instance, the data center must determine what level of shift presence is required to support its business objective. As uptime objectives increase so do staffing presence requirements. Besides deciding whether personnel is needed on site 24 x 7 or some lesser level, the data center operator must also decide what level of technical expertise or trade is needed. This may result in two or three people on site for each shift. These decisions make it possible to determine the number of people and hours required to support shift presence for the year. Activities performed on shift include conducting rounds, monitoring the building management system (BMS), operating equipment, and responding to alarms. These jobs do not typically require all the hours allotted to a shift, so other maintenance activities can be assigned during that shift, which will reduce the overall total number of staffing hours required.
Once the total number hours required by trade for maintenance and shift presence has been determined, divide it by the number of productive hours (hours/person/year available to perform work) to get the required number of personnel for each trade. The resulting numbers will be fractional numbers that can be addressed by overtime (less than 10% overtime advised), contracting, or rounding up.
Data center personnel also need to be technically qualified to perform their assigned activities. As the Tier level or complexity of the data center increases, the qualification levels for the technicians also increase. They all need to have the required licenses for their trades and job description as well as the appropriate experience with data center operations. Lack of qualified personnel results in:
• Maintenance being performed incorrectly
• Poor quality of work
• Higher incidents of human error
• Inability to react and correct data center issues
Organized for Response
A properly organized data center staff understands the reporting chain of each organization, along with their individual roles and responsibilities. To aid that understanding, an organization chart showing the reporting chain and interfaces between Facilities, IT, and Security should be readily available and identify backups for key positions in case a primary contact is unavailable.
Impacts to Operations
The following examples from three actual operational data centers show how staffing inefficiencies may affect data center availability
The first data center had two to three personnel per shift covering the data center 24 x 7, which is one of the larger staff counts that Uptime Institute typically sees. Further investigation revealed that only two individuals on the entire data center staff were qualified to operate and maintain equipment. All other staff had primary functions in other non-critical support areas. As a result, personnel unfamiliar with the critical data center systems were performing activities for shift presence. Although maintenance functions were being done, if anything was discovered during rounds additional personnel had to be called in increasing the response time before the incident could be addressed.
The second data center had very qualified personnel; however, the overall head count was low. This resulted in overtime rates far exceeding the advised 10% limit. The personnel were showing signs of fatigue that could result in increased errors during maintenance activities and rounds.
The third data center relied solely on a call in method to respond to any incidents or abnormalities. Qualified technicians performed maintenance two or three days a week. No personnel were assigned to perform shift rounds. On-site Security staff monitored alarms, which required security staff to call in maintenance technicians to respond to alarms. The data center was relying on the redundancy of systems and components to cover the time it took for technicians to respond and return the data center to normal operations after an incident.
Although these examples show deficiencies in individual data centers, many data centers are less than optimally staffed. In order to be fully effective in a management and operations behavior, the organization must be Proactive, Practiced, and Informed. Data centers may have the right number of personnel (Proactive), but they may not be qualified to perform the required maintenance or shift presence functions (Practiced), or they may not have well-defined roles and responsibilities to identify which group is responsible for certain activities (Informed).
Figure 3 shows the percentage of data centers that were found to have ineffective behaviors in the areas of staffing, qualifications, and organization.
Staffing (appropriate number of personnel) is found to be inadequate in only 7% of data centers assessed. However, personnel qualifications are found to be inadequate in twice as many data centers, and the way the data center is organized is found to be ineffective even more often. Although these percentages are not very high, staffing affects all data center management. Staffing shortcomings are found to affect maintenance, planning, coordination, and load management activities.
The effects of staffing inadequacies show up most often in data center operations. According to the Uptime Institute Abnormal Incident Reports (AIRs) database, the root cause of 39% of data center incidents falls into the operational area (see Figure 4). The causes can be attributed to human error stemming from fatigue, lack of knowledge on a system, and not following proper procedure, etc. The right, qualified staff could potentially prevent many of these types of incidents.
Adopting the proven Start with the End in Mind methodology provides the opportunity to justify the operations staff early in the planning cycle by clearly defining service levels and the required staff to support the business. Having those discussions with the business and correlating it to the cost of downtime should help management understand the returns on this investment.
Staffing 24 x 7
When developing an operations team to support a data center, the first and most crucial decision to make is to determine how often personnel need to be available on site. Shift presence duties can include a number of things, including facility rounds and inspections, alarm response, vendor and guest escorts, and procedure development. This decision must be made by weighing a variety of factors, including criticality of the facility to the business, complexity of the systems supporting the data center, and, of course, cost.
For business objectives that are critical enough to require Tier III or IV facilities, Uptime Institute recommends a minimum of one to two qualified operators on site 24 hours per day, 7 days per week, 365 days per year (24 x 7). Some facilities feel that having operators on site only during normal business hours is adequate, but they are running at a higher risk the rest of the time. Even with outstanding on-call and escalation procedures, emergencies may intensify quickly in the time it takes an operator to get to the site.
Increased automation within critical facilities causes some to believe it appropriate to operate as a “Lights Out” facility. However, there is an increased risk to the facility any time there is not a qualified operator on site to react to an emergency. While a highly automated building may be able to make a correction autonomously from a single fault, those single faults often cascade and require a human operator to step in and make a correction.
The value of having qualified personnel on site is reflected in Figure 5, which shows the percentage of data center saves (incident avoidance) based on the AIRs database.
Equipment redundancy is the largest single category of saves at 38%. However, saves from staff performing proper maintenance and having technicians on site that detected problems before becoming incidents totaled 42%.
Justifying Qualified Staff
The cost of having qualified staff operating and maintaining a data center is typically one of the largest, if not the largest, expense in a data center operating budget. Because of this, it is often a target for budget reduction. Communicating the risk to continuous operations may be the best way to fight off staffing cuts when budget cuts are proposed. Documenting the specific maintenance activities that will no longer be performed or the availability of personnel to monitor and respond to events can support the importance of maintaining staffing levels.
Cutting budget in this way will ultimately prove counterproductive, result in ineffective staffing, and waste initial efforts to design and plan for the operation of a highly available and reliable data center. Properly staffing, and maintaining the appropriate staffing, can reduce the number and severity of incidents. In addition, appropriate staffing helps the facility operate as designed, ensuring planned reliability and energy use levels.
Source link: https://journal.uptimeinstitute.com/data-center-staffing