Disaster Recovery Planning in IT Management

Disaster recovery planning is the mechanism by which technology-related disasters are anticipated and addressed. Just what is a “technology related disaster”? Oddly enough, the first challenge in the planning process is to define what the word “disaster” means in the IT management context.

In IT, a disaster can be any unexpected problem that results in a slowdown, interruption or failure in a key system or network.  These problems can be caused by natural disasters (e.g. fire, earthquake, hurricane…), technology failures, malicious acts, incompatibilities, or simple human error.  Whatever the cause, service outages, connectivity failures, data loss, and related technical issues can disrupt business operations, causing lost revenue, increased expenses, customer service problems, and lowered workplace productivity.  IT disaster recovery planning strategies must be created to respond to these varied realities and perceptions.  To that end, these strategies must address three (3) basic needs:

  • Prevention (to avoid and minimize disaster frequency and occurrence).
  • Anticipation (to identify likely disasters and related consequences).
  • Mitigation (to take steps for managing disasters to minimize negative impact).

Action Item: It’s time to get your disaster recovery plans underway with the steps and techniques provided in our full IT Service Strategy Toolkit.

Fundamental Planning Goals and Objectives

There is no doubt that technology can offer many benefits to a business. Once you acknowledge the value of technology to your organization, you must also consider the related consequences if and when that technology becomes temporarily unavailable, or totally inaccessible.  Your ability and willingness to address these issues can offer several key operational benefits:

  • To minimize the negative impact of any disaster.
  • To save time and money in the recovery process in the event of a disaster.
  • To provide for an orderly recovery process, reducing “panic” decision making.
  • To protect technology assets owned by a business, maximizing ROI.
  • To minimize legal or regulatory liabilities.
  • To promote systems and IT service quality, reliability and security.
  • To promote the value of technology and related IT services within your organization.
  • To promote management awareness, and to set realistic expectations about the need for systems management tools and resources.

Disaster Recovery Planning in Practice

In the IT management context, there are many levels to defining “disaster” and multiple options to address each level.  To make things easier, the broad view of disaster recovery can be broken down into three (3) primary planning options: prevention, anticipation and mitigation.

Prevention:  Avoiding Disaster Events to the Extent Possible

The goal of “preventative” disaster recovery planning is to ensure that all key systems are as secure and reliable as possible, in order to reduce the frequency or likelihood of “technology related disasters”. Since natural disasters usually lie outside our sphere of influence, prevention most often applies to systems problems and human errors, including physical hardware failures, software bugs, configuration errors and omissions, and acts of malicious intent (virus attacks, security violations, data corruption…). Using the right set of tools and techniques, it is possible to reduce both the frequency of these sorts of “disasters” and the damage they cause.

Anticipation:  Planning for the Most Likely Events

Anticipation strategies revolve around assumptions: the ability to foresee possible disasters in order to identify likely consequences and appropriate responses. Without a crystal ball, contingency planning can be a challenging process. It involves knowledge and careful analysis. Knowledge is derived from experience and information: understanding the systems you have, how they are configured, and what sorts of problems or failures are likely to occur. The related analysis involves a careful balancing of circumstances and consequences.

Mitigation:  Get Ready to React and Recover

Mitigation is all about reaction and recovery: the ability to respond when and if a disaster occurs. Accepting that certain disasters are unavoidable, and perhaps inevitable, the goal of any mitigation strategy is to minimize negative impact, typically through steps such as the following:

  1. Maintain current technical documentation to facilitate recovery should a problem occur.
  2. Conduct regular tests of your disaster recovery plans and strategies.
  3. Keep loaner equipment available for immediate use.
  4. Create regular back-ups of applications, data and hardware configurations.
  5. Maintain an “alternative workplace plan” to allow designated staff to work from home or other locations.
  6. Identify manual or standalone operating procedures in the event of a prolonged outage.
  7. Coordinate IT disaster recovery plans with other corresponding emergency, security and employee safety programs/policies.

Sources: http://infochief.com.vn/ and http://it-toolkits.org/

Open source systems management tools

If your IT shop has the right skills, open source systems management tools may be a fit for your data center and may save money over proprietary solutions. These slides feature some of the top tools.

Large IT organizations turn to open source systems management tools

Top areas where open source systems management tools are used

Usenix, a systems administration user group, and Zenoss, an open source systems management vendor, recently completed a survey on open source systems management software use between 2006 and 2009. Respondents were attendees of the organization’s Large Installation System Administrators conference. Nearly all respondents use or plan to use open source systems management tools, with many shops turning to Nagios, Cacti, Zabbix, GroundWork and the OpenNMS project. When asked “What are the top areas where you plan to use open source systems management tools?” 90% answered monitoring, around 60% said configuration and around 50% said patch management.

The benefits of open source systems management

Top reason for using open source software

When asked the question “Why did you or would you be likely to try open source software?” responding shops said that they have turned to open source systems management tools to reduce costs and increase flexibility. Easy deployment was also a top reason for trying open source. In 2006, only 26% of survey respondents indicated this as a reason for using open source; in 2009, however, 71% of all respondents indicated this as a reason for using open source. This finding may indicate that open source not only removes technical hurdles but also preempts some of the bureaucratic obstacles associated with the traditional technology procurement process. “Open source offerings are newer and often written to be easier to deploy than older systems,” said Michael Coté, an analyst at RedMonk, an industry analyst firm. “An admin can download and install it without asking for funding, agreeing to any terms for a trial or filling out registration forms. Being able to download a piece of software by right-clicking is going to be easier than most other acquisition paths.”

The drawbacks to open source systems management

Top reasons for not using open source

So what are the primary reasons IT shops would not use open source tools? Lack of support was the main culprit, and users said proprietary tools had better support and product maturity as well as less risk. “You get the support you pay for,” Coté said. “If you don’t want to pay anything, just download Nagios, OpenNMS or Zenoss Core and go at it alone. You’ll be paying in your time: time to ask questions in forums and wait for answers, time to look through existing write-ups on the Web, and, if you’re of the right kind of mind, time to look through the code yourself. Closed-source offerings can seem to have more support available because you’re required to buy support.”

Ed Bailey, a Unix team lead at a major credit reporting agency, uses Hyperic HQ Enterprise, the proprietary version of Hyperic HQ, to manage Web applications that drive his company’s revenue. Bailey said he doesn’t have the time to cobble together — let alone develop and maintain — the automation, security and reporting features that ship with the enterprise version. “You can make a reporting system for the open source version of Hyperic HQ. If you have the time, you can make anything. But our company is more focused on things that generate revenue rather than me spending time working on this,” Bailey said. “I used to work at a university and we had time to build something like that, whereas now we have millions of transactions that are making money.”

Special skills to use open source systems management tools?

What skill set do sys admins need to have to deploy systems management software successfully in an IT organization? “Any scripting experience in general is helpful,” said Ryan Matte, a data center admin at Nova Networks Inc. “Basic Python knowledge is very helpful when using Zenoss. I often use Bash scripting as well. A decent understanding of SNMP [Simple Network Management Protocol] is definitely required (since the open source products don’t tend to be as automated as the enterprise products). I often find myself developing custom SNMP monitoring templates for devices, [but] … you should have an understanding of whatever protocols you are working with. An understanding of Linux/BSD [Berkeley Software Distribution] is helpful as well since most of the open source monitoring products that I’ve seen only run on Linux/BSD.”
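As a rough illustration of the kind of glue scripting Matte describes, here is a minimal Python sketch that polls a device over SNMP by shelling out to the Net-SNMP snmpget command. It assumes the Net-SNMP command-line tools are installed; the host address, community string, and OID are placeholder assumptions, not values from the article.

```python
#!/usr/bin/env python3
"""Minimal SNMP polling sketch (assumes Net-SNMP's snmpget is installed)."""
import subprocess

def snmp_get(host: str, community: str, oid: str) -> str:
    """Run snmpget against an SNMPv2c agent and return the value as text."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # sysUpTime.0 is a standard MIB-II OID present on most SNMP agents;
    # the address and community string below are placeholders.
    uptime = snmp_get("192.0.2.10", "public", "1.3.6.1.2.1.1.3.0")
    print(f"sysUpTime: {uptime}")
```

In practice a monitoring product such as Zenoss or Nagios wraps this kind of polling for you; the point of the sketch is only that basic scripting plus an understanding of SNMP goes a long way when extending the open source tools.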

Virtualization driving proprietary management tool dominance

Percentage of respondents who cite product features as the more important advantage of proprietary software

Starting in 2009, a much larger percentage of data center managers indicated proprietary systems management software has an advantage over open source tools in advanced product features. In 2009, 33% of all respondents indicated that product features played a bigger part in defining the advantages of commercial tools, versus 10% in the previous year. Though not explicitly spelled out in the survey, you can translate product features to “virtualization management features.” Matte is using Zenoss’ open source offering, Zenoss Core, and said he has evaluated Zenoss’ proprietary enterprise ZenPacks, which have virtual machine management features. “I have taken a look at the enterprise ZenPacks, and there is nothing like the VMware [Management] Pack in the open source community,” Matte said.

Open source systems management profile: Spacewalk

Spacewalk

Spacewalk is an open source Linux systems management tool and the upstream community project from which the Red Hat Network Satellite product is derived. Spacewalk provides provisioning and monitoring capabilities as well as software content management. James Hogarth, a data center admin in the U.K., uses Spacewalk to manage 100 hosts in a CentOS-based environment for an entertainment website built on Grails. Hogarth said his company’s entire environment is focused on open source software — even migrating server virtualization from VMware to the Red Hat Kernel-based Virtual Machine (or KVM) hypervisor — and that open source focus was a major factor in the decision to use open source systems management tools.

Hogarth said he’s run into some gotchas and issues that needed a workaround, but overall Spacewalk has lightened his support workload. Most of the development is done by Red Hat personnel, and the developers are often available to answer questions and troubleshoot issues. “People are very responsive [on the support forum], and it’s relatively rare that you don’t get a response,” Hogarth said. “Over the last two years, the product has really matured.”

Open source data center automation and configuration tools

Puppet is one option

In the open source space, Cfengine and Puppet are leading data center automation and configuration tools. In 1993, Mark Burgess at Oslo University College wrote Cfengine, which can be used to build, deploy, manage and audit all the major operating systems. Cfengine boasts some large customers, including companies such as eBay and Google. Cfengine offers a proprietary commercial version called Cfengine Nova. As an open source-only product, Puppet takes a different approach, and its creators, Puppet Labs, make money through training and support. Puppet founder Andrew Schafer, for example, wrote a column on Puppet and how it works. Also, James Turnbull recently wrote a book on using Puppet in the data center. Turnbull has also written tips on Puppet, including a recent article on using the Puppet dashboard. The Oregon State University Open Source Laboratory uses Cfengine for systems management but planned to move to Puppet. “From a technical point of view, Puppet offers more flexibility and an ability to actually use real code to deal with tasks. Cfengine has its own syntax language, but it’s not really suited for complex tasks,” said OSUOSL administrator Lance Albertson in an interview earlier this year.

Open core versus open source software

Some companies offer what’s considered “open core” systems management software. At the base level is a functional, free open source tool (like Zenoss Core or Hyperic HQ), and there is a separate proprietary enterprise version with special add-ons and features. This business model rankles some open source advocates, but it offers companies the chance to use a tool risk free, and oftentimes organizations can make the free version work. Ryan Matte, a data center admin at Ottawa, Ontario-based Nova Networks Inc., uses Zenoss Core to manage more than 1,000 devices, monitoring Windows, Linux, Solaris and network devices. Matte considered Nagios, Zabbix, and OpenNMS. “In terms of ease of use and setup and having all the monitoring capabilities in the product, Zenoss was the best choice,” he said. “There’s an IRC channel chat room — I’m in there quite a bit. There are always people in there. The [community] support is pretty good, but you have to come in during business hours.”

Using Webmin for data center server management

Webmin

Webmin offers a browser-based interface to Unix and Linux operating systems. It can configure users, disk quotas, services or configuration files as well as modify and control open source apps.

Using Nagios in the data center to manage servers

Nagios

In many data center environments, Nagios has become the de facto standard for companies in need of an open source, fault-tolerant solution to monitor single points of failure, service-level agreement shortcomings, servers, redundant communication connections or environmental factors. But is this one-size-fits-all open source tool best suited to your data center? SearchDataCenter.com has published tips on using Nagios.

Best Practices for data center monitoring and server room monitoring

1. Rack Level Monitoring

Based on a recent Gartner study, the annual cost of a Wintel rack averages around USD 70,000 per year. This excludes the business cost of a rack. Risking the loss of business continuity or your infrastructure due to environmental issues is not an option. So what are the environmental threats at a rack level?

A mistake often made is to rely only on monitoring the conditions at a room level and not at a rack level. The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) recommends no fewer than 6 temperature sensors per rack in order to safeguard the equipment (top, middle and bottom, at the front and back of the rack). When a heat issue arises, the air conditioning units will initially try to compensate for the problem. This means that with room-level temperature monitoring, the issue will only be detected when the running air conditioning units are no longer capable of compensating for the heat problem. By then it may be too late.

We recommend monitoring temperature per rack at a minimum of 3 points: at the bottom front of the rack, to verify the temperature of the cold air arriving at the rack (combined with airflow monitoring); at the top front of the rack, to verify that all cold air gets to the top of the rack; and at the top back of the rack, which is typically the hottest point of the rack. Intake temperature should be between 18°-27°C / 64°-80°F. Outtake temperature should typically be no more than 20°C / 35°F above the intake temperature.
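As an illustration of those thresholds, here is a minimal Python sketch that flags intake readings outside the 18-27°C band and outtake readings more than 20°C above intake. The rack names and readings are placeholders, and how the readings are collected from the sensors is deliberately left out.

```python
# Minimal rack-level temperature check based on the thresholds above.
INTAKE_MIN_C = 18.0
INTAKE_MAX_C = 27.0
MAX_DELTA_C = 20.0  # allowed rise from intake to outtake

def check_rack(intake_c: float, outtake_c: float) -> list[str]:
    """Return alert messages for one rack (empty list if all is well)."""
    alerts = []
    if not INTAKE_MIN_C <= intake_c <= INTAKE_MAX_C:
        alerts.append(f"intake {intake_c:.1f} C outside {INTAKE_MIN_C}-{INTAKE_MAX_C} C range")
    if outtake_c - intake_c > MAX_DELTA_C:
        alerts.append(f"outtake {outtake_c:.1f} C is more than {MAX_DELTA_C} C above intake")
    return alerts

if __name__ == "__main__":
    # Example readings; in practice these would come from the rack sensors.
    readings = {"rack-01": (21.0, 35.5), "rack-02": (29.0, 52.0)}
    for rack, (intake, outtake) in readings.items():
        for message in check_rack(intake, outtake) or ["ok"]:
            print(f"{rack}: {message}")
```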

What is the impact of temperature on your systems? High-end systems have auto-shutdown capabilities to safeguard themselves against failures when the temperature is too high. However, before this happens, systems will experience computation errors at the CPU level, resulting in application errors. Then system cooling (fans) will be stressed, reducing equipment lifetime expectancy (and, as such, availability and business continuity).

2. Ambient Room Monitoring

Ambient room monitoring is the environmental monitoring of the room for its humidity and temperature levels. Temperature and humidity sensors are typically deployed:

  • in potential “hot zones” inside the server room or data center
  • near air conditioning units, to detect failure of those systems

When multiple air conditioning systems are available in a room, a failure of one system will initially be compensated for by the others before it leads to a total failure of the cooling system due to overload. As a result, temperature/airflow sensors are recommended near each unit to get early failure detection. Humidity in server rooms should be between 40% and 60% rH: too dry, and static electricity will build up on the systems; too humid, and corrosion will slowly damage your equipment, resulting in permanent equipment failures. (A minimal threshold-check sketch for these values appears at the end of this section.)

When using cold corridors inside the data center, the ambient temperature outside the corridor may be at higher levels. Temperatures of 37°C / 99°F are not uncommon in such setups. This allows the energy cost to be reduced significantly. However, it also means that temperature monitoring is of the utmost importance, as a failing air conditioning unit will affect system lifetime and availability much faster (fan stress, CPU overheating, …), and running a room at higher temperatures may also affect non-rack-mounted equipment.

When using hot corridors, it is important to monitor temperature across the room to ensure that sufficient cold air gets to each rack. In this case, however, one can also rely on rack-based temperature sensors in addition to temperature and humidity sensors close to each air conditioning unit.
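The sketch below shows how the 40-60% rH band and the per-unit temperature comparison described above might be checked in code. The 5°C tolerance around each unit's setpoint is an illustrative assumption, not a figure from the article.

```python
# Minimal ambient checks: humidity kept in the 40-60% rH band, and early
# detection of a failing air conditioning unit by comparing the temperature
# measured next to the unit against its setpoint.
from typing import Optional

HUMIDITY_MIN_RH = 40.0
HUMIDITY_MAX_RH = 60.0
AC_DEVIATION_LIMIT_C = 5.0  # assumed tolerance around the unit's setpoint

def check_humidity(rh: float) -> Optional[str]:
    if rh < HUMIDITY_MIN_RH:
        return f"too dry ({rh:.0f}% rH): static electricity risk"
    if rh > HUMIDITY_MAX_RH:
        return f"too humid ({rh:.0f}% rH): corrosion risk"
    return None

def check_ac_unit(measured_c: float, setpoint_c: float) -> Optional[str]:
    if measured_c - setpoint_c > AC_DEVIATION_LIMIT_C:
        return f"supply air {measured_c:.1f} C vs setpoint {setpoint_c:.1f} C: possible unit failure"
    return None

if __name__ == "__main__":
    print(check_humidity(35.0))        # example dry-room reading
    print(check_ac_unit(26.5, 20.0))   # example failing-unit reading
```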

3. Water & Flooding Monitoring

Water leakage is a lesser-known threat for server rooms and data centers. The fact that most data centers and server rooms have raised floors makes the risk even bigger, as water seeks the lowest point.

Two types of water leakage sensors are commonly found: spot sensors and water snake (rope) cable sensors. Spot sensors trigger an alert when water touches the unit. Water rope or water snake cable sensors use a conductive cable whereby contact at any point on the cable will trigger an alert. The latter type is recommended over the former due to its greater range and higher accuracy.

If using a raised floor, one should consider putting the sensor under the raised floor, as water seeks the lowest point.

The four main sources of water in a server room are:

  • leaking air conditioning systems: a water sensor should be placed under each AC unit
  • water leaks from floors or the roof above the data center or server room: water sensors should be placed around the perimeter of the room, around 30-50 cm / 10-20 in. from the outer walls
  • leaks from water pipes running through server rooms: a water sensor should be placed under the raised floor
  • traditional flooding: the same placement as for leaks from floors or the roof above applies

4. Sensors Deployment

All sensors connect to a Sensorgateway (base unit). A base unit supports up to 2 wired sensors, or up to 8 with the optional sensor hub. The recommended deployment is summarized below.
Recommended sensor deployment (Application / Location / Setting / SKU / Sensor Package):

Rack Level Monitoring
  • Intake temperature monitoring. Location: front of rack, bottom for room or floor cooling, top for top cooling. Setting: 18-27°C / 64-80°F. SKU 182668, Temperature probes*.
  • Outtake temperature monitoring. Location: back of rack, at the top (hot air climbs). Setting: less than 20°C / 35°F difference from inlet temperature (typically below 40°C / 105°F). SKU 182668, Temperature probes*.

Ambient Monitoring
  • Temperature and humidity monitoring in the server room. Location: small server rooms: center of the room; data centers: potential hot zones, furthest away from the air conditioning units. Setting: temperature depends on the type of room setup; humidity 40-60% rH. SKU 306166, Temperature & Humidity Sensor Probe*.

Air Conditioning Monitoring
  • Early detection of failing air conditioning units. Location: next to the air conditioning units. Setting: temperature depends on the setting of the unit; humidity 40-60% rH. SKU 306166, Temperature & Humidity Sensor Probe*.

Water Leaks / Flooding
  • Detection of water leaks coming from outside the room. Location: around the outside walls of the server room / data center and under the raised floor; best kept 30-50 cm / 10-20 in. from the outer wall. SKU 180004, Flooding Sensor Probe* with 6 m / 20 ft water-sensitive cable.
  • Detection of water leaks from air conditioning units. Location: under each air conditioning unit. SKU 180004, Flooding Sensor Probe* with 6 m / 20 ft water-sensitive cable.

    * External probes need to be connected to a Sensorgateway (SKU 311323) in order to operate. One Sensorgateway has a built-in temperature probe and can support up to 2 external probes.

Source from: https://serverscheck.com/sensors/temperature_best_practices.asp


Best Practices for Data Center Relocation and Migration

Article by Nilesh Rane

The DC consolidation and migration journey is a rocky one, with challenges such as operational disruption. Mandar Kulkarni, Senior Vice President, Netmagic Solutions, shares some best practices to follow for a successful DC relocation and migration project. DC migration and consolidation is an uncomfortable truth for most Data Center Managers and CIOs.

An optimally functioning Data Center is business critical. But chances are that your organization’s data center is inadequate in some way: it is running out of capacity or compute, is operationally exorbitant, is outdated, or simply doesn’t match up to the growth of the organization.

According to a recently published Data Center survey report, over 30% of organizations across the globe plan to migrate or expand their data centers within the next 3 years. Most DCs in India are over 5-7 years old, are not designed for today’s power and cooling needs, are running out of space or performance, and have a total cost of ownership that is almost surpassing the growth in business revenues.

An unplanned DC relocation and migration exercise, done without the help of experts, runs into risky waters, resulting in anything from cost overruns and downtime to business loss or a complete blackout. Here are some best practices to ensure that a DC migration project is successful.

Best Practices For DC Migration

The solution to mitigating the challenges of DC relocation and migration is fairly simple: create a design and migration plan that keeps all the common pitfalls in mind and creates contingencies for them. Some of the best practices for a successful DC migration are as follows:

Start at the very beginning

Start the migration process as you would build a data center. Approach the migration exercise so that the new DC is planned for at least 2 lifecycles of infrastructure.

Identify and detail the starting point

It is important to do a comprehensive review of the current DC. Identify and document your organization’s technology and business requirements, priorities and processes. Then do a detailed review of the costs involved in the various methods of DC migration and consolidation.

Design the migration strategy

It is important to establish acceptable business downtime, determine hardware, application and other technology requirements, and prioritize business processes. Identify at least 2 migration methodologies and create plans for both. Bring all vendors and utility providers into the migration strategy and keep them involved.

Plan the layout – space planning

It is important to plan the new DC layout before you plan the migration itself. Think about white space, creating enough to allow for future growth. It is important to plan the space judiciously, and to get help from DC architects to design this part successfully.

Plan the DC migration

Put the relocation design into an action plan: a detailed floor plan, responsibility charts and checklists, migration priorities, mapped interdependencies, etc. Take into the plan inputs from telecom and power providers, technology vendors, and specialists.

Inventory everything

Start with a detailed inventory of everything: from applications to business needs to infrastructure, including each cable and device, to the network, including every link and port. It all needs to go into a database similar to a configuration management database (CMDB).
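As a sketch of what such an inventory record might look like, the Python snippet below defines a minimal, hypothetical CMDB-style entry; the field names and the dependency check are illustrative assumptions, not a prescribed schema.

```python
# A minimal CMDB-style inventory record for migration planning (illustrative).
from dataclasses import dataclass, field

@dataclass
class InventoryItem:
    name: str                      # e.g. hostname, cable label, or port ID
    category: str                  # "server", "switch", "cable", "link", ...
    location: str                  # current rack / room position
    owner: str                     # business or application owner
    depends_on: list[str] = field(default_factory=list)  # interdependencies to map
    target_location: str = ""      # planned position in the new DC

inventory: list[InventoryItem] = [
    InventoryItem("db-01", "server", "DC1 / rack A3 / U12", "billing team",
                  depends_on=["san-01", "core-sw-02"]),
    InventoryItem("core-sw-02", "switch", "DC1 / rack A1 / U40", "network team"),
]

# Simple completeness check before scheduling a move wave: every dependency
# should itself be inventoried.
known = {item.name for item in inventory}
for item in inventory:
    missing = [d for d in item.depends_on if d not in known]
    if missing:
        print(f"{item.name}: dependencies not yet inventoried: {missing}")
```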

Create a baseline

It is critical to know the current DC’s performance and TCO ratings. Basically, it is important to know your DC well before migration and to have a clear understanding of all aspects of it. Create a baseline for your DC so that it is easy to measure and tweak performance and efficiency post-migration.

Identify and create a risk management plan

Organizations should simply assume that things will go wrong and create an adequate contingency plan. Detailing and drafting a fully documented risk mitigation and management plan is essential. Then assess, classify, and prioritize the identified risks for the purposes of mitigation.

Take users and business owners along

It is important to inform all users of the migration plan, from end users to support teams and business owners. The key is to plan down to the last detail and to walk through the plan at the minutest level. Make sure to bring all the critical people to the planning events: facilities staff, project management teams, etc.

Identifying the right time for migration

It is important to select the right time for migration: for example, avoiding month-end and year-end, and not coinciding with public events such as elections, festivals, etc.

Logistics arrangements

Logistics arrangements need some looking into: who is going to pack, number and label all equipment; who is going to move the equipment to the destination; is there a backup vehicle in case of a breakdown; is there a need for armed guards for the transportation of equipment; and so on.

Upgrading systems during the migration

Old servers, switches, and storage devices that are out of warranty, or considered a risk when subjected to the strains and stresses of migration, should be identified and considered for replacement during planning. The move is an opportunity to reduce the overall footprint through consolidation, in the quest to improve the reliability, performance, and efficiency of your DC. It is a popular practice to use a data center move to consolidate the DC through virtualization.

Do pre and post migration testing

It is important to create a baseline of infrastructure, network and applications before executing the migration plan, so that you know exactly how things work. Document the tests and repeat them after the move: a full-fledged success plan.

Rely on experience

DC relocation and migration is not a regular enough occurrence for any single IT professional to have substantial experience with it. It is highly recommended to entrust the DC relocation and migration exercise to an experienced organization with proven capabilities.

Consider Experts

If it is only a data center move from one location to another, you should consider a reputable third party to support the move: a professional IT mover who will use specialized packing materials, etc. It is recommended to use a professional DC provider with expertise in DC migration and relocation; these establishments will have a proven data center relocation methodology and best practices that they can leverage for better results and success.

Contingency planning

Finally, even superior planning cannot offset unexpected failures. Contingency planning is critical even after the migration plan has taken all the common pitfalls into consideration. Planning for a failure is better than running from pillar to post when it occurs.

Standby Equipment

If equipment is damaged during transportation or does not function at the destination, it leads to delays or disruptions in setting up the new DC. It is important for the DC migration expert helping you to have standby equipment for cases such as these.

Insurance

It is important to insure all equipment in case any major disaster occurs during the migration process. If you are using a professional Data Center provider, it is important to add insurance to the checklist of requirements.

Identify and Plan for External Dependencies

It is critical to identify all external dependencies, such as network service providers, and their availability at the destination.

In Conclusion

In today’s dynamically changing marketplace and unpredictable economic climate, it is critical that data centers facilitate current business operations as well as provide for the future growth of the business. Following these best practices will help ensure the success of the DC relocation and migration: a good way to prevent disaster.

link source: http://www.netmagicsolutions.com/blog/datacenter-migration-best-practices#.VsQ3lIV97IU

How do you define data center size, density?

With shifts in the scale and density of data centers, one industry organization is drawing up ways to standardize how we talk about data center size and power needs.

There are plenty of metrics to measure data center footprint and power and cooling needs. AFCOM, the data center managers’ association, thinks it’s time to pare that down.
“You’ll hear people say ‘I have a very dense data center’ or ‘We have a small data center’ and that doesn’t really mean anything or relate to specific numbers,” said Tom Roberts, AFCOM president.

The association’s Data Center Institute think tank worked with data center designers, operators and vendors to qualify the terms for data center size and density, presented in the free paper, Data Center Standards. Read an excerpt from the paper here.

AFCOM describes data center size by compute space, and density by measured peak kilowatt (kW) load.

To the extreme

AFCOM segments data center density into four categories: low (up to 4 kW per rack), medium (5 kW to 8 kW), high (8 kW to 15 kW) and extreme (more than 16 kW per rack average).
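A minimal sketch of these buckets as a lookup function follows. Because the published bands leave the exact boundary handling ambiguous (for example, between 15 and 16 kW), the cutoffs below are an interpretation rather than AFCOM's formal definition.

```python
# Classify a measured average peak kW per rack into the AFCOM density
# categories described above (boundary handling is an assumption).

def density_category(avg_peak_kw_per_rack: float) -> str:
    if avg_peak_kw_per_rack <= 4:
        return "low"
    if avg_peak_kw_per_rack <= 8:
        return "medium"
    if avg_peak_kw_per_rack <= 15:
        return "high"
    return "extreme"

if __name__ == "__main__":
    for kw in (3.0, 6.5, 12.0, 22.0):
        print(f"{kw:>5.1f} kW/rack -> {density_category(kw)}")
```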

The focus on density is timely. Colocation contracts revolve more around power today than they did five years ago, when the conversation was about space, said John Sheputis, president, Infomart Data Centers, a U.S. colocation space provider.

Server consolidation — via virtualization and processor evolution — increases data center density per square foot. There are fewer cabinets and fewer power supplies to manage, with less fiber to run — all good things from an IT operations point of view, Sheputis said. But these trends change the understanding of high and low density.

Cosentry, a colocation provider headquartered in Omaha, Neb., tracks average power draw per cabinet in its facilities to baseline server space designs.

“Ten years ago, average power draw per cabinet was probably 700 to 800 watts,” said Jason Black, VP of data center services at Cosentry. “Five years ago, it was 1.5 kW. Now, 3 kW. On current trend, we’ll see five or six kilowatt average power draw in five years.”
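A quick sanity check of that trend, assuming the roughly five-year doubling implied by Black's figures (the code and the projection are illustrative, not part of the article's data):

```python
# Per-cabinet power draw roughly doubles every five years in Black's figures
# (0.75 kW -> 1.5 kW -> 3 kW). Projecting one more period forward reproduces
# the 5-6 kW estimate he cites.
current_kw = 3.0
doubling_period_years = 5
projected_kw = current_kw * 2 ** (5 / doubling_period_years)  # five years out
print(f"Projected average draw in five years: ~{projected_kw:.0f} kW per cabinet")
```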

Infomart experienced this firsthand when merging its Dallas operations with Fortune Data Centers’ Hillsboro, Ore., and San Jose operations, and acquired a former AOL data center in Ashburn, Va.

“The energy density of older data centers is two to three times lower than in newer data centers,” Sheputis said, adding that standards for energy density change greatly in a short time.

This was evident comparing the older Ashburn facility to the state-of-the-art facility in Dallas. Ashburn will undergo a renovation, not just for space but for higher-density operations, before opening in 2015.

AFCOM plans to aggregate similar baseline tracking and comparison data for a broad swath of data centers by standardizing size and density terminology.

Devil in the density details

Although AFCOM’s categories classify the total density of the data center, the devil for planning that space is in the details.

The same square footage that previously held 2 kW mixed cabinets now has a row of 8 kW servers, a set of storage arrays consuming 4 kW each, and low-power network and peripheral cabinets. A supercomputing island in one part of the data center handles big data processing at 15 kW per rack, while the other racks use only 3 kW or 4 kW each. Facility planning isn’t just about aggregate power and cooling needs, but also the layout of IT systems using the space.

Square footage discussions are still useful, Black said. But the most important thing is how many rack location units are available in a given space.

AFCOM therefore segments data center sizes, from mini (room for up to 10 racks) through mega (room for more than 9,000 racks), in combination with the density measurements above that yield power demand information.

“Watts per square foot is a flawed standard for today’s workloads,” Cosentry’s Black said.

Rack location units is a term that’s evolved recently to help estimate utilization in a given room footprint, or estimate capacity. It takes into account the cabinet footprint and hot and cold aisle allowances. But not every IT organization can discuss their data center needs by this metric.

“In many cases, the art of managing physical space has been dished off to IT people with expertise in other areas, like storage and network,” Black said. “Most people are sub-optimized in the data center and don’t know best practices.”

In an on-premises data center, perhaps clarity around power and density doesn’t matter as much. The power bill comes out of the facilities budget, and as long as cooling keeps up with the hottest cabinet in the room, your terminology is unimportant. But today, on-premises facilities face end of life or major upgrades, power usage effectiveness comes under executive-level (and executive branch) scrutiny, and many companies plan the move into a colocation facility. Suddenly, IT leaders need to know how to communicate effectively about the space, power and cooling that important workloads require.

AFCOM’s intent is for a data center manager to be able to measure compute space, designed density and current power draw, and say that they run, for example, a small-size data center designed for low density, currently operating at medium density at 52% of rack yield.

link source: http://searchdatacenter.techtarget.com/feature/How-do-you-define-data-center-size-density


Proper Data Center Staffing is Key to Reliable Operations

The care and feeding of a data center
By Richard F. Van Loo

Managing and operating a data center comprises a wide variety of activities, including the maintenance of all the equipment and systems in the data center, housekeeping, training, and capacity management for space, power and cooling. These functions have one requirement in common: the need for trained personnel. As a result, an ineffective staffing model can impair overall availability.

The Tier Standard: Operational Sustainability outlines behaviors and risks that reduce the ability of a data center to meet its business objectives over the long term. According to the Standard, the three elements of Operational Sustainability are Management and Operations, Building Characteristics, and Site Location (see Figure 1).

Figure 1. According to Tier Standard: Operational Sustainability, the three elements of Operational Sustainability are Management and Operations, Building Characteristics, and Site Location.

Management and Operations comprises behaviors associated with:

• Staffing and organization

• Maintenance

• Training

• Planning, coordination, and management

• Operating conditions

Building Characteristics examines behaviors associated with:

• Pre-Operations

• Building features

• Infrastructure

Site Location addresses site risks due to:

• Natural disasters

• Human disasters

Management and Operations includes the behaviors that are most easily changed and have the greatest effect on the day-to-day operations of data centers. All the Management and Operations behaviors are important to the successful and reliable operation of a data center, but staffing provides the foundation for all the others.

Staffing
Data center staffing encompasses the three main groups that support the data center: Facility, IT, and Security Operations. Facility operations staff addresses management, building operations, and engineering and administrative support. Shift presence, maintenance, and vendor support are the areas that support the daily activities that can affect data center availability.

The Tier Standard: Operational Sustainability breaks Staffing into three categories:

• Staffing. The number of personnel needed to meet the workload requirements for specific maintenance activities and shift presence.

• Qualifications. The licenses, experience, and technical training required to properly maintain and operate the installed infrastructure.

• Organization. The reporting chain for escalating issues or concerns, with roles and responsibilities defined for each group.

In order to be fully effective, an enterprise must have the proper number of qualified personnel, organized correctly. Uptime Institute Tier Certification of Operational Sustainability and Management & Operations Stamp of Approval assessments repeatedly show that many data centers are less than fully effective because their staffing plan does not address all three categories.

Headcount
The first step in developing a staffing plan is to determine the overall headcount. Figure 2 can assist in determining the number of personnel required.

Figure 2. Factors that go into calculating staffing requirements

The initial steps address how to determine the total number of hours required for maintenance activities and shift presence. Maintenance hours include activities such as:

• Preventive maintenance

• Corrective maintenance

• Vendor support

• Project support

• Tenant work orders

The number of hours for all these activities must be determined for the year and attributed to each trade.

For instance, the data center must determine what level of shift presence is required to support its business objective. As uptime objectives increase, so do staffing presence requirements. Besides deciding whether personnel are needed on site 24 x 7 or at some lesser level, the data center operator must also decide what level of technical expertise or trade is needed. This may result in two or three people on site for each shift. These decisions make it possible to determine the number of people and hours required to support shift presence for the year. Activities performed on shift include conducting rounds, monitoring the building management system (BMS), operating equipment, and responding to alarms. These jobs do not typically require all the hours allotted to a shift, so other maintenance activities can be assigned during that shift, which will reduce the overall total number of staffing hours required.

Once the total number of hours required by trade for maintenance and shift presence has been determined, divide it by the number of productive hours (hours/person/year available to perform work) to get the required number of personnel for each trade. The resulting numbers will be fractional and can be addressed by overtime (less than 10% overtime is advised), contracting, or rounding up.
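A minimal sketch of that arithmetic follows, with illustrative figures rather than Uptime Institute data; the productive-hours value and the example workload are assumptions.

```python
# Headcount estimate per trade: total annual hours (maintenance + shift
# presence) divided by productive hours per person per year.
import math

PRODUCTIVE_HOURS_PER_PERSON = 1800   # assumed hours/person/year available for work
MAX_OVERTIME_FRACTION = 0.10         # keep overtime under 10%, per the article

def staff_needed(maintenance_hours: float, shift_presence_hours: float) -> dict:
    total = maintenance_hours + shift_presence_hours
    fte = total / PRODUCTIVE_HOURS_PER_PERSON
    rounded_down = math.floor(fte)
    # Can the fractional remainder be absorbed within the advised overtime limit?
    overtime_ok = rounded_down > 0 and (fte - rounded_down) <= rounded_down * MAX_OVERTIME_FRACTION
    return {
        "fte_required": round(fte, 2),
        "headcount_if_rounding_up": math.ceil(fte),
        "coverable_with_overtime": overtime_ok,
    }

if __name__ == "__main__":
    # e.g. one trade: 2,600 maintenance hours plus 1,100 shift-presence hours per year
    print(staff_needed(2600, 1100))
```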

Qualification Levels
Data center personnel also need to be technically qualified to perform their assigned activities. As the Tier level or complexity of the data center increases, the qualification levels for the technicians also increase. They all need to have the required licenses for their trades and job description as well as the appropriate experience with data center operations. Lack of qualified personnel results in:

• Maintenance being performed incorrectly

• Poor quality of work

• Higher incidence of human error

• Inability to react and correct data center issues

Organized for Response
A properly organized data center staff understands the reporting chain of each organization, along with their individual roles and responsibilities. To aid that understanding, an organization chart showing the reporting chain and interfaces between Facilities, IT, and Security should be readily available and identify backups for key positions in case a primary contact is unavailable.

Impacts to Operations
The following examples from three actual operational data centers show how staffing inefficiencies may affect data center availability.

The first data center had two to three personnel per shift covering the data center 24 x 7, which is one of the larger staff counts that Uptime Institute typically sees. Further investigation revealed that only two individuals on the entire data center staff were qualified to operate and maintain equipment. All other staff had primary functions in other non-critical support areas. As a result, personnel unfamiliar with the critical data center systems were performing activities for shift presence. Although maintenance functions were being done, if anything was discovered during rounds, additional personnel had to be called in, increasing the response time before the incident could be addressed.

The second data center had very qualified personnel; however, the overall head count was low. This resulted in overtime rates far exceeding the advised 10% limit. The personnel were showing signs of fatigue that could result in increased errors during maintenance activities and rounds.

The third data center relied solely on a call-in method to respond to any incidents or abnormalities. Qualified technicians performed maintenance two or three days a week. No personnel were assigned to perform shift rounds. On-site Security staff monitored alarms and had to call in maintenance technicians to respond to them. The data center was relying on the redundancy of systems and components to cover the time it took for technicians to respond and return the data center to normal operations after an incident.

Assessment Findings
Although these examples show deficiencies in individual data centers, many data centers are less than optimally staffed. In order to be fully effective in a management and operations behavior, the organization must be Proactive, Practiced, and Informed. Data centers may have the right number of personnel (Proactive), but they may not be qualified to perform the required maintenance or shift presence functions (Practiced), or they may not have well-defined roles and responsibilities to identify which group is responsible for certain activities (Informed).

Figure 3 shows the percentage of data centers that were found to have ineffective behaviors in the areas of staffing, qualifications, and organization.

Figure 3. Ineffective behaviors in the areas of staffing, qualifications, and organization.

Staffing (appropriate number of personnel) is found to be inadequate in only 7% of data centers assessed. However, personnel qualifications are found to be inadequate in twice as many data centers, and the way the data center is organized is found to be ineffective even more often. Although these percentages are not very high, staffing affects all data center management. Staffing shortcomings are found to affect maintenance, planning, coordination, and load management activities.

The effects of staffing inadequacies show up most often in data center operations. According to the Uptime Institute Abnormal Incident Reports (AIRs) database, the root cause of 39% of data center incidents falls into the operational area (see Figure 4). The causes can be attributed to human error stemming from fatigue, lack of knowledge on a system, and not following proper procedure, etc. The right, qualified staff could potentially prevent many of these types of incidents.

Figure 4. According to the Uptime Institute Abnormal Incident Reports (AIRs) database, the root cause of 39% of data center incidents falls into the operational area.

Adopting the proven Start with the End in Mind methodology provides the opportunity to justify the operations staff early in the planning cycle by clearly defining service levels and the required staff to support the business.  Having those discussions with the business and correlating it to the cost of downtime should help management understand the returns on this investment.

Staffing 24 x 7
When developing an operations team to support a data center, the first and most crucial decision to make is to determine how often personnel need to be available on site. Shift presence duties can include a number of things, including facility rounds and inspections, alarm response, vendor and guest escorts, and procedure development. This decision must be made by weighing a variety of factors, including criticality of the facility to the business, complexity of the systems supporting the data center, and, of course, cost.

For business objectives that are critical enough to require Tier III or IV facilities, Uptime Institute recommends a minimum of one to two qualified operators on site 24 hours per day, 7 days per week, 365 days per year (24 x 7). Some facilities feel that having operators on site only during normal business hours is adequate, but they are running at a higher risk the rest of the time. Even with outstanding on-call and escalation procedures, emergencies may intensify quickly in the time it takes an operator to get to the site.

Increased automation within critical facilities causes some to believe it appropriate to operate as a “Lights Out” facility. However, there is an increased risk to the facility any time there is not a qualified operator on site to react to an emergency. While a highly automated building may be able to make a correction autonomously from a single fault, those single faults often cascade and require a human operator to step in and make a correction.

The value of having qualified personnel on site is reflected in Figure 5, which shows the percentage of data center saves (incident avoidance) based on the AIRs database.

Figure 5. The percentage of data center saves (incident avoidance) based on the AIRs database

Equipment redundancy is the largest single category of saves at 38%. However, saves from staff performing proper maintenance and having technicians on site that detected problems before becoming incidents totaled 42%.

Justifying Qualified Staff
The cost of having qualified staff operating and maintaining a data center is typically one of the largest, if not the largest, expense in a data center operating budget. Because of this, it is often a target for budget reduction. Communicating the risk to continuous operations may be the best way to fight off staffing cuts when budget cuts are proposed. Documenting the specific maintenance activities that will no longer be performed or the availability of personnel to monitor and respond to events can support the importance of maintaining staffing levels.

Cutting budget in this way will ultimately prove counterproductive, result in ineffective staffing, and waste initial efforts to design and plan for the operation of a highly available and reliable data center. Properly staffing, and maintaining the appropriate staffing, can reduce the number and severity of incidents. In addition, appropriate staffing helps the facility operate as designed, ensuring planned reliability and energy use levels.

Source link: https://journal.uptimeinstitute.com/data-center-staffing