Top 10 data center operating procedures

Every data center needs to define its policies, procedures, and operational processes.

An ideal set of documentation goes beyond technical details about application configuration and notification matrices.

These top 10 areas should be part of your data center’s standard operating procedures manuals.

    1. Change control. In addition to defining the formal change control process, include a roster of change control board members and forms for change control requests, plans and logs.
    2. Facilities. Injury prevention program information is a good idea, as well as documentation regarding power and cooling emergency shut-off processes; fire suppression system information; unsafe condition reporting forms; new employee safety training information, logs and attendance records; illness or injury reporting forms; and visitor policies.
    3. Human resources. Include policies regarding technology training, as well as acceptable use policies, working hours and shift schedules, workplace violence policies, employee emergency contact update forms, vacation schedules, and anti-harassment and discrimination policies.
    4. Security. This is a critical area for most organizations. Getting all staff access to the security policies of your organization is half the battle. An IT organization should implement policies regarding third-party or customer system access, security violations, auditing, classification of sensitive resources, confidentiality, physical security, passwords, information control, encryption and system access controls.
    5. Templates. Providing templates for regularly used documentation types makes it easier to accurately capture the data you need in a format familiar to your staff. Templates to consider include policies, processes, logs, user guides and test/report forms.
    6. Crisis management. Having a crisis response scripted out in advance goes a long way toward reducing the stress of a bad situation. Consider including crisis management documentation around definitions; a roster of crisis response team members; crisis planning; an escalation and notification matrix; a crisis checklist; guidelines for communications; situation update forms, policies, and processes; and post-mortem processes and policies.
    7. Deployment. Repeatable processes are the key to speedy and successful workload deployments. Provide your staff with activation checklists, installation procedures, deployment plans, location of server baseline loads or images, revision history of past loads or images and activation testing processes.
    8. Materials management. Controlling your inventory of IT equipment pays off. Consider including these items in your organization’s documentation library: policies governing requesting, ordering, receiving and use of equipment for testing; procedures for handling, storing, inventorying, and securing hardware and software; and forms for requesting and borrowing hardware for testing.
    9. Internal communications. Interactions with other divisions and departments within your organization may be straightforward, but it is almost always helpful to provide a contact list of all employees in each department, with their work phone numbers and e-mail addresses. Keep a list of services and functions provided by each department, and scenarios in which it may be necessary to contact these other departments for assistance.
    10. Engineering standards. Testing, reviewing and implementing new technology in the data center is important for every organization. Consider adding these items to your organization’s standard operating procedures manuals: new technology request forms, technology evaluation forms and reports, descriptions of standards, testing processes, standards review and change processes, and test equipment policies.

About the author
Kackie Cohen is a Silicon Valley-based consultant providing data center planning and operations management to government and private sector clients. Kackie is the author of Windows 2000 Routing and Remote Access Service and co-author of Windows XP Networking.

Source from: http://searchdatacenter.techtarget.com/tip/Top-10-data-center-operating-procedures

Data Center Migrations & Consolidations

Business Challenge:

A major academic-based health system in Philadelphia required migration services consisting of relocating nearly 600 servers and related technology equipment from the primary data center to the new “designated data center”. ComSource needed to develop a data center migration plan that allowed the move to occur in phases, enabling the IT staff to concentrate on one critical factor at a time and minimizing the danger of excessive downtime at any point during the move.

The Solution:

ComSource, in conjunction with our migration partner, worked closely with the client’s project management office, attending pre-move meetings and planning sessions to develop a “playbook” based on a selected “move event” approach and timeline. Key to the plan’s success was our move methodology, which was based on application priorities and dependencies. Once these “logical” dependencies were determined, a hardware or physical dependency check was performed. This helped put the servers into various groups and identify which ones needed an asset swap, a parallel build, a forklift move or another type of approach. The data migration took place during 13 move events utilizing one truck per move. ComSource also provided relocation of the customer’s IT equipment, including packing and crating, loading, transporting, unloading and uncrating. As part of this data center migration it was imperative that the customer’s manufacturer hardware agreements and warranty coverage remain valid throughout the relocation event. ComSource provided consulting services which included suggesting best practices for relocation, inventory, communications, change management and other measures associated with the data center migration.

Results:

ComSource successfully completed a nearly 600-server and technology infrastructure move over a 13-weekend period on schedule, working within the customer’s timeline and budget and with minimal downtime to ongoing business operations.

Business Continuity, Recovery Services & Co-Location

Business Challenge:

A rapidly growing U.S.-based retail corporation maintained a production data center in New York State. While they had always cut daily incremental backup tapes and weekly “fulls” and sent them offsite to a secure location, they never had a contracted warm or hot site facility from which to recover their critical applications in the event of a disaster at the main production facility. Several years earlier, the lack of a contracted hot site recovery facility never seemed a major issue for a small retail company, just a potential minor inconvenience. As the company grew, in fact doubled and tripled in size, a more effective and comprehensive business recovery plan became not just desirable but critical.

The Solution:

The ComSource sales and support team went to work immediately. First and foremost, ComSource and their business recovery experts worked with the company to examine all workloads and applications, with the intent of prioritizing which applications absolutely needed to be up and running in hours versus days in the event of a “disaster”. The company’s key applications were hosted both on IBM’s Power family running OS/400 applications and on several Dell x86 servers. Once ComSource and their recovery team completed a full audit of all hardware platforms, all critical and non-critical applications and all current backup and recovery infrastructure, they jointly selected one of ComSource’s elite recovery site locations in northern Georgia for the hot site facility. In this case, ComSource’s long-time affiliate, a true “best in class” disaster recovery services organization, provided the customer with the best overall top-to-bottom recovery option: a secure facility, redundant components, extensive equipment inventory and staff expertise across all of the end user platforms.

Results:

Dedicated platforms were selected and deployed, and processes were implemented to ensure that in the event of a disaster at the production facility this rapidly growing retail organization could recover its mission critical applications quickly and efficiently. This long-time valued ComSource customer has continued to utilize this premier disaster recovery organization and has performed many complete recovery tests over several years. The end user’s executive team can now “sleep at night” knowing that in the event of a disaster, most any disaster, the company can bring up and run all selected applications in a timely fashion, with a highly skilled support team working closely with them throughout the recovery process.

Information Technology in the Healthcare Sector

Business Challenge:

A 528-bed tertiary care facility in western New York needed to successfully implement an EMR solution. ComSource, along with our partner affiliate, competed with top IT healthcare solution providers and consultants to win this major project that required significant pre-implementation planning, management and support to help deploy the mission critical EPIC software.

The Solution:

Due to timeline sensitivity and federal mandates, ComSource and our partner affiliate were tasked to successfully implement the EPIC software by providing the key services listed below:

  • Planning and implementation pre-planning
  • Systems analysis
  • Change management
  • Screen/report design
  • Tailoring/configuration
  • Integration testing
  • Training
  • Activation planning
  • Post implementation review

Results:

ComSource was able to assist this tertiary care facility in achieving their targeted deadlines and obtaining full funding of the project. The facility realized significant cost savings by choosing our ComSource partner affiliate over other alternative IT healthcare systems integrators. This successful EPIC implementation helped the client attain meaningful use objectives in a cost effective manner. In addition, the doctors and hospitals were able to report required quality measures that demonstrate outcomes, such as:

  • Improved process efficiencies
  • Maximized use of human resources
  • Improved “return on investment” on the technology purchase
  • Employee satisfaction
  • Physician satisfaction
  • Improved clinical quality outcomes
  • Increased case flow
  • Improved profitability
  • Improved patient care and safety

Mobile Technology and Logistical Solutions

Business Challenge:

A leading freight and logistics provider needed to reduce paper use through the full delivery cycle, shorten the time for customers to view and pay for deliveries online, and increase efficiency among drivers, IT support staff and employees completing back-office procedures. The partially paper-based system the company was using created inefficiencies such as data loss, lack of quality control and wasted driver time.

The Solution:

ComSource and our partner affiliate coordinated with all levels of the corporate structure to create a new solution. This interactive process allowed the employees to see how the new processes directly affected their jobs and incorporated their requested features in the new system. A complete mobility solution was implemented to allow the company to automate their entire delivery and collection process in real time. Key elements of this solution include:

  • Drivers were able to scan items both within and outside the depot.
  • Consignments were manifested electronically.
  • “Sign-on glass” allowed the company to collect proof of delivery as well as accept and complete pickups in the field.
  • Information was instantly transferred to back office systems which increased functionality for staff in regard to schedules,
    deliveries, collections and depot operations.
  • Handheld remote mobile hardware and software assets allow support staff to access the device to assist the courier
    as needed. If a device is stolen it can be wiped of any sensitive customer information or corporate data remotely.

Results:

This provider benefited from the new mobility solution in the following ways:

  • Significant cost savings on ongoing maintenance, processing infrastructure and equipment repairs, along with an improved “rate of return.”
  • Improved speed and efficiency in receiving deliveries, creating invoices and meeting the increasing demands of customers.

Information Technology Assessments

Business Challenge:

A nationally recognized retail corporation selected ComSource as a “check and balance” to evaluate the performance of its current network and propose an architectural strategy that was both redundant and secure while requiring less maintenance. This company had a fast-growing retail business and needed to ensure that its environment could support the current rate of growth.

The Solution:

ComSource assessed the current network design with an onsite CCIE engineer and an array of tools. The network design assessment covered:

  • IP Addressing Strategy
  • VLAN Strategy
  • Access Layer Switching Strategy
  • Distribution Layer Switching Strategy
  • Core Layer Switching Strategy
  • Wide Area Network Strategy
  • Internet Access Strategy

The infrastructure assessment covered:

  • Cabling Infrastructure Strategy
  • System Security Strategy
  • Production Network Management Strategy

These assessments led to recommendations from our CCIE engineer, including:

  • Compressing large image files instead of just adding bandwidth.
  • Deploying MPLS for larger sites.
  • Utilizing QoS with VPNs.
  • Establishing manual router IDs in OSPF using loopbacks for stability.
  • Increasing the MTU size on remote sites to cut down on TCP fragmentation.
  • Filtering with a dedicated firewall.
  • Eliminating single points of failure and simplifying cabling by collapsing all switches within the datacenter, excluding top-of-rack switches, into 2 core switches.
  • Deploying a network management solution to take configuration backups of all devices at regular intervals and push out mass configuration changes.

Results:

At the end of the assessment the customer had a clear road map as to how their network should continue to grow effectively in concert with their rapidly growing business enterprise. Strategic implementation of the recommended solutions increased throughput, functionality and security in conjunction with the expanding company.

3rd Party Maintenance and Support, Non OEM

Business Challenge:

A Fortune 1500 privately held cosmetic company was tasked by senior management executives to reduce costs in their data center. Knowing that IT maintenance contracts are subject to frequent annual price increases, often associated with renewals, this company reached out to ComSource for strategies on maintenance cost reduction.

The Solution:

ComSource, along with our trusted and recognized 3rd party maintenance service provider, looked at 2 corporate datacenter locations for this cosmetic company that had expiring IBM and Dell maintenance contracts and was able to help the company save over 40% on support in the first twelve months. Due to the cost savings from just one year of using 3rd party maintenance with ComSource, the company expanded its portfolio: it not only renewed the contracts for the same IBM and Dell equipment, but also added additional IBM, Dell and Brocade equipment to the existing contracts. The service levels provided to this cosmetic company included a 3rd party maintenance coordinator to track expiration dates and adds/deletes, 7x24x365 hardware maintenance, local service depots, call-home, and an online portal for asset management and incident tracking. This online portal allows our customers to see contracts in place with our 3rd party maintenance provider across all platforms and gives the customer the ability to upload maintenance contracts held with other maintenance providers as well.

Results:

ComSource and our 3rd party maintenance provider deliver cost savings across multiple platforms and all major manufacturers. Where a typical OEM increases maintenance costs as equipment ages, we are able to decrease the costs (or maintain them at a lower price point). We work with our customers to keep the equipment on the floor instead of trying to “end of life” the equipment as so many OEMs tend to do. In this specific case, utilizing our 3rd party maintenance solution, this cosmetic company saved approximately 40% on its maintenance contract costs across its expanded IT portfolio.

 Source Link: http://comsourceny.com/resources/case-studies/

Tier 3 data center specifications checklist

This section of our two-part series on tier 3 data center specifications deals with the power supply aspects.

Because the data center is among the most critical parts of a business, an organization needs to ensure the highest possible availability. Building a data center according to tier 3 data center specifications ensures a certain assured level of availability or uptime.

A data center built according to tier 3 data center specifications should satisfy two key requirements: redundancy and concurrent maintainability. It requires at least N+1 redundancy as well as concurrent maintainability for all power and cooling components and distribution systems. The unavailability of a component due to failure (or maintenance) should not affect the infrastructure’s normal functioning.

These specifications apply only to the power, cooling and building infrastructure, up to the server rack level. Tier 3 data center specifications do not specify requirements at the IT architecture level. By following the stages below, your data center’s power supply infrastructure can meet the tier 3 data center specifications.

Stage 1: Power supply from utility service provider

The Uptime Institute regards electricity from utility service providers as an unreliable source of power. Therefore, tier 3 data center specifications require that the data center should have diesel generators as a backup for the utility power supply.

An automatic transfer switch (ATS) automatically switches over to the backup generator if the utility power supply goes down. While many organizations have just a single ATS connecting a backup generator and the power supply from the utility service provider, the tier 3 data center specifications mandate two ATSs connected in parallel to ensure redundancy and concurrent maintainability. The specifications, however, don’t call for the two ATSs to be powered by different utility service providers.

Stage 2: Backup generators

Tier 3 data center specifications require the diesel generators to have a minimum of 12 hours of fuel supply as reserves. Redundancy can be achieved by having two tanks, each with 12 hours of fuel. In this case, concurrent maintainability can be ensured using two or more fuel pipes for the tanks. Fuel pipes can then be maintained without affecting flow of fuel to the generators.
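To make the arithmetic concrete, here is a minimal tank-sizing sketch in Python. The 500 kW rating and the fuel-burn rate are illustrative assumptions, not figures from the specification; substitute the consumption data from your genset’s datasheet.

    # Sketch: sizing each redundant diesel tank for the tier 3 twelve-hour
    # reserve. The consumption figure is an assumed round number, not vendor data.

    def required_fuel_litres(gen_kw: float, litres_per_kwh: float, hours: float) -> float:
        # Fuel needed to run a generator at full load for the given hours.
        return gen_kw * litres_per_kwh * hours

    # Assumption: a 500 kW genset burning roughly 0.28 L per kWh at full load.
    per_tank = required_fuel_litres(gen_kw=500, litres_per_kwh=0.28, hours=12)
    print(f"Each redundant tank needs ~{per_tank:,.0f} L of diesel")  # ~1,680 L

With two such tanks for redundancy and two or more fuel pipes per tank, the concurrent maintainability requirement above is also covered.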

Stage 3: Power distribution panel

The power distribution panel distributes power to the IT load (such as servers and networks) via the UPS. It also provides power for non-IT loads (air conditioning and other infrastructure systems).

Redundancy and concurrent maintainability can be achieved using separate power distribution panels for each ATS, because connecting two ATSs to a single panel would necessitate bringing down both ATS units during panel maintenance or replacement. The tier 3 data center specifications also require two or more power lines between each ATS and power distribution panel to ensure redundancy and concurrent maintainability. Similarly, each power distribution panel and UPS should be connected by two or more lines for the same purpose.

Stage 4: UPS

Power from the distribution panel is used by the UPS and supplied to the power distribution boxes for server racks as well as network infrastructure. For example, if 20 kVA of UPS capacity is required for a data center, redundancy can be achieved by deploying two 20 kVA units or four 7 kVA units. Redundancy can even be achieved with five 5 kVA UPS units.
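As a quick sanity check on such combinations, the sketch below (plain Python; the 20 kVA load comes from the example above) tests whether a set of identical UPS modules still carries the load when any single module has failed or is out for maintenance.

    # Check that a set of UPS modules still carries the load with one module out (N+1).

    def survives_single_failure(unit_kva: float, count: int, load_kva: float) -> bool:
        # Capacity remaining after one module fails must still cover the load.
        return (count - 1) * unit_kva >= load_kva

    for unit, count in [(20, 2), (7, 4), (5, 5)]:
        ok = survives_single_failure(unit, count, load_kva=20)
        print(f"{count} x {unit} kVA -> N+1 for a 20 kVA load: {ok}")
    # 2 x 20 -> True; 4 x 7 -> True (21 kVA remain); 5 x 5 -> True (20 kVA remain)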

The tier 3 data center specifications require that each UPS be connected to just a single distribution box for redundancy and concurrent maintainability. This ensures that only a single power distribution circuit goes down, in case of a UPS failure or maintenance.

Stage 5: Server racks

Each server rack must have two power distribution boxes in order to conform to tier 3 data center specifications. The servers in each rack should have dual power supply features so that they can connect to the power distribution boxes.

A static transfer switch can be used for devices that lack dual power supply features. This switch takes supply from both power distribution boxes and provides a single output. In case of a failure, the static switch can transfer from one power distribution box to the other within a few milliseconds.

About the author: Mahalingam Ramasamy is the managing director of 4T technology consulting, a company specializing in data center design, implementation and certification. He is an accredited tier designer (ATD) from The Uptime Institute, USA and the first one from India to get this certification.

Redundancy: N+1, N+2 vs. 2N vs. 2N+1

A typical engineering definition of redundancy is: “the duplication of critical components or functions of a system with the intention of increasing reliability of the system, usually in the case of a backup or fail-safe.” When it comes to datacenters, the need for redundancy focuses on how much extra or spare power the data center can offer its customers as a backup during a power outage. Unexpected power outages are by far the most common cause of datacenter downtime.*

 

Photo courtesy of the Ponemon Institute

According to the industry-leading Ponemon Institute’s 2013 Study on Data Center Outages (or “downtime”, a four-letter word in the data center industry), which surveyed 584 individuals in U.S. organizations who have responsibility for data center operations in some capacity, from the “rank and file” to the C-level, 85% of participants report their organizations experienced a loss of primary utility power in the past 24 months. Of that 85%, 91% reported their organizations had an unplanned outage. That means most data centers experienced downtime in the last 24 months. During these outages respondents averaged two complete data center shutdowns, with an average downtime of 91 minutes per failure.

The entire study also speaks of the implementation and the impact of DCIM (Data Center Infrastructure Management) – and how it was used to fix or correct the root cause of the outages.*

The most common outages are weather-related, but they can also stem from simple equipment failure or even a power line accidentally cut by a backhoe. No matter the cause, an unplanned outage can cost a company a lot of money, especially if its revenues depend on internet sales.

For example, if you’re Amazon and you go down, you lose a mind-blowing amount of money: an estimated $1,104 in sales for every second of downtime. The “average” U.S. data center loses $138,000 for one hour of downtime per year. Applying Ponemon’s 91-minute average downtime per year, that works out to an approximate loss of $207,000 for each organization depending on the data center.
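The arithmetic behind that estimate is simple enough to show directly. The figures below are the ones quoted above; the small gap between the computed value and the cited ~$207,000 is rounding in the source.

    # Worked version of the downtime figures quoted above.
    hourly_cost = 138_000      # average cost of one hour of downtime (USD)
    avg_downtime_min = 91      # Ponemon average unplanned downtime per year (minutes)

    annual_loss = hourly_cost / 60 * avg_downtime_min
    print(f"~${annual_loss:,.0f} per year")  # ~$209,300, close to the ~$207,000 cited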

What’s this all mean? Downtime matters, and downtime prevention matters, so redundancy matters.

Preferably, large businesses and corporations have their servers set up at either Tier 3 or Tier 4 data centers because they offer a sufficient amount of redundancy in case of an unforeseen power outage. With this in mind, not all data centers’ redundant power systems are created equal. Some offer N+1, 2N, or 2N+1 redundancy systems.

What’s the Difference Between N+1, 2N and 2N+1?

The simple way to look at N+1 is to think of it in terms of throwing a birthday party for your child or yourself, because who doesn’t love cupcakes? Say you have ten guests and need ten cupcakes, but just in case that “unexpected” guest shows up, you order eleven cupcakes. “N” represents the exact number of cupcakes you need, and the extra cupcake represents the +1. Therefore you have N+1 cupcakes for the party. In the world of datacenters, an N+1 system, also called parallel redundancy, is a safeguard to ensure that an uninterruptible power supply (UPS) system is always available. N stands for the number of UPS modules required to handle an adequate supply of power for essential connected systems; N+1 adds one more, so 11 cupcakes for 10 people, and less chance of downtime.

Although an N+1 system contains redundant equipment, it is not a fully redundant system and can still fail, because the system runs on common circuitry or feeds at one or more points rather than on two completely separate feeds.

Back at the birthday party! If you plan a birthday party with a 2N redundancy system in place, you would have the ten cupcakes you need for the ten guests, plus an additional ten cupcakes, so 20 cupcakes. 2N is simply two times, or double, the amount you need. At a data center, a 2N system contains double the amount of equipment needed, run separately with no single points of failure. These 2N systems are far more reliable than an N+1 system because they offer a fully redundant system that can be maintained on a regular basis without losing any power to the systems it supports. In the event of an extended power outage, a 2N system will still keep things up and running. Some data centers offer 2N+1, which is double the amount needed plus one extra piece of equipment, so back at the party you’ll have 21 cupcakes: 2 per guest and 1 spare for you!
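The cupcake arithmetic reduces to a one-line formula per scheme. A tiny sketch, where n is the number of units the load actually requires:

    # Units to provision under each redundancy scheme, for a load needing n units.
    def provisioned_units(n: int, scheme: str) -> int:
        return {"N": n, "N+1": n + 1, "2N": 2 * n, "2N+1": 2 * n + 1}[scheme]

    for scheme in ("N", "N+1", "2N", "2N+1"):
        print(scheme, provisioned_units(10, scheme))  # 10, 11, 20, 21: the party counts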

For more information on redundancy, N+1, 2N, 2N+1 and the differences between them, as well as the different tier levels offered by datacenters around the world, visit www.datacenters.com or call us at (877) 406-2248.

*Source: Ponemon Institute 2013 Study on Data Center Outages, sponsored by Emerson Network Power. The full study is also an interesting read on how data center employees view their data center’s structure and superiors.

Survey: UPS Issues Are Top Cause of Outages

This chart shows the perception gap between the executive suite and data center staff on key issues.

Problems with UPS equipment and configuration are the most frequently cited cause of data center outages, according to a survey of more than 450 data center professionals. The survey by the Ponemon Institute, which was sponsored by Emerson Network Power, also highlights a disconnect between data center staff and the executive suite about uptime readiness.

The National Survey on Data Center Outages polled 453 individuals in U.S. organizations who have responsibility for data center operations; they were asked about the frequency and root causes of unplanned data center outages, as well as corporate efforts to avert downtime. Ninety-five percent of participants reported an unplanned data center outage in the past two years, with most citing inadequate practices and investments as factors in the downtime.

Here are the most frequently cited causes for downtime:

  • UPS battery failure (65 percent)
  • Exceeding UPS capacity (53 percent)
  • Accidental emergency power off (EPO)/human error (51 percent)
  • UPS equipment failure (49 percent)

There were signs that the ongoing focus on cost containment was being felt in the data center. Fifty-nine percent of respondents agreed with the statement that “the risk of an unplanned outage increased as a result of cost constraints inside our data center.”

“As computing demands and energy costs continue to rise amidst shrinking IT budgets, companies are seeking tactics – like cutting energy consumption – to cut costs inside the data center,” said Peter Panfil, vice president and general manager, Emerson Network Power’s AC Power business in North America. “This has led to an increased risk of unplanned downtime, with companies not fully realizing the impact these outages have on their operations.”

Perception Gap
The focus on UPS issues isn’t unexpected, given the role of uninterruptible power supplies in data center power infrastructure. It’s also consistent with Emerson’s position as a leading vendor of UPS equipment. But the survey by Ponemon, which is known for its surveys on security and privacy, also points to a perception gap between senior-level and rank-and-file respondents regarding data center outages.

Sixty percent of senior-level respondents feel senior management fully supports efforts to prevent and manage unplanned outages, compared to just 40 percent of supervisor-level employees and below. Senior-level and rank-and-file respondents also disagreed regarding how frequently their facilities experienced downtime, with 56 percent of senior executives believing unplanned outages are infrequent, while just 45 percent of rank-and-file respondents agreed with the same statement.

“When you consider that downtime can potentially cost data centers thousands of dollars per minute, our survey shows a serious disconnect between senior-level employees and those in the data center trenches,” said Larry Ponemon, Ph.D., chairman and founder of the Ponemon Institute. “This sets up a challenge for data center management to justify to senior leadership the need to implement data center systems and best practices that increase availability and ensure the functioning of mission-critical applications. It’s imperative that these two groups be on the same page in terms of the severity of the problem and potential solutions.”

Source: http://www.datacenterknowledge.com/archives/2010/10/13/survey-ups-issues-are-top-cause-of-outages/

 

Data Center Generators

Generators are a key to data center reliability. Supplementing a battery-based uninterruptible power supply (UPS) with an emergency generator should be considered by all data center operators. The question has become increasingly important as superstorms such as Hurricane Sandy in the Northeastern United States have knocked out utility power stations and downed power lines, resulting in days or weeks of utility power loss.

(Image: Data Center Generator Delivery)

Beyond disaster protection, a backup generator’s role in providing power is important as utility providers plan summer rolling blackouts and brownouts and data center operators see reduced utility service reliability. In a rolling blackout, power to industrial facilities is often shut down first. New data center managers should check the utility contract to see if the data center is subject to such disconnects.

Studies show generators played a role in between 45 and 65 percent of outages in data centers with an N+1 configuration (with one spare backup generator). According to Steve Fairfax, President of MTechnology, “Generators are the most critical systems in the data center.” Mr. Fairfax was the keynote speaker at the 2011 7×24 Exchange Fall Conference in Phoenix, Arizona.

What Should You Consider Before Generator Deployment?

  • Generator Classification / Type. A data center design engineer and the client should determine if the generator will be classified as an Optional Standby power source for the data center, a Code Required Standby power source for the data center, or an Emergency back-up generator that also provides standby power to the data center. (Image: MTU Onsite Energy Gas Generator)

  • Generator Size. When sizing a generator it is critical to consider the total current IT power load as well as expected growth of that IT load. Consideration must also be made for facility supporting infrastructure (i.e. UPS load) requirements. The generator should be sized by an engineer, and specialized sizing software should be utilized; a rough first-pass sketch follows this list.
  • Fuel Type. The most common types of generators are diesel and gas. There are pros and cons to both: diesel fuel deliveries can become an issue during a natural disaster, and gas line feeds can likewise be disrupted. Making the right choice for your data center generator depends on several factors. The fuel type needs to be determined based upon local environmental issues (i.e. Long Island primarily uses natural gas to protect the water aquifer under the island), availability, and the required size of the standby/emergency generator.
  • Deployment Location. Where will the generator be installed? Is it an interior installation or an exterior installation? An exterior installation requires the addition of an enclosure. The enclosure may be just a weather-proof type, or local building codes may require a sound attenuated enclosure. An interior installation will usually require some form of vibration isolation and sound attenuation between the generator and the building structure.
  • Exhaust and Emissions Requirements. Today, most generator installations must meet the new Tier 4 exhaust emissions standards. This may depend upon the location of the installation (i.e. city, suburban, or out in the country). (Image: Cummins Lean-Burn Gas Generator)

  • Required Run-time. The run-time for the generator system needs to be determined so the fuel source can be sized (i.e. the volume of diesel or the natural gas delivery capacity to satisfy run time requirements).
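For orientation only, here is the rough first-pass sizing sketch referenced from the Generator Size item above. The growth horizon, supporting-infrastructure factor and derating are illustrative assumptions; an engineer with specialized sizing software makes the real determination.

    # Rough first-pass generator sizing. All factors below are illustrative assumptions.

    def generator_kw(it_load_kw: float, annual_growth: float, years: int,
                     support_factor: float = 0.8, derate: float = 0.8) -> float:
        future_it = it_load_kw * (1 + annual_growth) ** years  # projected IT load
        total = future_it * (1 + support_factor)               # add UPS, cooling, lighting
        return total / derate                                  # leave engineering headroom

    # Example: 300 kW of IT load growing 10% a year over a 5-year horizon.
    print(f"~{generator_kw(300, annual_growth=0.10, years=5):,.0f} kW")  # ~1,087 kW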

 

What Should You Consider During Generator Deployment?

  • Commissioning. The commissioning of the generator system is basically the load testing of the installation, plus the documentation trail for the selection of the equipment, the shop drawing approval process, the shipping documentation, and receiving and rigging the equipment into place. This process should also include the construction documents for the installation project. (Image: Generac Generator)

  • Load Testing. Typically, a generator system is required to run at full load for at least four (4) hours. It will also be required to demonstrate that it can handle step load changes from 25% of its rated kilowatt capacity to 100% of its rated kilowatt capacity. The best way to load test is with a non-linear load bank whose power factor matches the specification of the generator(s); typically, a non-linear load bank with a power factor between 75% and 85% is utilized.
  • Servicing. The generator(s) should be serviced after the load test and commissioning are completed, prior to release for use.

 

What Should You Consider After Generator Deployment?

  • Service Agreement. The generator owner should have a service agreement with the local generator manufacturer’s representative. (Image: Caterpillar Industrial Diesel Generators)
  • Preventative Maintenance. Preventative Maintenance should be performed at least twice a year. Most generator owners who envision their generator installation as being critical to their business execute a quarterly maintenance program.
  • Monitoring. A building monitoring system should be employed to provide immediate alerts if the generator and ATS systems suffer a failure, or become active because the normal power source has failed. The normal power source is typically from the electric utility company, but it could be an internal feeder breaker inside the facility that has opened and caused an ATS to start the generator(s) in an effort to provide standby power.
  • Regular Testing. The generator should be tested weekly for proper starting, and it should be load tested monthly or quarterly to determine that it will carry the critical load plus the required standby load and any emergency loads that it is intended to support.
  • Maintenance. The generator manufacturer or third party maintenance organization will notify the generator owner when important maintenance milestones are reached, such as minor rebuilds and major overhauls. Run hours generally determine when these milestones are reached, but other factors related to the operational characteristics of the generator(s) also determine what needs to be done and when. (Image: The Bloom Box by Bloom Energy)

PTS Data Center Solutions provides generator sets for power ratings from 150 kW to 2 MW. We can develop the necessary calculations to properly size your requirement and help you with generator selection, procurement, site preparation, rigging, commissioning, and regular maintenance of your generator.

To learn more about PTS recommended data center generators, or about the PTS Data Center Solutions available to support your Data Center Electrical Equipment & Systems needs, contact us.

Link Source: http://computer-room-design.com/strategic-data-center solutions/electricalequipmentandsystems/data-center-generators/

Disaster Recovery Planning in IT Management

Disaster recovery planning is the mechanism by which technology-related disasters are anticipated and addressed. Just what is a “technology related disaster”? Oddly enough, the first challenge in the planning process is to quantify the meaning of the term in the IT management context.

In IT, a disaster can be any unexpected problem that results in a slowdown, interruption or failure in a key system or network. These problems can be caused by natural disasters (i.e. fire, earthquake, hurricane), technology failures, malicious acts, incompatibilities, or simple human error. Whatever the cause, service outages, connectivity failures, data loss, and related technical issues can disrupt business operations, causing lost revenues, increased expenses, customer service problems, and lowered workplace productivity. IT disaster recovery planning strategies must be created to respond to these varied realities and perceptions. To that end, these strategies must address three (3) basic needs:

  • Prevention (to avoid and minimize disaster frequency and occurrence).
  • Anticipation (to identify likely disasters and related consequences).
  • Mitigation (to take steps for managing disasters to minimize negative impact).

Action Item: It’s time to get your disaster recovery plans underway with the steps and techniques provided in our full IT Service Strategy Toolkit.

Fundamental Planning Goals and Objectives

There is no doubt that technology can offer many benefits to a business. Once you acknowledge the value of technology to your organization, you must also consider the related consequences if and when that technology becomes temporarily unavailable, or totally inaccessible. Your ability and willingness to address these issues can offer several key operational benefits:

  • To minimize the negative impact of any disaster.
  • To save time and money in the recovery process in the event of a disaster.
  • To provide for an orderly recovery process, reducing “panic” decision making.
  • To protect technology assets owned by a business, maximizing ROI.
  • To minimize legal or regulatory liabilities.
  • To promote systems and IT service quality, reliability and security.
  • To promote the value of technology and related IT services within your organization.
  • To promote management awareness, and to set realistic expectations about the need for systems management tools and resources.

Disaster Recovery Planning in Practice

In the IT management context, there are many levels to defining “disaster” and multiple options to address each level. To make things easier, the broad view of disaster recovery can be broken down into three (3) primary planning options: prevention, anticipation and mitigation.

Prevention:  Avoiding Disaster Events to the Extent Possible

The goal of “preventative” disaster recovery planning is to ensure that all key systems are as secure and reliable as possible, in order to reduce the frequency or likelihood of “technology related disasters”. Since natural disasters usually lie outside our sphere of influence, prevention most often applies to systems problems and human errors, including physical hardware failures, software bugs, configuration errors and omissions, and acts of malicious intent (virus attacks, security violations, data corruption). Using the right set of tools and techniques, it is possible to reduce both the occurrence of, and the damage caused by, these sorts of “disasters”.

Anticipation:  Planning for the Most Likely Events

Anticipation strategies revolve around “assumptions”: the ability to foresee possible disasters in order to identify likely consequences and appropriate responses. Without a crystal ball, contingency planning can be a challenging process. It involves knowledge and careful analysis. Knowledge is derived from experience and information: understanding the systems you have, how they are configured, and what sorts of problems or failures are likely to occur. The related analysis involves a careful balancing of circumstances and consequences.

Mitigation:  Get Ready to React and Recover

Mitigation is all about “reaction and recovery”: the ability to respond when and if a disaster occurs. Accepting that certain disasters are unavoidable, and perhaps inevitable, the goal of any mitigation strategy is to minimize the negative impact. Typical mitigation steps include:

  1. Maintain current technical documentation to facilitate recovery should a problem occur.
  2. Conduct regular tests of your disaster recovery plans and strategies.
  3. Keep loaner equipment available for immediate use.
  4. Create regular back-ups of applications, data and hardware configurations (see the sketch after this list).
  5. Maintain an “alternative workplace plan” to allow designated staff to work from home or other locations.
  6. Identify manual or standalone operating procedures in the event of a prolonged outage.
  7. Coordinate IT disaster recovery plans with other corresponding emergency, security and employee safety programs/policies.
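As a minimal illustration of item 4, the sketch below archives a couple of configuration directories into a timestamped tarball. The paths and destination are hypothetical placeholders; real backup tooling adds rotation, off-site copies and restore testing.

    # Minimal config-backup sketch: a timestamped archive of assumed paths.
    import tarfile
    import time
    from pathlib import Path

    SOURCES = ["/etc", "/opt/app/config"]   # hypothetical configuration locations
    DEST = Path("/var/backups")

    def snapshot() -> Path:
        DEST.mkdir(parents=True, exist_ok=True)
        archive = DEST / f"config-{time.strftime('%Y%m%d-%H%M%S')}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            for src in SOURCES:
                tar.add(src, arcname=Path(src).name)
        return archive

    if __name__ == "__main__":
        print(f"Wrote {snapshot()}")  # schedule via cron to make the backups regular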

Sources: http://infochief.com.vn/ & http://it-toolkits.org/

Open source systems management tools

If your IT shop has the right skills, open source systems management tools may be a fit for your data center and save money over proprietary solutions. These slides showcase some of the top tools.

Large IT organizations turn to open source systems management tools

Top areas where open source systems management tools are used

Usenix, a systems administration user group, and Zenoss, an open source systems management vendor, recently completed a survey on open source systems management software use between 2006 and 2009. Respondents were attendees of the organization’s Large Installation System Administrators conference. Nearly all respondents use or plan to use open source systems management tools, with many shops turning to Nagios, Cacti, Zabbix, GroundWork and the OpenNMS project. When asked “What are the top areas where you plan to use open source systems management tools?” 90% answered monitoring, around 60% said configuration and around 50% said patch management.

The benefits of open source systems management

Top reason for using open source software

When asked the question “Why did you or would you be likely to try open source software?” responding shops said that they have turned to open source systems management tools to reduce costs and increase flexibility. Easy deployment was also a top reason for trying open source. In 2006, only 26% of survey respondents indicated this as a reason for using open source; in 2009, however, 71% of all respondents indicated this as a reason. This finding may indicate that open source not only removes technical hurdles but also preempts some of the bureaucratic obstacles associated with the traditional technology procurement process. “Open source offerings are newer and often written to be easier to deploy than older systems,” said Michael Coté, an analyst at RedMonk, an industry analyst firm. “An admin can download and install it without asking for funding, agreeing to any terms for a trial or filling out registration forms. Being able to download a piece of software by right-clicking is going to be easier than most other acquisition paths.”

The drawbacks to open source systems management

Top reasons for not using open source

So what are the primary reasons IT shops would not use open source tools? Lack of support was the main culprit, and users said proprietary tools had better support and product maturity as well as less risk. “You get the support you pay for,” Coté said. “If you don’t want to pay anything, just download Nagios, OpenNMS or Zenoss Core and go at it alone. You’ll be paying in your time: time to ask questions in forums and wait for answers, time to look through existing write-ups on the Web, and, if you’re of the right kind of mind, time to look through the code yourself. Closed-source offerings can seem to have more support available because you’re required to buy support.”

Ed Bailey, a Unix team lead at a major credit reporting agency, uses the proprietary version Hyperic HQ Enterprise to manage Web applications that drive his company’s revenue. Bailey said he doesn’t have the time to cobble together — let alone develop and maintain — the automation, security and reporting features that ship with the enterprise version. “You can make a reporting system for the open source version of Hyperic HQ. If you have the time, you can make anything. But our company is more focused on things that generate revenue rather than me spending time working on this,” Bailey said. “I used to work at a university and we had time to build something like that, whereas now we have millions of transactions that are making money.”

Special skills to use open source systems management tools?

What skill set do sys admins need to have to deploy systems management software successfully in an IT organization? “Any scripting experience in general is helpful,” said Ryan Matte, a data center admin at Nova Networks Inc. “Basic Python knowledge is very helpful when using Zenoss. I often use Bash scripting as well. A decent understanding of SNMP [Simple Network Management Protocol] is definitely required (since the open source products don’t tend to be as automated as the enterprise products). I often find myself developing custom SNMP monitoring templates for devices, [but] … you should have an understanding of whatever protocols you are working with. An understanding of Linux/BSD [Berkeley Software Distribution] is helpful as well since most of the open source monitoring products that I’ve seen only run on Linux/BSD.”
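As a small example of the glue scripting Matte describes, the sketch below polls a device over SNMP by shelling out to the net-snmp snmpget utility. The host address and community string are placeholders, and a pure-Python library such as pysnmp is an alternative if shelling out is undesirable.

    # Poll one SNMP OID using the net-snmp command-line tool `snmpget`.
    import subprocess

    def snmp_get(host: str, oid: str, community: str = "public") -> str:
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", community, "-Ovq", host, oid],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()  # -Ovq prints just the value

    # sysUpTime from MIB-2: a common first sanity check for a new template.
    print(snmp_get("192.0.2.10", "1.3.6.1.2.1.1.3.0"))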

Virtualization driving proprietary management tool dominance

% of respondents who cite that product features have become the more important advantage of proprietary software

Starting in 2009, a much larger percentage of data center managers indicated proprietary systems management software has an advantage over open source tools in advanced product features. In 2009, 33% of all respondents indicated that product features played a bigger part in defining the advantages of commercial tools, versus 10% in the previous year. Though not explicitly spelled out in the survey, you can translate product features to “virtualization management features.” Matte is using Zenoss’ open source offering, Zenoss Core, and said he has evaluated Zenoss’ proprietary enterprise ZenPacks, which have virtual machine management features. “I have taken a look at the enterprise ZenPacks, and there is nothing like the VMware [Management] Pack in the open source community,” Matte said.

Open source systems management profile: Spacewalk

Spacewalk

Spacewalk is an open source Linux systems management tool and the upstream community project from which the Red Hat Network Satellite product is derived. Spacewalk provides provisioning and monitoring capabilities as well as software content management. James Hogarth, a data center admin in the U.K., uses Spacewalk to manage 100 hosts in a CentOS-based environment for an entertainment website built on Grails. Hogarth said his company’s entire environment is focused on open source software (it is even migrating server virtualization from VMware to the Red Hat Kernel-based Virtual Machine, or KVM, hypervisor), and that open source focus was a major factor in the decision to use open source systems management tools.

Hogarth said he’s run into some gotchas and issues that needed a workaround, but overall Spacewalk has lightened his support workload. Most of the development is done by Red Hat personnel, and the developers are often available to answer questions and troubleshoot issues. “People are very responsive [on the support forum], and it’s relatively rare that you don’t get a response,” Hogarth said. “Over the last two years, the product has really matured.”

Open source data center automation and configuration tools

Puppet is one option

In the open source space, Cfengine and Puppet are leading data center automation and configuration tools. In 1993, Mark Burgess at Oslo University College wrote Cfengine, which can be used to build, deploy, manage and audit all the major operating systems. Cfengine boasts some large customers, including companies such as eBay and Google. Cfengine offers a proprietary commercial version called Cfengine Nova. As an open source-only product, Puppet takes a different approach, and its creators, Puppet Labs, make money through training and support. Puppet founder Andrew Schafer, for example, wrote a column on Puppet and how it works. Also, James Turnbull recently wrote a book on using Puppet in the data center. Turnbull has also written tips on Puppet, including the recent article on using the Puppet dashboard. The Oregon State University Open Source Laboratory uses Cfengine for systems management but planned to move to Puppet. “From a technical point of view, Puppet offers more flexibility and an ability to actually use real code to deal with tasks. Cfengine has its own syntax language, but it’s not really suited for complex tasks,” said OSUOSL administrator Lance Albertson in an interview earlier this year.

Open core versus open source software

Some companies offer what’s considered “open core” systems management software. At the base level is a functional, free open source tool (like Zenoss Core or Hyperic HQ), and there is a separate proprietary enterprise version with special add-ons and features. This business model rankles some open source advocates, but it offers companies the chance to use a tool risk free, and oftentimes organizations can make the free version work. Ryan Matte, a data center admin at Ottawa, Ontario-based Nova Networks Inc., uses Zenoss Core to manage more than 1,000 devices, monitoring Windows, Linux, Solaris and network devices. Matte considered Nagios, Zabbix, and OpenNMS. “In terms of ease of use and setup and having all the monitoring capabilities in the product, Zenoss was the best choice,” he said. “There’s an IRC channel chat room — I’m in there quite a bit. There are always people in there. The [community] support is pretty good, but you have to come in during business hours.”

Using Webmin for data center server management

Webmin

Webmin offers a browser-based interface to Unix and Linux operating systems. It can configure users, disk quotas, services and configuration files, as well as modify and control open source apps.

Using Nagios in the data center to manage servers

Nagios

In many data center environments, Nagios has become the de facto standard for companies in need of an open source, fault-tolerant solution to monitor single points of failure, service-level agreement shortcomings, servers, redundant communication connections or environmental factors. But is this one-size-fits-all open source tool best suited to your data center?
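Part of Nagios’s appeal is its simple plugin contract: print one line of status and exit 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Below is a minimal custom check in Python along those lines; the monitored path and thresholds are placeholders for whatever you actually watch.

    #!/usr/bin/env python3
    # Minimal Nagios-style check: disk usage with standard plugin exit codes.
    import shutil
    import sys

    WARN, CRIT = 80.0, 90.0   # percent-used thresholds (assumed)
    PATH = "/"                # filesystem to check (assumed)

    usage = shutil.disk_usage(PATH)
    pct = usage.used / usage.total * 100

    if pct >= CRIT:
        print(f"CRITICAL - {PATH} at {pct:.1f}% used")
        sys.exit(2)
    if pct >= WARN:
        print(f"WARNING - {PATH} at {pct:.1f}% used")
        sys.exit(1)
    print(f"OK - {PATH} at {pct:.1f}% used")
    sys.exit(0)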

Best Practices for data center monitoring and server room monitoring

1. Rack Level Monitoring
Based on a recent Gartner study, the annual cost of a Wintel rack averages around $70,000 USD per year, excluding the business cost of the rack. Risking business continuity or your infrastructure due to environmental issues is not an option. So what are the environmental threats at rack level?

A mistake often made is to rely only on monitoring conditions at room level and not at rack level. The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) recommends no fewer than 6 temperature sensors per rack in order to safeguard the equipment (top, middle and bottom, at the front and back of the rack). When a heat issue arises, air conditioning units will initially try to compensate for the problem. This means that with room-level temperature monitoring only, the issue will be detected only once the running air conditioning units are no longer capable of compensating for the heat problem. By then it may be too late.

We recommend monitoring temperature per rack at a minimum of 3 points: at the bottom front of the rack, to verify the temperature of the cold air arriving at the rack (combined with airflow monitoring); at the top front of the rack, to verify that all cold air gets to the top of the rack; and at the top back of the rack, which is typically the hottest point of the rack. Intake temperature should be between 18-27°C / 64-80°F. Outtake temperature should typically be no more than 20°C / 35°F above the intake temperature.

What is the impact of temperature on your systems? High-end systems have auto-shutdown capabilities to safeguard themselves when temperature gets too high. Before this happens, however, systems will experience computation errors at the CPU level, resulting in application errors; system cooling (fans) will also be stressed, reducing equipment life expectancy (and with it availability and business continuity).
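Those thresholds translate directly into an alerting rule. A minimal sketch, assuming one intake and one outtake reading per rack in °C:

    # Evaluate rack sensor readings against the guidance above:
    # intake 18-27°C, outtake at most ~20°C above intake.

    def check_rack(intake_c: float, outtake_c: float) -> list:
        alerts = []
        if not 18 <= intake_c <= 27:
            alerts.append(f"intake {intake_c}°C outside the 18-27°C band")
        if outtake_c - intake_c > 20:
            alerts.append(f"outtake delta {outtake_c - intake_c:.0f}°C exceeds 20°C")
        return alerts

    print(check_rack(24, 46) or "OK")  # delta is 22°C, so this alerts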

2. Ambient Room Monitoring
Ambient room monitoring is the environmental monitoring of the room for its humidity and temperature levels. Temperature and humidity sensors are typically deployed in:

  • potential “hot zones” inside the server room or data center
  • near air conditioning units, to detect failure of such systems.

When multiple air conditioning systems are available in a room, a failure of one system will initially be compensated by the others, before it leads to a total failure of the cooling system due to overload. As a result, temperature/airflow sensors are recommended near each unit for early failure detection.

Humidity in server rooms should be between 40% and 60% rH. Too dry, and static electricity will build up on the systems. Too humid, and corrosion will slowly damage your equipment, resulting in permanent equipment failures.

When using cold corridors inside the data center, ambient temperature outside the corridor may run at higher levels; temperatures of 37°C / 99°F are not uncommon in such setups, which significantly reduces energy costs. However, this also means that temperature monitoring is of the utmost importance, as a failing air conditioning unit will affect system lifetime and availability much faster (fan stress, CPU overheating, etc.), and running a room at higher temperatures may also affect non-rack-mounted equipment.

When using hot corridors, it is important to monitor temperature across the room to ensure that sufficient cold air gets to each rack. In this case, one can also rely on rack-based temperature sensors in addition to temperature and humidity sensors close to each air conditioning unit.

3. Water & Flooding Monitoring
Water leakage is a lesser-known threat for server rooms and data centers. The fact that most data centers and server rooms have raised floors makes the risk even bigger, as water seeks the lowest point.

Two types of water leakage sensors are commonly found: spot sensors and water snake cable sensors. Spot sensors trigger an alert when water touches the unit. Water rope or water snake cable sensors use a conductive cable, whereby contact at any point along the cable triggers an alert. The latter type is recommended over the former due to its greater range and accuracy.

If using a raised floor, consider putting the sensors under the raised floor, as water seeks the lowest point.

The four main sources of water in a server room are:

  • Leaking air conditioning systems: a water sensor should be placed under each AC unit.
  • Water leaks in floors or the roof above the data center or server room: water sensors should be placed around the perimeter of the room, at around 50cm / 20in from the outer walls.
  • Leaks from water pipes running through server rooms: a water sensor should be placed under the raised floors.
  • Traditional flooding: the same guidance as for leaks from roofs or floors above applies.

4. Sensors Deployment
All sensors connect to our Sensorgateway (base unit). A base unit supports up to 2 wired sensors, or up to 8 with the optional sensor hub. The table below summarizes the recommended sensor placement, settings and packages.
Application | Location | Setting | SKU | Sensor Package

Rack Level Monitoring
  • Sensors to monitor intake temperature | Front – bottom of rack for room or floor cooling, top of rack for top cooling | 18-27°C / 64-80°F | 182668 | Temperature probes*
  • Sensors to monitor outtake temperature | Back – top of rack (hot air climbs) | Less than 20°C / 35°F above inlet temperature (typically <40°C / 105°F) | 182668 | Temperature probes*

Ambient Monitoring
  • Temperature & humidity monitoring in server room | Small server rooms: center of the room; data centers: potential hot zones, furthest from airco units | Temperature depends on type of room setup; humidity: 40-60% rH | 306166 | Temperature & Humidity Sensor Probe*

Air Conditioning Monitoring
  • Early detection of failing air conditioning units | Next to airco units | Temperature depends on airco setting; humidity: 40-60% rH | 306166 | Temperature & Humidity Sensor Probe*

Water Leaks / Flooding
  • Detecting water leaks coming from outside the room | Around outside walls of the server room / data center and under the raised floor; best is to keep 30-50cm / 10-20" from the outer wall | n/a | 180004 | Flooding Sensor Probe* with 6m/20ft water-sensitive cable
  • Detecting water leaks from air conditioning units | Under each air conditioning unit | n/a | 180004 | Flooding Sensor Probe* with 6m/20ft water-sensitive cable

* External probes need to be connected to a Sensorgateway (SKU 311323) in order to operate. One Sensorgateway has a built-in temperature probe and can support up to 2 external probes.

Source from: https://serverscheck.com/sensors/temperature_best_practices.asp