A glimpse into the life of IT

January 5, 2012

VDI for me. Part 1

Filed under: VDI,virtualization,VMware View — ITforMe @ 3:51 pm

 

It doesn’t take long for those in IT to see the letters “VDI” show up on the radar.  A morning digest of RSS or Twitter feeds might be all it takes.  But the subject of Desktop Virtualization can be a difficult puzzle to piece together if you haven’t already invested in it.  There is a lot to sort through, even if you are contemplating just a small pilot project.  VDI deployments can be a highly visible IT project (in user expectations, experiences, and financial commitment), so if you are contemplating a pilot project, you’ll want to stack the odds of success in your favor, and make that first impression a good one.  As I embarked on my own VDI project, I noticed that while there was much to be found on the net regarding VDI in general, one thing seemed remarkably absent: information on how pilot projects are approached, and actually deployed.  My posts won’t be dissecting any CAPEX or OPEX numbers, nor will they attempt to prove that VDI is the perfect solution for every scenario.  Rather, I want to give you a glimpse into what a VDI pilot project actually looks like.

As I mentioned in previous posts, my recent efforts to upgrade my hypervisor to vSphere 5 were for one real purpose.  I wanted to deploy a Proof of Concept (PoC) for VDI in our environment using VMware View 5.  The company I work for makes industry leading data visualization and simulation analytics software.  Our customers, who are often scientists and engineers, need to understand simulation results to make smart design decisions.  The software is typically installed on a user’s local workstation (Windows, Linux, Mac, or Unix) and interacts with data that is either on the network, or if performance dictates, their local system.  The nature of the problems being solved, along with our software, can demand high computational horsepower with high-end graphics.  Many in the CAE/CAD industries are familiar with the demands of the local workstation performing the work.  But there are some scenarios with VDI that make it quite compelling that are not brought up very often.  We wanted to understand those benefits better.

Much of the interest internally transpired from a technology lunch-and-learn session I gave to our staff.  My motives for doing such a thing were two-fold.  1.)  I wanted everyone interested (not just key stakeholders, but software developers, technical support, etc.) to be aware of some things I knew they weren’t aware of, and 2.)  I thought it would be fun.  I covered anything from how virtualization has changed the Datacenter, to demonstrating how common assumptions about our own internal systems, as well as our customers, may be highly misguided based off of new technologies.  While the focus was the Datacenter, I translated much of that into how these changes might affect end users.  It was a tremendous success, and even required an encore presentation for those who missed it.  This piqued the interest of the key stakeholders, as they understood it could have a significant impact on our product roadmap.  Combine this with the impressive “technology previews” of things like “AppBlast” and “Octopus” at the 2011 VMworld, as well as recent partnerships announced by VMware and NVIDIA for providing hardware based GPU acceleration to hypervisors, and we had something we needed to look into.

Objectives 
Due to the type of software we make, we had some unique use cases that we not only wanted to understand better ourselves, but perhaps provide our customers with solutions on a better way to do things.  I had several objectives I wanted to achieve.

  • Understand how we could improve our own internal efficiencies and the toolset we provide to our staff.  We have taken a fairly aggressive approach already to provide ubiquitous access to everyone in our office (tablets for everyone, and their use encouraged in meetings, etc.).  We also have a great team of software developers on another continent, and are always looking for new ways to integrate them with our systems.  We needed to look at tools that not only help that, but help mitigate the operational expenses and security challenges of traditional workstations to end users.  These seem to be the most common reasons cited when organizations are looking into VDI.
  • Demonstrate to our customers possibly “a better way” to work.  Believe it or not, many consumers of applications that have large data still either download the data locally, then work on it, or rely on local high speed networks to transmit the data.  What happens if the data, and the system used to present the data lived in the Datacenter?  These days, under a virtualized infrastructure, network communications may never even see an Ethernet cable or a physical switch, and the results can be impressive.  This is the benefit that most of us have enjoyed while virtualizing our server infrastructure.  So if the new type of client server arrangement only had to deal with transmitting screen data, then a world of possibilities opens up.
  • Demonstrate our software.  We rely heavily on feedback from new and existing customers on our products, and we need them to try our software.  If your customer base runs on extremely sensitive or secure networks, you’ll find that a pattern emerges.  They won’t let you install trial or pre-release software on any of their machines.  We’ve resorted to some pretty crude methods to help them test our software (shipping some old laptops), but obviously, this doesn’t scale well, and frankly, doesn’t give a good impression for a company who strives for innovation.  
  • Understand new display protocols, and how this could impact how datacenters are constructed.  Remote displays are a big concern in the visualization, CAE, and CAD worlds.  Years of dealing with various display protocols have resulted in similar experiences.  Most share the trait of using connection oriented protocols to transmit rendering instructions.  They tend to suffer on connections beyond the LAN that may have high latency and packet loss.  There are new approaches to help change this.

GPUs, Remote Displays, and Client Server
Most heavy hitting graphics software packages have historically relied HEAVILY on GPU power provided by a real video card.  This has been under the assumption that they’d always be there, and that GPUs are incredibly powerful.  Well, as you can imagine, that doesn’t play to the strengths of traditional virtualization, where video has barely been an afterthought (at least so far).  Not a big deal when you were virtualizing your server infrastructure, but a pretty big deal when you are trying to virtualize desktops.  Task workers may never notice the difference, but users of high end graphics will.

Remote display protocols come in a variety of forms.  Some are definitely better than others, but most suffer from fundamental issues of being based on TCP.  The rules of TCP guarantee reliable and orderly delivery of packets to be reassembled at the target.  Network communication wouldn’t have gotten very far if it didn’t have a way to guarantee delivery, and while this is great for most things, it doesn’t shine when it comes to video, animations, or high speed screen refreshes.
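
The ordering guarantee described above can be illustrated with a toy model.  This is only a sketch under made-up numbers (a frame every 33 ms, 100 ms of one-way latency, and a flat 300 ms penalty for a retransmitted frame); it is not how any particular display protocol or TCP stack actually behaves, and all function names are invented for the illustration.

```python
def arrival_times(num_frames, interval_ms, one_way_ms, lost, retransmit_ms):
    """Arrival time of each frame; a lost frame only arrives after a retransmit delay."""
    times = []
    for i in range(num_frames):
        t = i * interval_ms + one_way_ms
        if i in lost:
            t += retransmit_ms  # assumed flat penalty, not a real TCP timer
        times.append(t)
    return times


def ordered_display(arrivals):
    """Reliable, in-order delivery: no frame is shown before all earlier frames."""
    shown, latest = [], 0
    for t in arrivals:
        latest = max(latest, t)
        shown.append(latest)
    return shown


def lossy_display(num_frames, interval_ms, one_way_ms, lost):
    """Loss-tolerant delivery: a lost frame is skipped (None), the rest show on arrival."""
    return [None if i in lost else i * interval_ms + one_way_ms
            for i in range(num_frames)]


if __name__ == "__main__":
    lost = {2}  # hypothetical: the third frame is dropped once
    arrivals = arrival_times(6, 33, 100, lost, 300)
    print(ordered_display(arrivals))
    print(lossy_display(6, 33, 100, lost))
```

In this toy run, the frames sent after the lost one reach the receiver on time but are held back until the retransmitted frame shows up, while the loss-tolerant model simply skips one frame and keeps going.  For a spreadsheet that trade is wrong; for a screen that is about to be redrawn anyway, it can be the better one.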

So why would a software company be so curious about this new set of technologies? Technologies that for the most part, we have no control of?  The answer is that it can change how one architects software. For example:

  • Sometimes there are calls for a true two-piece, client server software, where the client application and the server application are working together to perform their own dedicated tasks, but in different locations.  The goal with this type of client server arrangement was to use resources as efficiently as possible. It is not as common as it once was, and history has demonstrated that this approach can be complex, fraught with problems, and very expensive to develop and maintain, especially when dealing with heterogeneous environments.  It may work in some environments, but not in others.  It may or may not use chatty protocols that simply don’t work well over poor network conditions. The further the two are separated, the more you introduce issues.  Firewalls, network traversal, etc. all complicate matters.
  • The other form of client server that people are most familiar with is where the application and computing power is in one location (e.g. The desktop) while the files being accessed are on a file server.  While it is very clear what is doing the work, it begins to falter with large data, or traversing geographic boundaries.  Look at the traffic generated by opening up a modest sized spreadsheet over a VPN connection, and you’ll see what I’m talking about.

With the trends on virtualizing infrastructures, the datacenters have collapsed to be more centrally located. If the client system (a workstation, or in this case, a VM) is now in the same virtual space as where the data lives (the server, or in this case, a VM), it can have an impact on application design.  Instead of using “client rendering” the model uses “on-host” rendering, where all of the heavy lifting is performed inside the Datacenter. Meanwhile, the endpoint devices, such as laptops, tablets, zero or thin clients, which do not need to have much horsepower, are really just acting in a presentation type role. It’s a bit like wondering how much CPU power your TV has.  Answer… It doesn’t matter, as the score of the baseball game is being rendered on-host.

The Plan
I wanted to make sure that I wasn’t claiming that VDI will change everything, and that every single person will be working off of a zero client in 12 months.  Nor was I suggesting that running a virtual desktop on a tablet is the ideal interface for interacting with a desktop.  That wasn’t my point.  It is ultimately how applications are delivered to the end user.  I wanted to help everyone understand that for the majority of our customers, they simply work on the systems that are provided to them by their IT Department.  Rarely does an end user get line-item-veto power on what they use for a computer, and since we know IT infrastructures are changing, the evidence suggests that what the customers will be using is going to change as well.  In fact, there is pretty good likelihood that we will have a customer who shows up one day to their office with nothing more than a zero client in front of them.

The idea behind the PoC was to invest a minimum amount to provide as much information as possible to make smart decisions in the future.  In that spirit, I also plan on revealing results in these posts along the way.  I knew my initial project wasn’t going to include the time to look into every feature of a fully deployed VDI.  I was going to stick to the basics, and see how they work.

  • Test access to VM’s using VMware View as the connection broker (via PCoIP, and RDP)
  • Test access to high end physical workstations with VMware View as the connection broker. (via PCoIP, and RDP)
  • Test these systems from a variety of endpoints.  From existing PC’s, to zero clients, to Linux desktops and Mac clients, to wireless tablet devices and smart phones.
  • Test these systems from a variety of connection scenarios.  A connection to something on the LAN is very different than something on another continent. 
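
One way to keep track of a plan like the one above is to enumerate it as a test matrix.  The sketch below is only an illustration: the category labels are taken from the bullets above, the two connection scenarios are just the two extremes mentioned there, and the function name is made up.  Not every combination will necessarily be valid or worth testing, so treat the result as a checklist to prune rather than a mandate.

```python
from itertools import product

# Labels taken from the plan above; adjust to match your own pilot.
TARGETS = ["virtual machine", "physical workstation"]
PROTOCOLS = ["PCoIP", "RDP"]
ENDPOINTS = ["existing PC", "zero client", "Linux desktop",
             "Mac client", "wireless tablet", "smart phone"]
CONNECTIONS = ["LAN", "another continent"]


def build_test_matrix():
    """Return every target/protocol/endpoint/connection combination as a dict."""
    return [
        {"target": t, "protocol": p, "endpoint": e, "connection": c}
        for t, p, e, c in product(TARGETS, PROTOCOLS, ENDPOINTS, CONNECTIONS)
    ]


if __name__ == "__main__":
    matrix = build_test_matrix()
    print(len(matrix), "combinations to consider")
    for case in matrix[:3]:
        print(case)
```

Even a small pilot multiplies out quickly (2 x 2 x 6 x 2 combinations here), which is a good reason to decide up front which ones matter most.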

Bullet number 2 may have caught your eye.  Creating a unified remote display experience isn’t limited to just virtual machines.  With the power of PCoIP, and using VMware View as a connection broker, one can house high end workstations IN the Datacenter.  They would be a 1:1 assignment (1 active user to 1 workstation) like a traditional workstation, but they’d be close to the storage (perhaps even directly connected to the SAN if desired), and offer the full GPU capabilities of the workstation.  Access to them would be no different than if they were VMs.  In fact, the end user may never know if it is physical, or virtual.  It’s a provocative thought for users of CAD, solid modeling, or simulation analytics software.

Components of my Pilot Project

  • VDI software.  VMware View 5 Premier (their 10 client “starter pack”) running on vSphere 5.0 infrastructure.  This would offer the abilities of the connection broker, the variety of connection protocols (PCoIP, and RDP) among other features.
  • Back-end storage.  For our small pilot project, this was limited to our existing Dell EqualLogic SAN arrays.  They would be fine for this small pilot.  However, I have no illusions about the storage I/O requirements of VDI at a larger scale, and hope that if things go well, a super-fast EqualLogic hybrid SSD/SAS array is in our future.  More on the subject matter of storage and its importance to the success of VDI in future posts.
  • Firewalls.  Not to be overlooked, this plays a significant role in how you can present, and secure content.  I have the good fortune to be working with what I believe to be the best of breed.  Microsoft Forefront Threat Management Gateway (TMG) 2010 running on a Celestix MSA appliance.  (more on this in future posts)
  • HTML5 presenter.  While View 5 doesn’t natively support HTML5, I wanted to see what this was like.  Not only would it aid the ability to evaluate our software, but it would give our developers some insight on how HTML5 may play a part in virtualized application delivery (e.g. AppBlast.  Gee, can you tell I’m antsy to get my hands on this?).  For this experiment, I’ll be using Ericom’s Access Now for VMware View.
  • Zero Client.  For this, I will be using a Wyse P20.
  • PCoIP host card.  This is an eVGA HD02 PCoIP host card installed in a higher end workstation.  I will be using it to test performance against a high end workstation sitting remotely in the Datacenter, and brokered by VMware View.
  • My Primary Site, and my Colocation site.  The CoLo site is not used for anything other than where my offsite SAN replicas go to. My plan is to change that.  Long term intentions are to house services that are more appropriate for that location, while in the near term, I will be housing part of my VDI pilot there.

Early lessons learned
If you are reading this post, more than likely you are well entrenched in the world of virtualization.  Never underestimate the lack of knowledge in this arena among your coworkers or stakeholders.  About 3 years ago, I started giving a monthly IT review to everyone in our company on what is going on in IT.  This helps dispel the myths behind the giant IT curtain, and possibly gives some insight as to the complexities of modern environments.  But no matter how much information I provide, I’m constantly challenged by these technologies, with the occasional question of, “now who is VMware, and what do they do?”  or “What’s a SAN?”  Be prepared to repeat your message several times over.

I would also emphasize VDI as not being an either/or scenario.  It is another form of a computing environment that provides unique capabilities to deliver applications and content.  We know that there are many vehicles for this already, and it continues to evolve.  So in other words, no need to make bold claims about VDI.  This also keeps you out of the business of predicting the future – not a favorable occupation in my book. 

As you have the opportunity to have users try out various use cases, you may have to throttle any over exuberance on the user’s behalf.  Large deployments are different than pilots.  The end user may see the brilliance of the solution, while the budget line owners see nothing but the large capital investment. 

Coming up
In upcoming posts, I’ll share how I chose to design a pilot VDI arrangement for our testing internally, externally, and how we plan to use it for our own internal needs, as well as our customers.  I have no idea how many parts this series is going to be, so bear with me.  What I hope to do is to give you a better understanding of a VDI pilot project in the real world, providing enough detail to be helpful with your own planning.

Resources
The VMware, NVIDIA Partnership announcement
http://www.vmware.com/company/news/releases/vmw-vmworld-emea-nvidia-joint-10-19-11.html

Planning for VDI has little to do with the Desktop
http://whiteboardninja.wordpress.com/2011/01/24/planning-for-vdi-has-little-to-do-with-the-desktop/ 

VMware vSphere 4.1 Networking Performance
http://www.vmware.com/files/pdf/techpaper/Performance-Networking-vSphere4-1-WP.pdf

PCoIP FAQ’s
http://www.teradici.com/pcoip/pcoip-technology/pcoip-faqs.php

November 17, 2011

Tips for using Dell’s updated EqualLogic Host Integration Tools – VMWare Edition (HIT/VE)

Filed under: SAN,virtualization — ITforMe @ 9:05 am

 

Ever since my series of posts on replication with a Dell EqualLogic SAN, I’ve had a lot of interest from other users wondering how I actually use the built-in tools provided by Dell EqualLogic to protect my environment.  This is one of the reasons why I’ve written so much about ASM/ME, ASM/LE, and SANHQ.  Well, it’s been a while since I’ve touched on any information about ASM/VE, and since I’ve updated my infrastructure to vSphere 5.0 and the HIT/VE 3.1, I thought I’d share a few pointers that have helped me work with this tool in my environment.

The first generation of HIT/VE was really nothing more than a single tool referred to as “Auto-Snapshot Manager / VMware Edition” or ASM/VE.  A lot has changed, as it is now part of a larger suite of VMware-centric tools from EqualLogic called the Host Integration Tools / VMware Edition or HIT/VE.  This consists of the following: EqualLogic Auto-Snapshot Manager, EqualLogic Datastore Manager, and the EqualLogic Virtual Desktop Deployment Utility.  HIT/VE is one of three Host Integration toolsets, the others being HIT/ME and HIT/LE for Microsoft and Linux respectively.

Ever since HIT/VE 3.0, Dell EqualLogic thankfully transitioned toward an appliance/plug-in model.  This reduced overhead, complexity, and removed some of the quirks with the older implementations.  Because I had been lagging behind in updating vSphere, I was still using 2.x up until recently, and skipped right over 3.0 to 3.1.  Surprisingly, many of the same practices that have served me well with the older version adapt quite well to the new version.

Let me preface this by saying that these are just my suggestions based on personal use of all versions of the HIT over the past 3 years.  Just as with any solution, there are a number of different ways to achieve the same result.  The information provided may or may not align with best practices from Dell, or your own practices.  But the tips I provide have stood up to the rigors of a production environment, and have actually worked in real recovery scenarios.  Whatever decisions you make should complement your larger protection strategies, as this is just one piece of the puzzle.

Tips for Configuring and working with the  HIT/VE appliance

1.  The initial configuration will ask for registration in vCenter (configuration item #8 on the appliance).  You may only register one HIT/VE appliance in vCenter.

2.  The HIT/VE appliance was designed to integrate with vCenter.  But it also offers the flexibility of access.  After the initial configuration, you can verify and modify settings in the respective ASM appliances by browsing directly to their IP address, FQDN, or DNS alias name.  You may type in: http://[applianceFQDN] or for the Auto-Snapshot Manager, type in http://[applianceFQDN]/vmsnaptool.html

3.  Configuration of the storage management network on the appliance is optional, and depending on your topology, may not be needed.

4.  When setting up replication partners, ASM will ask for a “Server URL.”  This implies you should enter an “http://” or “https://” prefix, but a true URL will not work.  Just enter the IP address or FQDN without the prefix.

5.  After you have configured your HIT/VE appliances, run back through and double check the settings.  I had two of them mysteriously reset some DNS configuration during the initial deployment.  It’s been fine since that time.  It might have been my mistake (twice), but it might not.

6. For just regular (local) SmartCopies, create one HIT/VE appliance.  Have the appliance sit in its own small datastore.  Make sure you do not protect this volume via ASM. Dell warns you about this.  For environments where replication needs to occur, set up a second HIT/VE appliance at the remote site.  The same rules apply there.

7.  Log files on the appliance are accessible via Samba.  I didn’t discover this until I was working through the configuration and some issues I was running into.  What a pleasant way to pull the log data off of the appliance.  Nice work!
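
Tips 2 and 4 above are easy to trip over, since one place wants a full URL and the other wants only the host.  Here is a minimal sketch of a helper that handles both.  The function names and the example hostname are hypothetical; only the /vmsnaptool.html path comes from the tip above.

```python
from urllib.parse import urlsplit


def bare_host(value):
    """Return only the host (IP or FQDN) from whatever was typed or pasted."""
    value = value.strip()
    if "://" not in value:
        # Prefix "//" so urlsplit treats the text as a network location.
        value = "//" + value
    return urlsplit(value).hostname or ""


def asm_url(host):
    """Browser address for Auto-Snapshot Manager on the appliance (tip 2)."""
    return "http://" + bare_host(host) + "/vmsnaptool.html"


if __name__ == "__main__":
    # "hitve.example.com" is a made-up name for illustration.
    pasted = "https://hitve.example.com/"
    print(bare_host(pasted))   # value for the "Server URL" field (tip 4)
    print(asm_url(pasted))     # address to browse to (tip 2)
```

The point is simply to strip the scheme and any trailing path before pasting into the replication partner dialog, and to add them back when browsing to the appliance.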

Tips for ASM/VE

8.  Just as I learned and recommended in 2.x, the most important suggestion I have for successfully utilizing ASM/VE in your environment is to arrange vCenter folders to represent the contents of your datastores.  Include in the name some indication of the referenced volume/datastore (seen in the screen capture below, where “103” refers to a datastore called VMFS103).  The reason for this is so that you can keep your SmartCopy snapshots straight during creation.  If you don’t do this, when you make a SmartCopy of a folder containing VM’s that reside in multiple datastores, you will see SAN snapshots in each one of those volumes, but they didn’t necessarily capture all of the data correctly.  You will get really confused, and confusion is not what you need when understanding the what and how of recovering systems or data.

[image: screen capture of vCenter folders named after their datastores]

9.  Schedule or manually create SmartCopy Snapshots by folder.  Schedule or manually create SmartCopy Replicas by datastore.  Replicas cannot be created by vCenter folder.  This strategy has been the most effective for me, but if you don’t feel like re-arranging your folders in vCenter, you could schedule or manually create SmartCopy Snapshots by datastore as well.

10.  Do not schedule or create SmartCopies by individual machine.  This will get confusing (see above), and may interfere with your planning of snapshot retention periods.  If you want to protect a system against some short term step (e.g. installing a service pack, etc.), just use a hypervisor snapshot, and remove it when complete.

11.  ASM/VE 2.x was limited to making smart copies of VM’s that had vmdk files all in the same location.  3.x does not have this limitation.  This offers up quite a bit of flexibility if you have VM’s with independent vmdks in other datastores.

12. Test, and document. Create a couple of small volumes, large enough to hold 2 test VM’s in each.  Make a SmartCopy of the VMware folder where those VM’s reside.  Do a few more SmartCopies, then attempt a restore.  Test.  Add a vmdk in another datastore to one of the VM’s then test again.  This is the best way to not only understand what is going on, but to have no surprises or trepidation when you have to do it for real.  It is especially important to understand how the other VM’s in the same datastore will behave, and how VM’s with multiple vmdks in different datastores will act, as well as what a “restore by rollback” is.  And while you’re at it, make a OneNote or Word document outlining the exact steps for recovery, and what to expect.  Create one for local SmartCopies, and another for remote replicas.  This keeps you thinking clearly in the heat of the moment.  Your goal is to make things better by a restore, not worse. Oh, and if you can’t find the time to document the process, don’t worry, I’m sure the next guy who replaces you will find the time.

13.  Set snapshot and replication retention numbers in ASM/VE.  This much needed feature was added to the 3.0 version.  Tweak each snapshot reserve to a level that you feel comfortable with, and that also matches up against your overall protection policies.  There will be some tuning for each volume so that you can offer the protection times needed, without allocating too much space to snapshot reserves.  ASM will only be able to manage the snapshots that it creates, so if you have some older snaps of a particular datastore, you may need to do a little cleanup work.

14.  Watch the frequency!!!  The only thing worse than not having a backup of a system or data, is to have several bad copies of it, and to realize that the last good one just purged itself out.  A great example of this is something going wrong on a Friday night.  You may not notice it until mid-day on Monday.  But your high frequency SmartCopies only had room for two days’ worth of changed data.  With ASM/VE, I tend to prefer very modest frequencies.  Once a day is fine with me on many of my systems.  Most of the others that I like to have more frequent SmartCopies of have the actual data on guest attached volumes.  Early on in my use, I had a series of events that were almost disastrous, all because I was overzealous on the frequency, but not mindful enough of the retention.  Don’t be a victim of the ease of cranking up the frequency at the expense of retention.  This is something you’ll never find in a deployment or operations guide, and applies to all aspects of data protection.

15.  If you are creating SmartCopy snapshots and SmartCopy replicas, use your scheduling as an opportunity to shrink the window of vulnerability.  Instead of running a once-a-day replica right after a once-a-day snapshot, split the difference so that the replica runs in between the last SmartCopy snapshot and the next one.

16.  Keep your SmartCopy and replica frequencies and scheduling as simple as possible.  If you can’t understand it, who will?  Perhaps start with a frequency rate of just once a day for all of your datastores, then go from there.  You might find a frequency such as once a day might work for 99% of your systems.  I’ve found that for most of my data that I need to protect at more frequent intervals, those are on guest attached volumes anyway, and I schedule those up via ASM/ME to meet my needs.

17.  For SmartCopy snapshots, I tend to schedule them so that there is only one job on one datastore at a time, with the next one scheduled say 5 minutes afterward.  For SmartCopy replicas, if you choose to use free pool space instead of replica reserve (as I do), you might want to offset those more, so that the replica has time to fully complete and the space held by the invisible local replica can be reclaimed for the next job.  Generally this isn’t too much of an issue, unless you are really tight on space.

18.  The SmartCopy defaults have been changed a bit since ASM/VE 2.x. No need to tick any of the checkboxes such as “Perform virtual machine memory dump” and “Set created PS Series snapshots online.”  In fact, I would untick the “Included PS Series volumes access by guest iSCSI initiators” option.  More info on why below.

19.  ASM/VE still gives you the option to snapshot volumes that are attached to that VM via guest iSCSI initiators.  In general, don’t do it.  Why? If you chose to use this option for Microsoft based VM’s, it would indeed make a snapshot, giving you the impression that all is well, but these would not be coordinated with the internal VSS writer inside the VM, so they are not truly application consistent snapshots of the guest volumes.  Sure, they might work, but they might not.  They may also interfere with your retention policies in ASM/ME.  Do you really want to take that chance with your Exchange or SQL databases, or flat file storage?  If you think flat file storage isn’t important to quiesce, remember that source code control systems like Subversion typically use file systems, and not a database.  It is my position that the only time you should use this option is if you are protecting a Linux VM with guest attached volumes.  Linux has no equivalent to VSS, so you get a free pass on using this option.  However, because this option is a per-job definition, you’ll want to separate Windows based VM’s with guest volumes from Linux based VM’s with guest volumes.  If you wanted to avoid that, you could just rely on a crash consistent copy of that Linux guest attached volume via a scheduled snapshot in the Group Manager GUI.  So the moral of the story is this.  To protect your guest attached volumes in VM’s running Windows, rely entirely on ASM/ME to create a SmartCopy SAN snapshot of your guest attached volumes.

20.  If you need to cherry-pick a file off of a snapshot, or look at an old registry setting, consider restoring or cloning to another volume, and make sure that the restored VM does not have any direct access to the same network that the primary system is running on.  Having a special portgroup in vCenter that is just for this purpose works nicely.  Many times this method can be the least disruptive to your environment.

21.  I still like to have my DC’s in individual datastores, on their own, and create SmartCopy schedules that do not occur simultaneously.  I found that in practice, our very sensitive automated code compiling system which has dozens (if not hundreds) of active ssh sessions ran into less interference this way compared to when I initially had them in one datastore, or intertwined in datastores with other VMs.  Depending on the number of DCs you have, you might be able to group a couple together, with perhaps splitting off the DC running the PDC emulator role into a separate datastore.  Beware that the SmartCopy for your DC should just be considered as a way to protect the system, not AD.  More info on my post about protecting Active Directory here.
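
The frequency, retention, and scheduling advice in tips 14, 15, and 17 above comes down to some simple arithmetic.  The sketch below is a simplified model, not anything ASM/VE does internally: it assumes retention is a fixed count of copies rather than reserve space, assumes the last good copy finished just before the failure (the best case), and uses made-up function names and dates.  VMFS101 and VMFS102 are hypothetical datastore names alongside the VMFS103 mentioned earlier.

```python
from datetime import datetime, timedelta


def last_good_copy_survives(failure, noticed, copies_kept, interval):
    """True if the copy taken just before the failure is still retained when noticed."""
    copies_since_failure = (noticed - failure) // interval
    # The good copy is kept only while fewer than copies_kept newer ones exist.
    return copies_since_failure <= copies_kept - 1


def replica_time(snapshot_time, interval):
    """Tip 15: run the replica halfway between one snapshot and the next."""
    return snapshot_time + interval / 2


def stagger(datastores, first_start, gap=timedelta(minutes=5)):
    """Tip 17: one job per datastore, each starting a fixed gap after the previous."""
    return {name: first_start + i * gap for i, name in enumerate(datastores)}


if __name__ == "__main__":
    failure = datetime(2011, 11, 11, 22, 0)   # Friday night
    noticed = datetime(2011, 11, 14, 12, 0)   # mid-day Monday
    # Hourly copies, two days' worth kept, versus daily copies kept for a week.
    print(last_good_copy_survives(failure, noticed, 48, timedelta(hours=1)))
    print(last_good_copy_survives(failure, noticed, 7, timedelta(days=1)))

    first = datetime(2011, 11, 14, 1, 0)
    print(replica_time(first, timedelta(days=1)))
    print(stagger(["VMFS101", "VMFS102", "VMFS103"], first))
```

In the Friday-to-Monday example, 62 hours pass before anyone notices.  The hourly schedule with two days of retention has already rolled past the last good copy, while the modest once-a-day schedule kept for a week still has it.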

Tips for DataStore Manager

22.  The Datastore Manager in vCenter is one of my favorite new ways to view my PS Group.  Not only do you get a quick check on how your datastores look (limiting the view to just VMFS volumes), but it also shows which volumes have replicas in flight.  It has quickly become one of my most used items in vCenter.

23.  Use the ACL policies feature in Datastore Manager. With the new integration between vCenter and the Group Manager, you can easily create volumes. The ACL policies feature in the HIT/VE is a way for you to save a predetermined set of ACL’s for your hosts (CHAP, IP, or IQN).  While I personally prefer using IQN’s, any combination of the three will work.  Having an ACL policy is a great way to provision the access to a volume quickly.  If you are using manually configured multi-pathing, it is important to note that creating datastores this way will use a default pathing of “VMWare fixed.”  You will need to manually change that to “VMWare Round Robin.”  I am told that if you are using the EqualLogic Multi-pathing Extension Module (MEM), this will be set to the proper setting.  I don’t know that for sure because MEM hasn’t been released for vSphere 5.0 as of this writing.

24.  VMFS5 offers some really great features, but many of them are only available if they were natively created (not upgraded from VMFS3).  If you choose to recreate them by doing a little juggling with Storage vMotion (as I am), remember that this might wreak havoc on your replication, as you will need to re-seed the volumes.  But if you can, you are exposed to many great features of VMFS5.  You might also use this as an opportunity to revisit your datastores and re-arrange if necessary.

25.  If you are going to redo some of your volumes from scratch (to take full advantage of VMFS5), if they are replicated, redo the volumes with the highest change rate first.  They’re already pushing a lot of data through your pipe, so you might as well get them taken care of first.  And who knows, your replicas might be improved with the new volume.

Hopefully this gives you a few good ideas for your own environment.  Thanks for reading.

November 11, 2011

Upgrading to vSphere 5.0 by starting from scratch. …well, sort of.

Filed under: Architecture/planning,SAN,virtualization — ITforMe @ 11:40 am

It is never any fun getting left behind in IT.  Major upgrades every year or two might not be a big deal if you only had to deal with one piece of software, but take a look at most software inventories, and you’ll see possibly dozens of enterprise level applications and supporting services that all contribute to the chaos.  It can be overwhelming for just one person to handle.  While you may be perfectly justified in holding off on specific upgrades, there still seems to be a bit of guilt around doing so.  You might have ample business and technical factors to support such decisions, and a well crafted message providing clear reasons to stakeholders.  Yet the business and political pressures ultimately win out, and you find yourself addressing the more customer/user facing application upgrades before the behind-the-scenes tools that power it all.

That is pretty much where I stood with my virtualized infrastructure.  My last major upgrade was to vSphere 4.0.  Sure, I had visions of keeping up with every update and patch, but a little time passed, and several hundred distractions later, I found myself left behind.  When vSphere 4.1 came out, I also had every intention of upgrading.  However, I was one of the legions of users who had a vCenter server running on a 32bit OS, and that complicated matters a little bit.  I looked at the various publications and posts on the upgrade paths and experiences.  Nothing seemed quite as easy as I was hoping for, so I did what came easiest to my already packed schedule; nothing.  I wondered just how many Administrators found themselves in the same predicament; not touching an aging, albeit perfectly fine running system.  

My ESX 4.0 cluster served my organization well, but times change, and so do needs.  A few things came up to kick-start the desire to upgrade.

  • I needed to deploy a pilot VDI project, fast.  (more about this in later posts)
  • We were a victim of our own success with virtualization, and I needed to squeeze even more power and efficiency out of our investment in our infrastructure.

Both are pretty good reasons to upgrade, and while I would have loved to do my typical due diligence on every possible option, I needed a fast track.  My move to vSphere 5.0 was really just a prerequisite of sorts to my work with VDI. 

But how should I go about an upgrade?

Do I update my 4.0 hosts to the latest update that would be eligible for an upgrade path to 5.0, and if so, how much work would that be?  Should I transition to a new vCenter server, migrating the database, then run a mixed environment of ESX hosts running with different versions?  What sort of problems would that introduce?  After conferring with a trusted colleague of mine who always seems to have pragmatic sensibilities when it comes to virtualization, I decided which option was going to be the best for me.  I opted not to do any upgrade, and simply transition to a pristine new cluster.  It looked something like this:

  • Take a host (either new, or by removing an existing one from the cluster), and build it up with ESXi 5.0.
  • Build up a new 64bit VM for running a brand new vCenter, and configure as needed.
  • Remove one VM at a time from the old cluster by powering it down, removing it from inventory, and adding it to the new cluster.
  • Once enough VM’s have been removed, take another host, remove from the old cluster, rebuild as ESXi 5.0, and add to the new cluster.
  • Repeat until finished.

For me, the decision to start from scratch won out.  Why?

  • I could build up a pristine vCenter server, with a database that wasn’t going to carry over any unwanted artifacts of my previous installation.
  • I could easily set up the new vCenter to emulate my old settings.  Folders, EVC settings, resource pools, etc.
  • I could transition or build up my supporting VM’s or appliances to my new infrastructure to make sure they worked before committing to the transition.
  • I could afford a simple restart of each VM as I transitioned it to a new cluster.  I used this as an opportunity to update the VMware Tools when added to the new inventory.
  • I was willing to give up historical data in my old vSphere 4.0 cluster for the sake of simplicity of the plan and cleanliness of the configuration.
  • Predictability.  I didn’t have to read a single white paper or discussion thread on database migrations or troubles with DSNs.
  • I have a well documented ESX host configuration that is not terribly complex, and easy to recreate across 6 hosts.
  • I just happened to have purchased an additional blade and license of ESX, so it was an ideal time to introduce it to my environment.
  • I could get my entire setup working, then get my licensing figured out after it’s all complete.

You’ll notice that one option similar to this approach would have been to simply remove a host of running VM’s out of the existing cluster, and add it to the new cluster.  This may have been just as good of a plan, as it would have avoided the need to manually shut down and remove each VM one at a time during the transition.  However, I would have needed to run a mix of ESX 4.0 and 5.0 hosts in the new cluster.  I didn’t want to carry anything over from the old setup.  I would have needed to upgrade or rebuild the host anyway, and I had to restart each VM to make sure it was running the latest tools.  If for nothing other than clarity of mind, my approach seemed best for me.

Prior to beginning the transition, I needed to update my Dell EqualLogic firmware to 5.1.2.  A collection of very nice improvements made this a welcome upgrade, but it was also a requirement for what I wanted to do.  While the upgrade itself went smoothly, it did re-introduce an issue or two.  The folks at Dell EqualLogic are aware of this, and are working to address it, hopefully in their next release.  The combination of the firmware upgrade and vSphere 5 allowed me to use the latest and greatest tools from EqualLogic, primarily the Host Integration Tools VMware Edition (HIT/VE) and the storage integration in vSphere thanks to VASA.  However, as of this writing, EqualLogic does not have a full production release of their Multipathing Extension Module (MEM) for vSphere 5.0.  The EPA version was just released, but I’ll probably wait for the full release of MEM to come out before I apply it to the hosts in the cluster.

While I was eager to finish the transition, I didn’t want to prematurely create any problems.  I took a page from my own lessons learned during my upgrade to ESX 4.0, and exercised some restraint when it came to updating my Virtual Hardware for each VM to version 8.  My last update of Virtual Hardware levels in each VM caused some unexpected results, as I shared in “Side effects of upgrading VM’s to Virtual Hardware 7 in vSphere.”  Apparently, I wasn’t the only one who ran into issues, because that post has statistically been my all time most popular post.  The abilities of Virtual Hardware 8 powered VMs are pretty neat, but I’m in no rush to make any virtual hardware changes to some of my key production systems, especially those noted.

So, how did it work out?  The actual process completed without a single major hang-up, and I am thrilled with the result.  The irony here is that even though vSphere provides most of the intelligence behind my entire infrastructure, and does things that are mind bogglingly cool, it was so much easier to upgrade than say, SharePoint, AD, Exchange, or some other enterprise software.  Great technologies are great because they work like you think they should.  No exception here.  If you are considering a move to vSphere 5.0, and are a little behind on your old infrastructure, this upgrade approach might be worth considering.

Now, onto that little VDI project…

Resources

A great resource on setting up SQL 2008 R2 for vCenter
How to Install Microsoft SQL Server 2008 R2 for VMware vCenter 5

Installing vCenter 5 Best Practices
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2003790

A little VMFS 5.0 info
http://www.yellow-bricks.com/2011/07/13/vsphere-5-0-what-has-changed-for-vmfs/

Information on the EqualLogic Multipathing Extension Module (MEM), and if you are an EqualLogic customer, why you should care.
https://whiteboardninja.wordpress.com/2011/02/01/equallogic-mem-and-vstorage-apis/

September 6, 2011

Using the Dell EqualLogic HIT for Linux

Filed under: SAN,virtualization — ITforMe @ 2:35 am

 

I’ve been a big fan of Dell EqualLogic Host Integration Tools for Microsoft (HIT/ME), so I was looking forward to seeing how the newly released HIT for Linux (HIT/LE) was going to pan out.  The HIT/ME and HIT/LE offer unique features when using guest attached volumes in your VM’s.  What’s the big deal about guest attached volumes?  Well, here is why I like them.

  • It keeps the footprint of the VM really small.  The VM can easily fit in your standard VMFS volumes.
  • Portable/replaceable.  Often times, systems serving up large volumes of unstructured data are hard to update.  Having the data as guest attached means that you can easily prepare a new VM presenting the data (via NFS, Samba, etc.), and cut it over without anyone knowing – especially when you are using DNS aliasing.
  • Easy and fast data recovery.  My “in the trenches” experience with the guest attached volumes in VM’s running Microsoft OS’s (and EqualLogic’s HIT/ME) has proven that recovering data off of guest attached volumes is just easier – whether you recover it from snapshot or replica, clone it for analysis, etc.
  • Better visibility of performance. Thanks to the independent volume(s), one can easily see with SANHQ what the requirements of that data volume are.
  • More flexible protection.  With guest attached volumes, it’s easy to crank up the frequency of snapshot and replica protection on just the data, without interfering with the VM that is serving up the data.
  • Efficient, tunable MPIO. 
  • Better utilization of space.  If you wanted to serve up a 2TB volume of storage using a VMDK, more than likely you’d have a 2TB VMFS volume, and something like a 1.6TB VMDK file to accommodate hypervisor snapshots.  With a native volume, you would be able to use the entire 2TB of space. 
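To put rough numbers on that last point, here is a minimal sketch of the arithmetic.  It assumes about 20% of the VMFS volume is held back for hypervisor snapshots; that percentage is only inferred from the 2TB and 1.6TB figures above, and is not a VMware rule.  The function name is made up for illustration.

```shell
#!/bin/bash
# Usable VMDK size (in GB) when a share of the VMFS volume is reserved
# for hypervisor snapshots. reserve_pct is an assumed figure.
usable_vmdk_gb() {
    local vmfs_gb=$1 reserve_pct=$2
    echo $(( vmfs_gb * (100 - reserve_pct) / 100 ))
}

usable_vmdk_gb 2048 20   # 2TB VMFS volume, 20% reserve: prints 1638
```

A guest attached volume has no such reserve, so the full 2048GB is available to the file system.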

The one “gotcha” about guest attached volumes is that they aren’t visible by the vCenter API, so commercial backup applications that rely on the visibility of these volumes via vCenter won’t be able to back them up.  If you use these commercial applications for protection, you may want to determine if guest attached volumes are a good fit, and if so, find alternate ways of protecting the volumes.  Others might contend that because the volumes aren’t seen by vCenter, one is making things more complex, not less.  I understand the reason for thinking this way, but my experience with them has proven quite the contrary.

Motive
I wasn’t trying out the HIT/LE because I ran out of things to do.  I needed it to solve a problem.  I had to serve up a large amount (several Terabytes) of flat file storage for our Software Development Team.  In fact, this was just the first of several large pools of storage that I need to serve up.  It would have been simple enough to deploy a typical VM with a second large VMDK, but managing such an arrangement would be more difficult.  If you are ever contemplating deployment decisions, remember that simplicity and flexibility of management should trump simplicity of deployment if it’s a close call.  Guest attached volumes align well with the “design as if you know it’s going to change” concept.  I knew from my experience with working with guest attached volumes for Windows VM’s, that they were very agile, and offered a tremendous amount of flexibility.

But wait… you might be asking, “If I’m doing nothing but presenting large amounts of raw storage, why not skip all of this and use Dell’s new EqualLogic FS7500 Multi-Protocol NAS solution?”  Great question!  I had the opportunity to see the FS7500 NAS head unit at this year’s Dell Storage Forum.  The FS7500 turns the EqualLogic block based storage accessible only on your SAN network into CIFS/NFS storage presentable to your LAN.  It is impressive.  It is also expensive.  Right now, using VM’s to present storage data is the solution that fits within my budget.  There are some downfalls (Samba not supporting SMB2), but for the most part, it falls in the “good enough” category.

I had visions of this post focusing on the performance tweaks and the unique abilities of the HIT/LE.  After implementing it, I was reminded that it was indeed a 1.0  product.  There were enough gaps in deployment information that I felt it necessary to provide information on exactly how I actually made the HIT for Linux work.  IT Generalists who I suspect make up a significant amount of the Dell EqualLogic customer base have learned to appreciate their philosophy of “if you can’t make it easy, don’t add the feature.”   Not everything can be made intuitive however, especially the first time around.

Deployment Assumptions 
The scenario and instructions are for a single VM that will be used to serve up a single large volume for storage. It could serve up many guest attached volumes, but for the sake of simplicity, we’ll just be connecting to a single volume.

  • VM with 3 total vNICs.  One used for LAN traffic, and the other two, used exclusively for SAN traffic.  The vNIC’s for the SAN will be assigned to the proper vswitch and portgroup, and will have static IP addresses.  The VM name in this example is “testvm”
  • A single data volume in your EqualLogic PS group, with an ACL that allows for the guest VM to connect to the volume using CHAP, IQN, or IP addresses.  (It may be easiest to first restrict it by IP address, as you won’t be able to determine your IQN until the HIT is installed).  The native volume name in this example is “nfs001” and the group IP address is 10.1.0.10
  • Guest attached volume will be automatically connected at boot, and will be accessible via NFS export.  In this example I will be configuring the system so that the volume is available via the “/data1” directory.
  • OS used will be RedHat Enterprise Linux (RHEL) 5.5. 
  • EqualLogic’s HIT 1.0

Each step below that starts with the word “VERIFICATION” is not a necessary step, but it helps you understand the process, and will validate your findings.  For brevity, I’ve omitted some of the output of these commands.

Deploying and configuring the HIT for Linux
Here we go…

Prepping for Installation

1.     Verify installation of EqualLogic prerequisites (via rpm -q [pkgname]).  If not installed, run yum install [pkgname]

openssl                    (0.9.8e for RHEL 5.5)

libpcap                    (0.9.4 for RHEL 5.5)

iscsi-initiator-utils      (6.2.0.871 for RHEL 5.5)

device-mapper-multipath    (0.4.7 for RHEL 5.5)

python                     (2.4 for RHEL 5.5)

dkms                       (1.9.5 for RHEL 5.5)

 

(dkms is not part of the RedHat repo.  It needs to be downloaded from http://linux.dell.com/dkms/ or via the "Extra Packages for Enterprise Linux" (EPEL) repository.  I chose the Dell website location because it had a newer version.  Simply download and install the RPM.)
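If you would rather not check the prerequisites one at a time, the check can be looped.  This is only a sketch: the package list is the one from step 1, and PKG_QUERY is a hypothetical override hook I added so the function can be tried on a machine without rpm.

```shell
#!/bin/bash
# Report which HIT prerequisites are missing.
# PKG_QUERY is a hypothetical override hook; it defaults to "rpm -q".
HIT_PREREQS="openssl libpcap iscsi-initiator-utils device-mapper-multipath python dkms"

missing_prereqs() {
    local pkg missing=""
    for pkg in $HIT_PREREQS; do
        # rpm -q exits non-zero when the package is not installed
        if ! ${PKG_QUERY:-rpm -q} "$pkg" >/dev/null 2>&1; then
            missing="$missing $pkg"
        fi
    done
    # Unquoted on purpose so the leading space is dropped
    echo $missing
}

# Usage on the VM (left commented so nothing is installed by accident):
#   missing=$(missing_prereqs)
#   [ -n "$missing" ] && yum install $missing
```

Keep in mind that dkms will only install through yum if you have added a repository that carries it.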

 

2.     Snapshot Linux machine so that if things go terribly wrong, it can be reversed

 

3.     Shutdown VM, and add NIC’s for guest access

Make sure to choose iSCSI network when adding to VM configuration

After startup, manually specify Static IP addresses and subnet mask for both.  (No default gateway!)

Activate NIC’s, and reboot

 

4.     Power up, then add the following lines to /etc/sysctl.conf  (for RHEL 5.5)

net.ipv4.conf.all.arp_ignore = 1

net.ipv4.conf.all.arp_announce = 2
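If you script this step, it helps to make it safe to run more than once.  The sketch below appends a setting only when the key is not already in the file; the function name is mine, not something from the HIT documentation.

```shell
#!/bin/bash
# Append "key = value" to a sysctl file only if the key is not already
# present, so re-running the script does not duplicate entries.
# (The dots in the key act as regex wildcards in grep, which is loose
# but harmless for these keys.)
ensure_sysctl() {
    local file=$1 key=$2 value=$3
    if ! grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file" 2>/dev/null; then
        echo "${key} = ${value}" >> "$file"
    fi
}

# Usage on the VM (as root):
#   ensure_sysctl /etc/sysctl.conf net.ipv4.conf.all.arp_ignore 1
#   ensure_sysctl /etc/sysctl.conf net.ipv4.conf.all.arp_announce 2
#   sysctl -p    # load the new values without a reboot
```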

 

5.     Establish NFS and related daemons to automatically boot

chkconfig portmap on

chkconfig nfs on

chkconfig nfslock on

 

6.     Establish directory which will ultimately be used to export for mounting.  In this example, the iSCSI device will mount to a directory called “eql2tbnfs001” in the /mnt directory. 

mkdir /mnt/eql2tbnfs001

 

7.     Make symbolic link called “data1” in the root of the file system.

ln -s /mnt/eql2tbnfs001 /data1 

 

Installation and configuration of the HIT

8.     Verify that the latest HIT Kit for Linux is being used for installation.  (V1.0.0 as of 9/2011)

 

9.     Import public key

      (Download the public key from the EqualLogic support site under HIT for Linux, and place it in /tmp/)

Add key:

rpm --import RPM-GPG-KEY-DELLEQL (docs show lower case, but file is upper case)

 

10.  Run installation

yum localinstall equallogic-host-tools-1.0.0-1.el5.x86_64.rpm

 

Note:  After HIT is installed, you may get the IQN for use of restricting volume access in the EqualLogic group manager by typing the following:

cat /etc/iscsi/initiatorname.iscsi

 

11.  Run eqltune (verbose).  (Tip.  You may want to capture results to file for future reference and analysis)

            eqltune -v

 

12.  Make adjustments based on eqltune results.  (Items listed below were mine.  Yours may be different)

 

            NIC Settings

   Flow Control. 

ethtool -A eth1 autoneg off rx on tx on

ethtool -A eth2 autoneg off rx on tx on

 

(add the above lines to /etc/rc.d/rc.local to make persistent)
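Since the rc.local lines only run at boot, it is worth confirming after a restart that flow control really is on.  A small sketch, assuming the usual "RX:" and "TX:" lines that ethtool -a prints; the function name is made up for this example.

```shell
#!/bin/bash
# Read "ethtool -a <nic>" output on stdin and succeed only when both
# RX and TX pause (flow control) are reported as "on".
flow_control_on() {
    awk '$1 == "RX:" { rx = $2 }
         $1 == "TX:" { tx = $2 }
         END { exit !(rx == "on" && tx == "on") }'
}

# Usage on the VM:
#   for nic in eth1 eth2; do
#       ethtool -a "$nic" | flow_control_on || echo "$nic: flow control is off"
#   done
```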

 

There may be a suggestion to use jumbo frames by increasing the MTU size from 1500 to 9000.  This has been omitted from the instructions, as it requires proper configuration of jumbos from end to end.  If you are uncertain, keep standard frames for the initial deployment.

 

   iSCSI Settings

   (make backup of /etc/iscsi/iscsid.conf before changes)

 

      Change node.startup to manual.

   node.startup = manual

 

      Change FastAbort to the following:

   node.session.iscsi.FastAbort = No

 

      Change initial_login_retry to the following:

   node.session.initial_login_retry_max = 12

 

      Change number of queued iSCSI commands per session

   node.session.cmds_max = 1024

 

      Change device queue depth

   node.session.queue_depth = 128
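The five iSCSI changes above can be applied by hand in an editor, or scripted.  Here is a hedged sketch of the scripted route: it keeps a one-time backup, rewrites an existing uncommented line for the key, and appends the line if there is none.  The function name and the .orig backup suffix are my own choices, and sed -i assumes GNU sed (which is what RHEL ships).

```shell
#!/bin/bash
# Set "key = value" in an iscsid.conf-style file.
# - Keeps a one-time backup next to the file as <file>.orig
# - Rewrites an existing uncommented line for the key, else appends one
# - Commented-out defaults are left alone
set_iscsi_param() {
    local file=$1 key=$2 value=$3
    [ -f "${file}.orig" ] || cp -p "$file" "${file}.orig"
    if grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
        sed -i "s|^[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        echo "${key} = ${value}" >> "$file"
    fi
}

# Usage on the VM (as root), using my eqltune results:
#   conf=/etc/iscsi/iscsid.conf
#   set_iscsi_param "$conf" node.startup manual
#   set_iscsi_param "$conf" node.session.iscsi.FastAbort No
#   set_iscsi_param "$conf" node.session.initial_login_retry_max 12
#   set_iscsi_param "$conf" node.session.cmds_max 1024
#   set_iscsi_param "$conf" node.session.queue_depth 128
```

Your eqltune results may call for different values, so treat the usage lines as a template rather than a recipe.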

 

13.  Re-run eqltune -v to see if changes took effect

All changes took effect, minus the NIC settings added to the rc.local file.  Looks to be a syntax error from Eql documentation provided.  It has been corrected in the documentation above.

 

14.  Run command to view and modify MPIO settings

rswcli -mpio-parameters

 

This returns the results of:  (seems to be good for now)

Processing mpio-parameters command…

MPIO Parameters:

Max sessions per volume slice:: 2

Max sessions per entire volume:: 6

Minimum adapter speed:: 1000

Default load balancing policy configuration: Round Robin (RR)

IOs Per Path: 16

Use MPIO for snapshots: Yes

Internet Protocol: IPv4

The mpio-parameters command succeeded.

 

15.  Restrict MPIO to just the SAN interfaces

Exclude LAN traffic

            rswcli -E -network 192.168.0.0 -mask 255.255.255.0

 

VERIFICATION:  List status of includes/excludes to verify changes

            rswcli -L

 

VERIFICATION:  Verify Host connection Mgr is managing just two interfaces

      ehcmcli -d

 

16.  Discover targets

iscsiadm -m discovery -t st -p 10.1.0.10

(Make sure no unexpected volumes connect.  But note the IQN name presented.  You’ll need it for later.)
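If your group presents more than a handful of volumes, picking the right IQN out of the discovery output by eye gets tedious.  A small sketch that filters the output by volume name, assuming the usual "portal,tpgt iqn" layout of each discovery line; the function name is hypothetical.

```shell
#!/bin/bash
# Read "iscsiadm -m discovery" output on stdin (one "portal,tpgt iqn"
# pair per line) and print the IQN whose name ends in -<volume>.
iqn_for_volume() {
    awk -v suffix="-$1" '
        substr($2, length($2) - length(suffix) + 1) == suffix { print $2 }'
}

# Usage on the VM:
#   iscsiadm -m discovery -t st -p 10.1.0.10 | iqn_for_volume nfs001
```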

 

VERIFICATION:  shows iface

[root@testvm ~]# iscsiadm -m iface | sort

default tcp,<empty>,<empty>,<empty>,<empty>

eql.eth1_0 tcp,00:50:56:8B:1F:71,<empty>,<empty>,<empty>

eql.eth1_1 tcp,00:50:56:8B:1F:71,<empty>,<empty>,<empty>

eql.eth2_0 tcp,00:50:56:8B:57:97,<empty>,<empty>,<empty>

eql.eth2_1 tcp,00:50:56:8B:57:97,<empty>,<empty>,<empty>

iser iser,<empty>,<empty>,<empty>,<empty>

 

VERIFICATION:  Check connection sessions via iscsiadm -m session to show that no connections exist

[root@testvm ~]# iscsiadm -m session

iscsiadm: No active sessions.

 

VERIFICATION:  Check connection sessions via /dev/mapper to show that no connections exist

[root@testvm ~]# ls -la /dev/mapper

total 0

drwxr-xr-x  2 root root     60 Aug 26 09:59 .

drwxr-xr-x 10 root root   3740 Aug 26 10:01 ..

crw-------  1 root root 10, 63 Aug 26 09:59 control

 

VERIFICATION:  Check connection sessions via ehcmcli -d to show that no connections exist

[root@testvm ~]# ehcmcli -d

 

17.  Log in on just one of the iface paths of your liking (eql.eth1_0 in this example).  Replace the IQN shown here with yours. The HIT will take care of the rest.

iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001 -I eql.eth1_0 -l

 

This returned:

[root@testvm ~]# iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001 -I eql.eth1_0 -l

Logging in to [iface: eql.eth1_0, target: iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001, portal: 10.1.0.10,3260]

Login to [iface: eql.eth1_0, target: iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001, portal: 10.1.0.10,3260] successful.

 

VERIFICATION:  Check connection sessions via iscsiadm -m session

[root@testvm ~]# iscsiadm -m session

tcp: [1] 10.1.0.10:3260,1 iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001

tcp: [2] 10.1.0.10:3260,1 iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001
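Notice that one login produced two sessions, which lines up with the “Max sessions per volume slice” value of 2 reported in step 14.  If you want a script to confirm that, here is a sketch that counts sessions for a given target.  The function name is made up; the parsing relies only on the session line layout shown above.

```shell
#!/bin/bash
# Read "iscsiadm -m session" output on stdin and print how many sessions
# are open to the given target IQN (the fourth field on each line).
session_count() {
    awk -v iqn="$1" '$4 == iqn { n++ } END { print n + 0 }'
}

# Usage on the VM:
#   iscsiadm -m session | session_count iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001
```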

 

VERIFICATION:  Check connection sessions via /dev/mapper.  This is going to give you the string you will need to use making and mounting the filesystem.

[root@testvm ~]# ls -la /dev/mapper

 

 

VERIFICATION:  Check connection sessions via ehcmcli -d

[root@testvm ~]# ehcmcli -d

 

18.  Make a new file system from the dm-switch name.  Replace the volume string shown here with yours.  If this is an existing volume that has been used before (from a snapshot, or another machine) there is no need to perform this step.  The documentation shows this step without the “-j” switch, which will format it as a non-journaled ext2 file system.  The -j switch will format it as an ext3 file system.

mke2fs -j -v /dev/mapper/eql-0-8a0906-451da1609-2660013c7c34e45d-nfs001
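The device name follows a pattern you can see by comparing it to the IQN from step 17: it is “eql-” followed by everything after the colon in the IQN.  A sketch of that mapping, based only on the naming seen in this example (I have not confirmed it as a documented rule, and the function name is mine):

```shell
#!/bin/bash
# Derive the /dev/mapper device name for a volume from its IQN, following
# the naming seen in this example: "eql-" plus everything after the
# first colon in the IQN.
eql_device_for_iqn() {
    echo "/dev/mapper/eql-${1#*:}"
}

# Usage:
#   eql_device_for_iqn iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001
```

Always confirm the result against the actual listing in /dev/mapper before formatting anything.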

 

19.  Mount the device to a directory

[root@testvm mnt]# mount /dev/mapper/eql-0-8a0906-451da1609-2660013c7c34e45d-nfs001 /mnt/eql2tbnfs001

 

20.  Establish iSCSI connection automatically

[root@testvm ~]# iscsiadm -m node -T iqn.2001-05.com.equallogic:0-8a0906-451da1609-2660013c7c34e45d-nfs001 -I eql.eth1_0 -o update -n node.startup -v automatic

 

21.  Mount volume automatically

Change /etc/fstab, adding the following:

/dev/mapper/eql-0-8a0906-451da1609-2660013c7c34e45d-nfs001 /mnt/eql2tbnfs001 ext3 _netdev  0 0

Restart system to verify automatic connection and mounting.
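After the restart, a quick scripted check can confirm the volume came back.  This sketch looks for the mount point in the kernel’s mount table; the function name is hypothetical, and the optional second argument exists only so the check can be pointed at a sample file.

```shell
#!/bin/bash
# Succeed if the given mount point appears in a mounts table.
# The table defaults to /proc/mounts.
is_mounted() {
    local mount_point=$1 table=${2:-/proc/mounts}
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$table"
}

# Usage on the VM after the reboot:
#   is_mounted /mnt/eql2tbnfs001 && echo "data volume is mounted"
```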

 

Working with guest attached volumes
After you have things configured and operational, you’ll see how flexible guest iSCSI volumes are to work with.

  • Do you want to temporarily mount a snapshot to this same VM or another VM? Just turn the snapshot online, and make a connection inside the VM.
  • Do you need to archive your data volume to tape, but do not want to interfere with your production system? Mount a recent snapshot of the volume to another system, and perform the backup there.
  • Do you want to do a major update to that front end server presenting the data? Just build up a new VM, connect the new VM to that existing data volume, and change your DNS aliasing, (which you really should be using) and you’re done.
  • Do you need to analyze the I/O of the guest attached volumes? Just use SANHQ. You can easily see if that data should be living on some super fast pool of SAS drives, or a pool of PS4000e arrays.  You’ll be able to make better purchasing decisions because of this.

So, how did it measure up?

The good…
Right out of the gate, I noticed a few really great things about the HIT for Linux.

  • The prerequisites and installation.  No compiling or other unnecessary steps.  The installation package installed clean with no fuss.  That doesn’t happen every day.
  • Eqltune.  This little utility is magic.  Talk about reducing overhead in preparing a system for MPIO and all things related to guest based iSCSI volumes.  It gave me a complete set of adjustments to make, divided into 3 simple categories.  After I made the adjustments, I re-ran the utility, everything checked out okay.  Actually, all of the command line tools were extremely helpful.  Bravo!
  • One really impressive trait of the HIT/LE is how it handles the iSCSI sessions for you. Session build up and teardown is all taken care of by the HIT for Linux.

The not so good…
Almost as fast as the good shows up, you’ll notice a few limitations

  • Version 1.0 is only officially supported on RedHat Enterprise Linux (RHEL) 5.5 and 6.0 (no 6.1 as of this writing).  This might be news to Dell, but Debian based systems like Ubuntu are running in enterprises everywhere for their cost, solid behavior, and minimalist approach.  RedHat clones dominate much of the market; some commercial, and some free.  Personally, I find upstream distributions such as Fedora sketchy, and prone to breakage with each release (Note to Dell, I don’t blame you for not supporting these.  I wouldn’t either).  Other distributions are quirky for their own reasons of “improvement” and I can understand why these weren’t initially supported either.  A safer approach for Dell (and the more flexible approach for the customer) would be to 1.) Get out a version for Ubuntu as fast as possible, and 2.)  Extend the support of this version to RedHat’s downstream, 100% binary compatible, very conservative distribution, CentOS.  For you Linux newbies, think of CentOS as being the RedHat installation but with the proprietary components stripped out, and nothing else added.  While my first production Linux server running the HIT is RedHat 5.5, all of my testing and early deployment occurred on a CentOS 5.5 distribution, and it worked perfectly.
  • No AutoSnapshot Manager (ASM) or equivalent.  I rely on ASM/ME on my Windows VM’s with guest attached volumes to provide me with a few key capabilities.  1.)  A mechanism to protect the volumes via snapshots and replicas.  2.)  Coordinating applications and I/O so that I/O is flushed properly.  Now, Linux does not have any built-in facility like Microsoft’s Volume Shadow Copy Services (VSS), so Dell can’t do much about that.  But perhaps some simple script templates might give the users ideas on how to flush and pause I/O of the guest attached volumes for snapshots.  Just having a utility to create Smart Copies or mount them would be pretty nice.

The forgotten…
A few things overlooked?  Yep.

  • I was initially encouraged by the looks of the documentation.  However, in order to come up with the above, I had to piece together information from a number of different resources.  Syntax and capitalization errors will kill you in a Linux shell environment.  Some of those inconsistencies and omissions showed up.  With a little triangulation, I was able to get things running correctly, but it quickly became a frustrating, time consuming exercise that I felt like I’d been through before.  Hopefully the information provided here will help.
  • Somewhat related to the documentation issue is something that has come up with a few of the other EqualLogic tools: customers often don’t understand WHY one might want to use the tool.  Same thing goes with the HIT for Linux.  Nobody even gets to the “how” if they don’t understand the “why”.  But, I’m encouraged by the great work the Dell TechCenter has been doing with their white papers and videos.  It has become a great source for current information, and is moving in the right direction on customer education.

Summary
I’m generally encouraged by what I see, and am hoping that Dell EqualLogic takes on the design cues of the HIT/ME to employ features like AutoSnapshot Manager, and an equivalent to eqlxcp (EqualLogic’s offloaded file copy command in Windows).  The HIT for Linux helped me achieve exactly what I was trying to accomplish.  The foundation for another easy to use tool in the EqualLogic line up is certainly there, and I’m looking forward to how this can improve.

Helpful resources
Configuring and Deploying the Dell EqualLogic Host Integration Toolkit for Linux
http://en.community.dell.com/dell-groups/dtcmedia/m/mediagallery/19861419/download.aspx

Host Integration Tools for Linux – Installation and User Guide
https://www.equallogic.com/support/download_file.aspx?id=1046 (login required)

Getting more IOPS on workloads running RHEL and EQL HIT for Linux
http://en.community.dell.com/dell-blogs/enterprise/b/tech-center/archive/2011/08/17/getting-more-iops-on-your-oracle-workloads-running-on-red-hat-enterprise-linux-and-dell-equallogic-with-eql-hitkit.aspx 

RHEL5.x iSCSI configuration (Not originally authored by Dell, nor specific to EqualLogic)
http://www.equallogic.com/resourcecenter/assetview.aspx?id=8727 

User’s experience trying to use the HIT on RHEL 6.1, along with some other follies
http://www.linux.com/community/blogs/configuring-dell-equallogic-ps6500-array-to-work-with-redhat-linux-6-el.html 

Dell TechCenter website
http://DellTechCenter.com/ 

Dell TechCenter twitter handle
@DellTechCenter

June 26, 2011

Reworking my PowerConnect 6200 switches for my iSCSI SAN

Filed under: virtualization — ITforMe @ 11:53 pm

It sure is easy these days to get spoiled with the flexibility of virtualization and shared storage.  Optimization, maintenance, fail-over, and other adjustments are so much easier than they used to be.  However, there is an occasional reminder that some things are still difficult to change.  For me, that reminder was my switches I use for my SAN.

One of the many themes I kept hearing at this year’s Dell Storage Forum (a great experience I must say) throughout several of the breakout sessions I went to was “get your SAN switches configured correctly.”  A nice reminder of something I was all too aware of already: my Dell PowerConnect 6224 switches had not been configured correctly since the day they replaced my slightly less capable (but rock solid) PowerConnect 5424’s.  I returned from the forum committed to getting my switchgear updated and configured the correct way.  Now for the tough parts…  What does “correct” really mean when it comes to the 6200 series switches?  And why didn’t I take care of this a long time ago?  Here are just a few excuses... er, reasons.

  • At the time of initial deployment, I had difficulty tracking down documentation written specifically for the 6224’s to be configured with iSCSI.  Eventually, I did my best to interpret the configuration settings of the 5424’s, and apply the same principles to the 6224’s.  Unfortunately, the 6224’s are a different animal than the 5424’s, and that showed up after I placed them into production – a task that I regretfully rushed.
  • When I deployed them into production, the current firmware was the 2.x generation.  It was my understanding after the deployment that the 2.x firmware on the 6200 series definitely had growing pains.  I also had the unfortunate timing that the next major revision came out shortly after I put them into production.
  • I had two stacked 6224 switches running my production SAN environment (a setup that was quite common for those I asked at the Dell Storage Forum). While experimenting with settings might be fun in a lab, it is no fun, and serious business when they are running a production environment. I wanted to make adjustments just once, but had difficulty confirming settings.
  • When firmware needs to be updated (a conclusion to an issue I was reporting to Technical Support), it is going to take down the entire stack.  This means that you’d better have everything that uses the SAN off unless you like living dangerously.  Major firmware updates will also require the boot code in each switch to be updated.  A true “lights out” maintenance window that required everything to be shut down.  The humble little 5424’s LAGd together didn’t have that problem.
  • The 2.x to 3.x firmware update also required the boot code to be updated.  However, you simply couldn’t run an “update bootcode” command.  The documentation made this very clear.  The PowerConnect Technical Support Team indicated that the two versions ran different algorithms to unpack the contents, which was the reason for yet another exception to the upgrade process. 

One of the many best practices recommended at the Forum was to stack the switches instead of LAGing them.  Stack, stack, stack was drilled into everyone’s head.  The reasons are very good, and make a lot of sense.

  • Stacking modules in many ways extend the circuitry of a single switch, so traffic between stack members doesn’t have to honor, or be limited by, traditional Ethernet.
  • Managing one switch manages them all.
  • Better, more scalable bandwidth between switches
  • No messing around with LAG’s

But here lies the conundrum of many Administrators who are responsible for production environments.  While stacked 6224’s offer redundancy against hardware failure, they offer no redundancy when it comes to maintenance.  These stacked switches are seen as one logical unit, and may be your weakest link when it comes to maintenance of your virtualized infrastructure.  Interestingly enough, when inquiring further on effective strategies for updating under this topology, I observed two things: many other users were stuck with this very same dilemma, and the answers provided weren’t too exciting.  There were generally three answers I heard for this design decision:

  • Plan for a “lights out” maintenance window.
  • Buy another set of two switches, stack those, then trunk the two together via 10GbE.
  • Buy better switches. 

See why I wasn’t too excited about my options?

Decision time.  I knew I’d suffer a bit of downtime updating the firmware and revamping the configuration no matter what I did.  Do I stack them as recommended, only to be faced with the same dilemma on the next firmware upgrade?  Or do I LAG the switches together so that I avoid this upgrade fiasco in the future?  LAG’ing is not perfect either, and the more arrays I add (as well as the inter-array traffic increasing with new array features), the more it might compound some of the limitations of LAGs. 

What option won out?  I decided to give stacking ONE more try.  I had to keep my eye on my primary objective: correcting my configuration by way of a firmware upgrade, and building up a simple, pristine configuration from scratch.  The idea was that the configuration would initially contain the minimum set of modifications to get the switches working according to best practices.  Then, I could build off of the configuration in the future.  Also influencing my decision was finding out that recommended settings with LAGs apparently change frequently.  For instance, just recently, the recommended setting for flow control for the port channel in a LAG was changed.  These are the types of things I wanted to stay away from.  But with that said, I will continue to keep the option open to LAGing them, for the sole reason that it offers the flexibility for maintenance without shutting down your entire cluster.

So here were my minimum desired results for the switch stack after the upgrade and reconfiguration.  Pretty straightforward. 

  • Management traffic on another VLAN (VLAN 10) on port 1 (for uplinking) and port 2 (for local access).
  • iSCSI traffic on its own VLAN (VLAN 100), on all ports not including the management ports.
  • Essentially no traffic on the Default VLAN
  • Recommended global and port specific settings (flow control, spanning tree, jumbo frames, etc.) for iSCSI traffic endpoint connections
  • iSCSI traffic that was available to be routed through my firewall (for replication).

My configuration rework assumed the successful boot code and firmware upgrade to version 3.2.1.3.  I pondered a few different ways to speed this process up, but ultimately just followed the very good steps provided with the documentation for the firmware.  They were clear, and accurate.

By the way, on June 20th, 2011, Dell released their very latest firmware update (thank you RSS feed) to 3.2.1.3 A23.  This now includes their “Auto Detection” of ports for iSCSI traffic.  Even though the name implies a feature that might be helpful, the documentation did not provide enough information about it, and I decided to manually configure as originally planned.

For those who might be in the same boat as I was, here are the exact steps I followed for building up a pristine configuration after updating the firmware and boot code.  The configuration below was definitely a combined effort by the folks from the EqualLogic and PowerConnect Teams, and me poring over a healthy amount of documentation.  It was my hope that this combined effort would eliminate some of the contradictory information I found in previous best practices articles, forum threads, and KB articles that assumed earlier firmware.  I’d like to thank them for being tolerant of my attention to detail, and of my desire to get this right the first time.  You’ll see that the rebuild steps are very simple.  Getting confirmation on this was not.

Step 1:  Reset the switch to defaults (make a backup of your old config, just in case)
enable
delete startup-config
reload

 
Step 2:  When prompted, follow the setup wizard in order to establish your management IP, etc. 
 
Step 3:  Put the switch into admin and configuration mode.
enable
configure

 
Step 4:  Establish Management Settings
hostname [yourstackhostname]
enable password [yourenablepassword]
spanning-tree mode rstp
flowcontrol

 
Step 5: Add the appropriate VLAN IDs to the database and setup interfaces.
vlan database
vlan 10
vlan 100
exit
interface vlan 1
exit
interface vlan 10
name Management
exit
interface vlan 100
name iSCSI
exit
ip address vlan 10
 
Step 6: Create an Etherchannel Group for Management Uplink
interface port-channel 1
switchport mode access
switchport access vlan 10
exit
NOTE: Because the switches are stacked, port one on each switch will be configured in this channel-group, which can then be connected to your core switch or an intermediate switch for management access. Port two on each switch can be used when you need to plug a laptop into the management VLAN, etc.
 
Step 7: Configure/assign Port 1 as part of the management channel-group:
interface ethernet 1/g1
switchport access vlan 10
channel-group 1 mode auto
exit
interface ethernet 2/g1
switchport access vlan 10
channel-group 1 mode auto
exit
 
Step 8: Configure Port 2 as Management Access Switchports (not part of the channel-group):
interface ethernet 1/g2
switchport access vlan 10
exit
interface ethernet 2/g2
switchport access vlan 10
exit
 
Step 9: Configure Ports 3-24 as iSCSI access Switchports
interface range ethernet 1/g3-1/g24
switchport access vlan 100
no storm-control unicast
spanning-tree portfast
mtu 9216
exit
interface range ethernet 2/g3-2/g24
switchport access vlan 100
no storm-control unicast
spanning-tree portfast
mtu 9216
exit
NOTE:  Binding the xg1 and xg2 interfaces into a port-channel is not required for stacking. 
 
Step 10: Exit from Configuration Mode
exit
 
Step 11: Save the configuration!
copy running-config startup-config

Step 12: Back up the configuration
console#copy startup-config tftp://[yourTFTPip]/conf.cfg
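
Steps 7 through 9 repeat the same stanza for each stack member, which is an easy place to introduce a typo when working by hand.  As a minimal sketch (this is my own illustration, not something from the Dell documentation), a few lines of Python can print the Step 9 commands for each unit so that both stay identical.  The function name and its parameters are hypothetical; the CLI lines it emits are simply the ones listed in Step 9 above.

```python
def iscsi_port_config(units=(1, 2), first_port=3, last_port=24, vlan=100, mtu=9216):
    """Return the Step 9 CLI lines for the iSCSI ports on each stack unit."""
    lines = []
    for unit in units:
        # One "interface range" stanza per stack member, as in Step 9.
        lines.append(f"interface range ethernet {unit}/g{first_port}-{unit}/g{last_port}")
        lines.append(f"switchport access vlan {vlan}")
        lines.append("no storm-control unicast")
        lines.append("spanning-tree portfast")
        lines.append(f"mtu {mtu}")
        lines.append("exit")
    return lines


if __name__ == "__main__":
    # Print the stanzas so they can be reviewed before pasting into the console.
    print("\n".join(iscsi_port_config()))
```

Review the generated text against the steps above before pasting anything into a production stack.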

In hindsight, the most time consuming aspect of all of this was trying to confirm the exact settings for the 6224’s in an iSCSI SAN.  Running in second was shutting down all of my VMs, ESX hosts, and anything else that connected to the SAN switchgear.  The upgrade and the rebuild were relatively quick and trouble-free.  I’m thrilled to have this behind me now, and I hope that by passing this information along, you too will have a very simple working example to build your configuration off of.  As for the 6224’s, they are working fine now.  I will continue to keep my fingers crossed that Dell will eventually provide a way to update firmware to a stacked set of 6200 series switches without a lights out maintenance window.

May 17, 2011

Software that helps make life in IT a little easier

Filed under: Applications — ITforMe @ 3:32 am

 

In IT, rarely is one truly developing something from the ground up.  In many ways, IT is about making solutions work – disjointed as they may be.  Large enterprise class solutions such as Email and messaging platforms, Content Management Systems, CRM’s, Directory Services, and Security Solutions all are massively complex -  even if they are well designed.  Those of us who are faced with the responsibility to “make it work” must possess the knack to be a deep-dive expert on any number of subjects, while having the big picture perspective of the IT Generalist.  It can be a complex mix of factors that determine how well solutions end up working out.  It’s usually an assorted mix of experience, technical and organizational skillsets, ingenuity, a lot of hard work, and a little bit of luck.  This is how the seasoned IT veteran separates themselves from those less experienced. 

Then, every once in a while a piece of software comes along to make your life in IT easier.  Software that helps bridge the gaps that may exist in cross platform integration, connectivity, management, monitoring, or procedural tasks.  These are applications that don’t make deploying or managing complex systems easy.  They just make it a little easier.  Sometimes you stumble upon helpful applications like these almost by accident, as I have.  Others you knew of, but just never got around to trying out.  So I thought I’d take a brief time-out from my recent focus on all things related to Virtualization, and take a moment to share a few of those applications that are currently making my life in IT a little easier.  Some of these listed below are worthy of their own posts, which I hope to get around to.  It is a list that is neither complete, nor appropriate for every environment, and the importance of each really depends on how much you need it.  Only time will tell which solutions become obsolete, and which ones stand up over time.

Scribe Insight
This may be the best product you’ve never heard of.  If you ever need to transform, manipulate, or convert data from disparate systems, this is the product for you.  No, it’s not a “utility” but an enterprise class solution that demands a commitment in time to learn.  The results are stunning.  Data sources that had no earthly intention of being able to talk to another system can share the same data.   Example:  Your Sales Department uses a CRM running on SQL, but an ERP or Finance system runs on Oracle, and you need those records to interact on a transaction by transaction basis.  Scribe can do that, and much more.  Are those systems running on separate networks?  No problem.  Scribe simplifies the communication channels between autonomous systems.  It can insulate you from the complexity of convoluted database tables, and in some cases will completely eliminate the need for you to use an application’s SDK for data integration.  Database Administrators would love this tool, but its power extends well beyond just database integration.  It’s a true gem.

Tree Size Pro
You have a choice. Spend weeks and weeks trying to get PowerShell or vb scripts to analyze and manipulate your large flat-file storage contents, or spend a few bucks for Tree Size Pro.  This product delivers.  I’ve used it to generate reports on storage usage, and to automate flat file storage cleanup tasks.  When I think about what it would have taken to do it programmatically, I’d still be working on it.

OneNote
I’ve written about OneNote before, and how it can be utilized in IT.  Since that time, I’ve learned how to exploit it even more, and it goes with me everywhere.  It could be 10 times the price it is, and I’d still pay for it myself if needed.  It’s the pocket knife that should be in every Administrator’s tool chest.  The larger your team, the better it works.  Design documentation, troubleshooting active issues, project planning, research, etc.  It will help you become a better Administrator. 

Likewise 
This software allows Unix, Linux, and Mac systems to authenticate against Active Directory.  It also allows for centralized management of these systems using Group Policy Objects in the same way you manage your Windows machines.  I was one of their first customers, and have been thrilled to see it mature over the years.  Their Open Source edition is OEM integrated into Linux Distributions such as Ubuntu and SUSE, and into other products like VMware vSphere.  The free/Open Source edition allows you to join these systems to AD, while the commercial edition allows for centralized management.

PuTTY
If you need a solid Windows based SSH client to connect to your Linux systems, this is it.  One version (.56b) also supports the “Generic Security Services API” or GSSAPI.  This means that if your Linux machines are domain joined using Likewise, you can leverage Active Directory to log in to that Linux system, inheriting your credentials so that it is all passwordless.  Included with it is “plink” which gives you the ability to run a *nix command remotely from the Windows system.  Great for routines initiated from a Windows workstation.  “Pscp” is the PuTTY SCP client for getting files to and from that connected *nix system.

CionSystems AD Change Notifier
One of the interesting aspects of Active Directory is that there are object changes all the time, but as an Administrator, you have no way of knowing it. AD Change Notifier helps with that.  Simple, yet effective.  It sends you an email notification of object changes in AD.  You can select whether you want all types of changes (modifies, creates, deletes), as well as particular object types (users, machines, OUs, GPOs, etc.). You learn a little about how objects change in AD, and if you delegate AD responsibility, how and what is being changed in AD.

Wyse Pocket Cloud for the iPhone and iPad
Not unique in its purpose, but this RDP (and optionally PCoIP) client for the iPhone and iPad does what it’s supposed to do flawlessly.  Any app that can let you reboot a critical server from the golf course is good in my book.  Any app that lets you do that on the golf course, in front of the VP of the company is even better. (True story)

Acronis
Long before the wonders of virtualization, there were byte-level disk imaging solutions to help you with your system protection and recovery needs.  This was like magic at the time, especially as it was becoming obvious file based backups of system partitions were never any good in the first place.  While it may not be needed in the Enterprise like it once was, there are still a few good use cases for it.  It’s also pretty handy to have on your home system, and every one of your neighbors’ home systems.  …Or the ones that know you’re in IT, and think you are their personalized technical support. 

CionSystems AD Self Service
Yet another tool from CionSystems.  It takes the burden off of IT for user account related activities.  Does the user need to change their cell phone number or their home address?  Does a Department Manager need to change the Title of someone’s position?  AD Self Service can do this, without ever giving these end users privileges.  Updating AD related attributes is especially important if you use other solutions that leverage AD information (Exchange, SharePoint, CRM, etc.).  AD Self Service also allows for a secure way for the user to unlock their locked out account.  The more users you manage, the more this product will help take the burden off of IT.

SolarWinds Subnet Calculator
Some networking purists would flog me on the side of the head for recommending such a cheater app.  But the fact is, I need a quick and easy way to review subnetting options in order to make the right decision.  I can subnet manually much like I can do arithmetic manually.  I just choose not to.  I have other projects to allocate my time to, and I need the speed of a calculator to help me visit those options more quickly.  Subnet calculators like SolarWinds offer one other ability often overlooked; the ability to visualize the sizing of your subnetting.  You can create problems by making subnets too small, or too large.  Tools like this give a great visual representation of how you want to split networks.  It doesn’t excuse the requirement that every Administrator should fully understand how subnetting works.  (I still marvel at how brilliant IP subnetting is).  It’s that once they do, an Administrator should be able to use a tool to make it easier and faster for them to make the correct decision.
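
For those who would rather script it than click through it, Python’s standard `ipaddress` module can do the same kind of what-if sizing.  This is only a sketch of the idea, not a replacement for the tool; the 192.168.10.0/24 network and the /26 split are made-up example values, and the function name is my own.

```python
import ipaddress


def split_network(cidr, new_prefix):
    """List each subnet of `cidr` at `new_prefix` along with its usable host count."""
    network = ipaddress.ip_network(cidr)
    result = []
    for subnet in network.subnets(new_prefix=new_prefix):
        # Subtract the network and broadcast addresses (valid for prefixes up to /30).
        usable = subnet.num_addresses - 2
        result.append((str(subnet), usable))
    return result


if __name__ == "__main__":
    # Example values only: split a /24 into four /26 networks.
    for subnet, usable in split_network("192.168.10.0/24", 26):
        print(f"{subnet:20} {usable} usable hosts")
```

Changing the second argument is a quick way to see how many hosts each candidate subnet size leaves you.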

FileZilla
For as long as FTP has been around, and ubiquitous as it may seem, one might conclude that it all works the same.  Not true.  FTP Servers will have their own unique behaviors, just as FTP clients will have their own quirks.  The firewalls that the FTP traffic passes through add another variable that can frustrate end users and Administrators alike.  FileZilla seems to offer the most flexibility when working with remote FTP servers, and is what I use to handle a variety of different FTP needs.  FileZilla won’t eliminate inherent complexities with the FTP protocol as it traverses multiple networks; it just makes it easier to negotiate.

Enjoy!

March 16, 2011

Zero to 32 Terabytes in 30 minutes. My new EqualLogic PS4000e

Filed under: Architecture/planning,Protection,Replication,virtualization — ITforMe @ 4:15 am

Rack it up, plug it in, and away you go.  Those are basically the steps needed to expand a storage pool by adding another PS array using the Dell/EqualLogic architecture.  A few weeks ago I took delivery of a new PS4000e to complement my PS6000e at my primary site.  The purpose of this additional array was really simple.  We needed raw storage capacity.  My initial proposal and deployment of my virtualized infrastructure a few years ago was a good one, but I deliberately did not include our big flat-file storage servers in this initial scope of storage space requirements.  There was plenty to keep me occupied between the initial deployment, and now.  It allowed me to get most of my infrastructure virtualized, and gave a chance for buy-in to the skeptics who thought all of this new-fangled technology was too good to be true.  Since that time, storage prices have fallen, and larger drive sizes have become available.  Delaying the purchase aligned well with “just-in-time” purchasing principles, and also gave me an opportunity to address the storage issue in the correct way.   At first, I thought all of this was a subject matter not worthy of writing about.  After all, EqualLogic makes it easy to add storage.  But that only addresses part of the problem.  Many of you face the same dilemma regardless of what your storage solution is; user facing storage growth.

Before I start rambling about my dilemma, let me clarify what I mean by a few terms I’ll be using; “user facing storage” and “non user facing storage.” 

  • User Facing Storage is simply the storage that is presented to end users via file shares (in Windows) and NFS mounts (in Linux).  User facing storage is waiting there, ready to be sucked up by an overzealous end user. 
  • Non User Facing Storage is the storage occupied by the servers themselves, and the services they provide.  Most end users generally have no idea how much space a server reserves for say, SQL databases or transaction logs (nor should they!)  Non user facing storage is easier to plan for and manage because it is only exposed to system administrators. 

Which array…

I decided to go with the PS4000e because of the value it returns, and how it addresses my specific need.  If I had targeted VDI or some storage for other I/O intensive services, I would have opted for one of the other offerings in the EqualLogic lineup.  I virtualized the majority of my infrastructure on one PS6000e with sixteen 1TB drives in it, but it wasn’t capable of the raw capacity that we now needed to virtualize our flat-file storage.  While the effective number of 1GbE ports is cut in half on the PS4000e as compared to the PS6000e, I have not been able to gather any usage statistics against my traditional storage servers that suggest the throughput of the PS4000e will not be sufficient.  The PS4000e allowed me to trim a few dollars off of my budget line estimates, and may work well at our CoLo facility if we ever need to demote it.

I chose to create a storage pool so that I could keep my volumes that require higher performance on the PS6000, and have the dedicated storage volumes on the PS4000.  I will do the same for when I eventually add other array types geared for specific roles, such as VDI.

Truth be told, we all know that sixteen 2 terabyte drives do not equal 32 Terabytes of real world space.  The RAID50 penalty knocks that down to about 21TB.  Cut that by about half for average snapshot reserves, and it’s more like 11TB.  Keeping a little bit of free pool space available is always a good idea, so let’s just say it effectively adds 10TB of full fledged enterprise class storage.  This adds to my effective storage space of 5TB on my PS6000.  Fantastic.  …but wait, one problem.  No, several problems.
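
That arithmetic is easy to sanity check in a few lines of Python.  This is a rough sketch under assumptions that are mine, not the vendor’s published numbers: two hot spares and two parity drives out of the sixteen, a 50% snapshot reserve, and the usual conversion from the decimal terabytes on the drive label to the binary terabytes most systems report.

```python
TB = 10**12   # drive vendors count in decimal terabytes
TIB = 2**40   # most systems report binary terabytes


def usable_capacity(drives, drive_tb, spares, parity_drives, snapshot_reserve):
    """Estimate usable space (binary TB) after RAID overhead and after snapshot reserve."""
    data_drives = drives - spares - parity_drives
    after_raid = data_drives * drive_tb * TB / TIB
    after_snapshots = after_raid * (1 - snapshot_reserve)
    return after_raid, after_snapshots


if __name__ == "__main__":
    # Assumed layout: 2 hot spares and 2 parity drives out of 16; 50% snapshot reserve.
    raid, net = usable_capacity(16, 2, spares=2, parity_drives=2, snapshot_reserve=0.5)
    print(f"After RAID 50: {raid:.1f} TB, after snapshot reserve: {net:.1f} TB")
```

With those assumptions the results land close to the 21TB and 11TB figures above; change the spare count or the reserve percentage to match your own group.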

The Dilemma

Turning up the new array was the easy part.  In less than 30 minutes, I had it mounted, turned on, and configured to work with my existing storage group.  Now for the hard part; figuring out how to utilize the space in the most efficient way.  User facing storage is a wildcard; do it wrong and you’ll pay for it later.  While I didn’t know the answer, I did know some things that would help me come to an educated decision.

  • If I migrate all of the data on my remaining physical storage servers (two of them, one Linux, and one Windows) over to my SAN, it will consume virtually all of my newly acquired storage space.
  • If I add a large amount of user-facing storage, and present that to end users, it will get sucked up like a vacuum.
  • If I blindly add large amounts of great storage at the primary site without careful thought, I will not have enough storage at the offsite facility to replicate to.
  • Large volumes (2TB or larger) not only run into technical limitations, but are difficult to manage.  At that size, there may also be a co-mingling of data that is not necessarily business critical.  Doling out user facing storage in large volumes is easy to do.  It will come back to bite you later on.
  • Manipulating the old data in the same volume as new data does not bode well for replication and snapshots, which look at block changes.  Breaking them into separate volumes is more effective.
  • Users will not take the time or the effort to clean up old data.
  • If data retention policies are in place, users will generally be okay with it after a substantial amount of complaining. It’s not too different than the complaining you might hear when there are no data retention policies, but you have no space.  Pick your poison.
  • Your users will not understand data retention policies if you do not understand them.  Time for a plan.

I needed a way to compartmentalize some of the data so that it could be identified as “less important” and then perhaps live on less important storage.  “Less important storage” could mean a part of the SAN that is not replicated or, in a worst case scenario, some old decommissioned physical servers, where the data resides for a defined amount of time before it is permanently archived and removed from the probationary location.

The Solution (for now)

Data Lifecycle management.  For many this means some really expensive commercial package.  This might be the way to go for you too.  To me, this is really nothing more than determining what is important data, and what isn’t as important, and having a plan to help automate the demotion, or retirement of that data.  However, there is a fundamental problem of this approach.  Who decides what’s important?  What are the thresholds?  Last accessed time?  Last modified time?  What are the ramifications of cherry-picking files from a directory structure because they exceed policy thresholds?  What is this going to break?  How easy is it to recover data that has been demoted?  There are a few steps that I need to do to accomplish this. 

1.  Poor man’s storage tiering.  If you are out of SAN space, re-provision an old server.  The purpose of this will be to serve up volumes that can be linked to the primary storage location through symbolic links.  These volumes can then be backed up at a less frequent interval, as it would be considered less important.  If you eventually have enough SAN storage space, these could be easily moved onto the SAN, but in a less critical role, or on a SAN array that has larger, slower disks.

2.  Breaking up large volumes.  I’m convinced that giant volumes do nothing for you when it comes to understanding and managing the contents.  Turning larger blobs into smaller blobs also serves another very important role.  It allows the intelligence of the EqualLogic solutions to do their work on where the data should live in a collection of arrays.  A storage group that consists of say, an SSD based array, a PS6000, and a PS4000 can effectively store the volumes in the array that best suits the demand.

3.  Automating the process.  This will come in two parts; a.) deciding on structure, policies, etc. and b.) making or using tools to move the files from one location to another.  On the Linux side, this could mean anything from a bash script to something written in Python.  Then use cron to schedule the occurrence.  In Windows, you could leverage PowerShell, vbscript, or batch files.  This would be as simple, or as complex as your needs require.  However, if you are like me, you have limited time to tinker with scripting.  If there is something turn-key that does the job, go for it.  For me, that is an affordable little utility called “TreeSize Pro.”  This gives you not only the ability to analyze the contents of NTFS volumes, but can easily automate the pruning of this data to another location.

4.  Monitoring the result.  This one is easy to overlook, but you will need to monitor the fruits of your labor, and make sure it is doing what it should be doing; maintaining available storage space on critical storage devices.  There are a handful of nice scripts that have been written for both platforms that help you monitor free storage space at the server level.
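
For the Linux side, steps 1, 3, and 4 can be sketched in Python along these lines.  Treat it as a starting point under stated assumptions rather than a finished tool: the function names are mine, it uses last modified time as the only threshold, it assumes the archive location sits outside the primary directory tree, and it has none of the safety checks you would want before letting cron run it against real user data.

```python
import os
import shutil
import time


def demote_old_files(source_dir, archive_dir, max_age_days, now=None):
    """Move files not modified within max_age_days to archive_dir, leaving symlinks behind."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    moved = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # already demoted on an earlier run
            if os.path.getmtime(path) >= cutoff:
                continue  # recent enough to stay on primary storage
            relative = os.path.relpath(path, source_dir)
            target = os.path.join(archive_dir, relative)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.move(path, target)
            os.symlink(os.path.abspath(target), path)  # keep the old path working
            moved.append(relative)
    return moved


def free_fraction(path):
    """Return the fraction of free space on the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total
```

A cron job could call `demote_old_files` nightly and send an alert when `free_fraction` on the primary volume drops below whatever threshold you are comfortable with.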

The result

The illustration below helps demonstrate how this would work. 

image

As seen below, once a system is established to automatically move and house demoted data, you can more effectively use storage on the SAN.

image

Separation anxiety…

In order to make this work, you will have to work hard at making sure that all of this is pretty transparent to the end user.  If you have data that has complex external references, you would want to preserve the integrity of the data that relies on those dependent files.  Hey, I never said this was going to be easy. 

A few things worth remembering…

If 17 years in IT, and a little observation of human nature has taught me one thing, it is that we all undervalue our current data, and overvalue our old data.  You see it time and time again.  Storage runs out, and there are cries for running down to the local box store and picking up some $99 hard drives.  What needs to reside on there is mission critical (hence the undervaluing of the new data).  Conversely, efforts to have users clean up old data from 10+ years ago had users hiding files in special locations, even though it was recorded that the data had not been modified, or even accessed in 4+ years.  All of this of course lives on enterprise class storage.  An all too common example of overvaluing old data.

Tip.  Remember your Service Level Agreements.  It is common in IT to not only have SLAs for systems and data, but for one’s position.  These without doubt are tied to one another.  Make sure that one doesn’t compromise the other.  Stop gap measures to accommodate more storage will trigger desperate, affordable solutions.  (e.g. adding cheap non-redundant drives in an old server somewhere).  Don’t do it!  All of those arm-chair administrators in your organization will be nowhere to be found when those drives fail, and you are left to clean up the mess.

Tip.  Don’t ever thin provision user facing storage.  Fortunately, I was lucky to be clued into this early on, but I could only imagine the well intentioned administrator who wanted to present a nice amount of storage space to the user, only to find it sucked up a few days later.  Save the thin provisioning for non user facing storage (servers with SQL databases and transaction logs, etc.)

Tip.  If you are presenting proposals to management, or general information updates to users, I would suggest quoting only the amount of effective, usable space that will be added.  In other words, don’t say you are adding 32TB to your storage infrastructure, when in fact, it is closer to 10TB.  Say that it is 10TB of extremely sophisticated, redundant enterprise class storage that you can “bet the business” on.  Its scalability, flexibility, and robustness are needed for the 24/7 environments we insist upon.  It will just make it easier that way.

Tip.  It may seem unnecessary to you, but continue to show off snapshots, replication, and other unique aspects of SAN storage, if you still have those who doubt the power of this kind of technology – especially when they see the cost per TB.  Repeat to them how long (if even possible) it would take to protect that same data under traditional storage.  Do everything you can to help those who approve these purchases.  More than likely, they won’t be as impressed by say, how quick a snapshot is, but rather, shocked how traditional storage can’t be protected very well.

You may have noticed I do not have any rock-solid answers for managing the growth and sustainability of user facing data.  Situations vary, but the factors that help determine that path for a solution are quite similar.  Whether you decide on a turn-key solution, or choose to demonstrate a little ingenuity in times of tight budgets, the topic is one that you will probably have to face at some point.

 

January 28, 2011

How I use Dell/EqualLogic’s SANHQ in my environment

Filed under: Replication,virtualization — ITforMe @ 3:45 am

 

One of the benefits of investing in Dell/EqualLogic’s SAN solutions is the number of great tools included with the product, at no extra charge.  I’ve written in the past about leveraging their AutoSnapshot Manager for VM and application consistent snapshots and replicas.  Another tool that deserves a few words is SAN HeadQuarters (SANHQ). 

SANHQ allows for real-time and historical analysis of your EqualLogic arrays.  Many EqualLogic users are well versed with this tool, and may not find anything here that they didn’t already know.  But I’m surprised to hear that many are not.  So, what better way to help those unfamiliar with SANHQ than to describe how it helps me with my environment.

While the tool itself is “optional” in the sense that you don’t need to deploy it to use the EqualLogic arrays, it is an easy (and free) way to expose the powers of your storage infrastructure.  If you want to see what your storage infrastructure is doing, do yourself a favor and run SANHQ.   

Starting up the application, you might find something like this:

image

You’ll find an interesting assortment of graphs, and charts that help you decipher what is going on with your storage.  Take a few minutes and do a little digging.  There are various ways that it can help you do your job better.

 

Monitoring

Sometimes good monitoring is downright annoying.  It’s like your alarm clock next to the bed; it’s difficult to overlook, but that’s the point.  SANHQ has proven to be an effective tool for proactive monitoring and alerting of my arrays.  While some of these warnings are never fun, its biggest value is that it can help prevent those larger, much more serious problems, which always seem to be a series of small issues thrown together.  Here are some examples of how it has acted as the canary in the coalmine for me in my environment.

  • When I had a high number of TCP retransmits after changing out my SAN Switchgear, it was SANHQ that told me something was wrong.  EqualLogic Support helped me determine that my new switchgear wasn’t handling jumbo frames correctly. 
  • When I had a network port go down on the SAN, it was SANHQ that alerted me via email.  A replacement network cable fixed the problem, and the alarm went away.
  • If replication across groups is unable to occur, you’ll get notified right away that replication isn’t running.  The reasons for this can be many, but SANHQ usually gives you the first sign that something is up.  This works across physical topologies where your target may be at another site.
  • Under maintenance scenarios, you might find the need to pause replication on a volume, or on the entire group.  SANHQ will do a nice job of reminding you that it’s still not replicating, and will bug you at a regular interval that it’s still not running.

 

Analysis and Planning

SANHQ will allow you to see performance data at the group level, by storage pools, volumes, or volume collections.  One of the first things I do when spinning up a VM that uses guest attached volumes, is to jump into SANHQ, and see how those guest attached volumes are running.  How are the average IOPS? What about Latencies and Queue depth?  All of those can be found  easily in SANHQ, and can help put your mind at ease if you are concerned about your new virtualized Exchange or SQL servers.  Here is a screenshot of a 7 day history for SQL server with guest attached volumes, driving our SharePoint backend services.

image

The same can be done of course for VMFS volumes.  This information will complement existing data one gathers from vCenter to understand if there are performance issues with a particular VMFS volume.

Often times monitoring and analysis isn’t about absolute numbers, but rather, allowing the user to see changes relative to previous conditions.  This is especially important for the IT generalist who doesn’t have time or the know-how for deep dive storage analysis, or have a dedicated Storage Administrator to analyze the data.  This is where the tool really shines.  For whatever type of data you are looking at, you can easily choose a timeline by the last hour, 8 hours, 1 day, 7 days, 30 days, etc.  The anomalies, if there are any, will stand out. 

image

Simply click on the Timeline that you want, and the historical data of the Group, member, volume, etc will show up below.

image
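The idea of judging numbers relative to their own history, rather than against absolute values, can be sketched in a few lines.  This is only an illustration of the concept, not how SANHQ works internally: the function name, the sample values, and the threshold factor are all my own assumptions.  It compares each sample against the median of the series and flags the ones that sit unusually far away.

```python
from statistics import median

def flag_anomalies(samples, factor=3.0):
    """Return the indexes of samples that sit far from the series median.

    Uses the median absolute deviation (MAD) so that a few spikes do not
    distort the baseline they are being compared against.
    The factor of 3.0 is an arbitrary starting point, not a recommendation.
    """
    baseline = median(samples)
    mad = median(abs(s - baseline) for s in samples)
    if mad == 0:
        # Perfectly flat series: anything different from the baseline stands out.
        return [i for i, s in enumerate(samples) if s != baseline]
    return [i for i, s in enumerate(samples) if abs(s - baseline) > factor * mad]

# Hypothetical hourly average IOPS for one volume (made-up numbers).
hourly_iops = [410, 395, 402, 420, 398, 1650, 405, 415]
print(flag_anomalies(hourly_iops))  # prints [5]
```

The same comparison works for latency or queue depth.  The point is simply that the spike at position 5 stands out without anyone having to know what a "good" IOPS number is for that volume.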

I find analyzing individual volumes (when they are VMFS volumes) and volume collections (when they are guest attached volumes) the most helpful in making sure that there are not any hotspots in I/O.  It can help you determine if a VM might be better served in a VMFS volume that hasn’t been demanding as much I/O as the one it’s currently in.

It can also play a role in future procurement.  Those 15k SAS drives may sound like a neat idea, but does your environment really need that when you decide to add storage?  Thinking about VDI?  It can be used to help determine I/O requirements.  Recently, I was on the phone with a friend of mine, Tim Antonowicz.  Tim is a Senior Solutions Architect from Mosaic Technology who has done a number of successful VDI deployments (and who recently started a new blog).  We were discussing the possibility of VDI in my environment, and one of the first things he asked of me was to pull various reports from SANHQ so that he could understand our existing I/O patterns.  It wasn’t until then that I noticed all of the great storage analysis offerings that any geek would love.  There are a number of canned reports that can be saved out as a pdf, html, csv, or other format to your liking.

image
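When numbers like these are used for sizing (for VDI or a storage purchase), averages alone can hide the peaks.  Here is a minimal sketch of comparing an average against a 95th percentile.  The IOPS values are made up, and how you load the numbers from an exported report depends on that report's layout, which I'm not assuming anything about here.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that covers pct% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical IOPS samples taken at regular intervals (made-up numbers).
iops = [120, 135, 150, 160, 180, 210, 240, 300, 450, 900]

print("average:", sum(iops) / len(iops))          # average: 284.5
print("95th percentile:", percentile(iops, 95))   # 95th percentile: 900
```

With this sample, sizing for the average would leave the storage well short during the busiest interval, which is the sort of pattern worth knowing before committing to a design.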

Replication Monitoring

The value of SANHQ went way up for me when I started replication.  It will give you summaries of each volume replicated.

image

If you click on an individual volume, it will help you see transfer sizes and replication times of the most recent replicas.  It also separates inbound replica data from outbound replica data.

image

While the times and the transfer rates will be skewed somewhat if you have multiple replicas running (as I do), it is a great example of how you can understand patterns in changed data on a specific volume.  The volume captured above represents where one of my Domain Controllers lives.  As you can see, it’s pretty consistent, and doesn’t change much, as one would expect (probably not much more than the swap file inside the VM, but that’s another story).  Other kinds of data replicated will fluctuate more widely.  This is your way to see it.

 

Running SANHQ

SANHQ will live happily on a stand alone VM.  It doesn’t require much, but does need direct access to your SAN, and uses SNMP.  Once installed, SANHQ can be run directly on that VM, or the client-only application can be installed on your workstation for a little more convenience.  If you are replicating data, you will want SANHQ to connect to the source site, and the target site, for most effective use of the tool.

Improvements?  Sure, there are a number of things that I’d love to see.  Setting alarms for performance thresholds.  Threshold templates that you could apply to a volume (VMFS or native) that would help you understand the numbers (green = good.  Red = bad).  The ability to schedule reports, and define how and where they are posted.  Free pool space activity warnings (important if you choose to keep replica reserves low and leverage free pool space).  Array diagnostics dumps directly from SANHQ.  Programmatic access for scripting.  Improvements like these could make a useful product become indispensable in a production environment.

December 15, 2010

Finally. A practical solution to protecting Active Directory

 

Active Directory.  It is the brains of most modern-day IT infrastructures, providing just about every conceivable control of how users, computers and information will interact with each other.  Authentication, user, group and computer access control, all help provide logical barriers that allow for secure access, but a seamless user experience with single sign-on access to resources.  While it has the ability to improve and integrate critical services such as DNS, DHCP, and NTP, in many ways those services become dependent on Active Directory.  These days, Active Directory controls more than just pure Windows environments.  Integration with non-Microsoft Operating systems like Ubuntu, Suse, and VMware’s vSphere is becoming more common thanks to products such as LikeWise.  The environment that I manage has Windows Servers and clients, most distributions of Linux, Macs, a few flavors of Unix, VMware, and iPhones.  All of them rely on Active Directory.  You quickly learn that if Active Directory goes down, so does your job security.

Active Directory will run happily even under less than ideal circumstances.  It is incredibly resilient, and somehow can put up with server crashes, power outages, and all sorts of debauchery.  But neglect is not a required ingredient for things to go wrong.  When it does, the results can be devastating.  AD problems can be difficult to track down, and its tentacles will affect services you never considered.  A corrupt Active Directory, or the Controllers it runs on, can make your Exchange and SQL servers crumble around you.  I lived through this experience (barely) a while back, and even though my preparation for such scenarios looked very good on paper, I spent a healthy amount of time licking my wounds, and reassessing my backup strategy of Active Directory.  I never want to put myself in that position again.

As important as Active Directory is, it can be quite challenging to protect.  Why?  I believe the answer can be boiled down into two main factors; it’s distributed, and it’s transaction based.  In other words, the two traits that make it robust also make it difficult to protect.  Large enterprises usually have a well architected AD infrastructure, and at least understand the complexities of protecting their AD environment.  Many others are left with pondering the various ways to protect it.

  • File based backups using traditional backup methods.  This has never been enough, but my bet is that you’d find a number of smaller environments do this – if they do anything at all.  It has worked for them only because they’ve never had a failure of any sort.
  • AD backup agents that are a part of a commercial backup application.  Some applications like Symantec Backup Exec (what I previously relied on) seem like a good idea, but show their true colors when you actually try to use it for recovery.  While the agents should be extending the functionality of the backup software, they just add to an already complex solution that feels like a monstrosity geared for other purposes.
  • Exporting AD on Windows 2008 based Domain controllers by using NTDSUTIL and the like.  This is difficult at best, arguably incomplete, and if you have a mix of Windows 2008 and Windows 2003 DC’s, won’t work.
  • Those who have virtualized their domain controllers often think that the well timed independent snapshot or VCB backup will protect them.  This is not true either.  You will have a VM consistent backup of the VM itself, but it does nothing to coordinate the application with the other Domain Controllers and the integrity of its contents.  In theory, they could be backed up properly if every single DC was shut down at the same time, but most of us know that would not be a solution at all.
  • Dedicated Solutions exist to protect Active Directory, but can be overly complex, and outrageously expensive.  I’m sure they do their job well, but I couldn’t get the line item past our budget line owner to find out.

The result can be a desire to want to protect AD, but uncertainty on what “protect” really means.  Is protecting the server good enough?  Is protecting AD itself enough?  Does one need both, and if so, how does one go about doing that?  Without fully understanding the answers to those questions, something inevitably goes wrong, and the Administrator is frantically flipping through the latest TechNet Article on Authoritative Restores, while attempting to figure out their backup software.  It’s particularly painful to the Administrator, who had the impression that they were protecting their Organization (and themselves) when in fact, they were not. 

In my opinion, protecting the domain should occur at two different levels.

  • Application layer.  This is critical.  Among other things, the backup will coordinate Active Directory so that all of its Update Sequence Numbers (USN’s) are at an agreed upon state.  This will avoid USN’s that are out of sync, which can be the root of so many AD related problems.  Application layer protection should also honor these AD specific attributes so that granular recovery of individual objects is possible.  Good backup software should leverage API’s that take advantage of Volume Shadow Copy Services (VSS).
  • Physical layer.  This protects the system that the services may be running on.  If it’s a physical server, it could be using some disk imaging software such as Acronis, or Backup Exec System Recovery.  If it’s virtualized, an independent backup of the VM will do.  Some might suggest that protecting the actual machine isn’t technically required.  The idea behind that reasoning is that if there is a problem with the physical machine, or the OS, one can quickly decommission and commission another DC with “dcpromo.”  While protecting the system that AD runs on may not be required, it may help speed up your ability (in conjunction with Application layer protection) to correct issues from a previously known working state.
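To make the USN concern a little more concrete, here is a deliberately simplified toy model of why an uncoordinated snapshot restore of a single DC causes trouble.  This is not how AD's replication code actually works, and the class and method names are made up for illustration.  It only captures the basic idea: each DC numbers its own changes, and its partners remember the highest number they have already received from it.

```python
from dataclasses import dataclass, field

@dataclass
class DomainController:
    """Toy stand-in for a DC: a change counter plus what it has seen from partners."""
    name: str
    highest_usn: int = 0
    # Highest USN this DC has already received from each partner.
    seen_from: dict = field(default_factory=dict)

    def local_change(self):
        """Every originating change on this DC consumes the next USN."""
        self.highest_usn += 1

    def replicate_from(self, partner):
        """Pull changes from a partner; return False if the partner has gone backwards."""
        already_seen = self.seen_from.get(partner.name, 0)
        if partner.highest_usn < already_seen:
            # The partner claims fewer changes than we already received from it.
            return False
        self.seen_from[partner.name] = partner.highest_usn
        return True

dc1 = DomainController("DC1")
dc2 = DomainController("DC2")

for _ in range(100):
    dc1.local_change()
snapshot_usn = dc1.highest_usn   # uncoordinated VM snapshot taken at USN 100

for _ in range(50):
    dc1.local_change()
dc2.replicate_from(dc1)          # DC2 has now seen DC1 up to USN 150

dc1.highest_usn = snapshot_usn   # DC1 is reverted to the snapshot, back at USN 100
print(dc2.replicate_from(dc1))   # prints False
```

After the revert, DC1 would start reusing numbers 101 through 150 for brand new changes, while DC2 believes it already has everything up to 150.  That mismatch is what a coordinated, application layer backup is meant to avoid.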

I was introduced to CionSystems by a colleague of mine who suggested their “Active Directory Self-Service” product to help us with another need of ours.  Along the way, I couldn’t help but notice their AD backup offering.  Aptly named, “Active Directory Recovery” is a complete application layer solution.  I tried it out, and was sold.  It allows for a simple, coordinated backup and recovery of Active Directory.  A recovery can be either a complete point-in-time, or a granular restore of an object.  It is agentless, meaning that you don’t have to install software on the DCs.  The first impression after working with it is that it was designed for one purpose; to backup Active Directory.  It does it, and does it well.

The solution will run on any spare machine running IIS and SQL.  Once installed, configuring it is just a matter of pointing it to your Domain Controller that runs the PDC Emulator role.  After a few configuration entries are made, the Administration console can be accessed with your web browser from anywhere on your network.

image

The next step is to set up a backup job, and let it run.  That’s it.  Fast, simple, and complete.  From the home page, there are a few different ways you can look at objects that you want to recover.

If it’s a deleted object, you can click on the “Deleted Objects” section.  Objects with a backup to restore from will show up in green, with the available backups listed below each object.  Below you will see a deleted computer object, and the backups that it can be restored from.

image

The “List Backups” simply shows the backups created in chronological order.  From there you can do full restores, or restore an individual object that still exists in AD.  Unlike authoritative restores, you do not have to do any system restarts.

image

During the restore process, “Active Directory Recovery” will expose individual attributes of the object that you want to restore – if you wish for the restore to be that granular.  If an attribute is restorable, there is a checkbox next to it.  Non-modifiable attributes will not have a checkbox next to them.

image

One of my favorite features is that it provides a way for a true, portable backup.  One can export the backup to a single file (a proprietary .bin file) that is your entire AD backup, and save it onto a CD, or to a remote location.  This is a wish list item I’ve had for about as long as AD has been around.    There are many other nice features, such as email notifications, filtering and comparison tools, as well as backup retention settings. 

I use this product to complement my existing strategy for protecting my AD infrastructure.  While my virtualized Domain Controllers are replicated to a remote site (the physical protection, so to speak), I protect my AD environment at the application level with this product.  The server that “Active Directory Recovery” runs on is also replicated, but to be extra safe, I create a portable/exported backup that is also shipped off to the offsite location.  This way I have a fully independent backup of AD.  If I’m doing some critical updates to my Domain Controllers, I first make a backup using Active Directory Recovery, then make my snapshots on my virtualized DC’s.  That way, I have a way to roll back the changes that is truly application consistent.

After using the product for a while, I can appreciate that I don’t have to invest much time to keep my backups up and running.  I previously used Symantec’s Backup Exec to protect AD, but grew tired of agent issues, licensing problems, and the endless backup failure messages.  I lost confidence in its ability to protect AD, and am not interested in going back. 

Hopefully this gives you a little food for thought on how you are protecting your Active Directory environment.  Good luck!

October 31, 2010

Replication with an EqualLogic SAN; Part 5

 

Well, I’m happy to say that replication to my offsite facility is finally up and running now.  Let me share with you the final steps to get this project wrapped up. 

You might recall that in my previous offsite replication posts, I had a few extra challenges.  We were a single site organization, so in order to get replication up and running, an infrastructure at a second site needed to be designed and in place.  My topology still reflects what I described in the first installment, but simple pictures don’t describe the work getting this set up.  It was certainly a good exercise in keeping my networking skills sharp.  My appreciation for the folks who specialize in complex network configurations, and address management has been renewed.  They probably seldom hear words of thanks for, say, that well designed subnetting strategy.  They are an underappreciated bunch for sure.

My replication has been running for some time now, but this was all within the same internal SAN network.  While other projects prevented me from completing this sooner, it gave me a good opportunity to observe how replication works.

Here is the way my topology looks fully deployed.

image

Most Colocations or Datacenters give you about 2 square feet to move around, (only a slight exaggeration on the truth) so it’s not the place you want to be contemplating reasons why something isn’t working.  It’s also no fun realizing you don’t have the remote access you need to make the necessary modifications, and you don’t, or can’t drive to the CoLo.  My plan for getting this second site running was simple.  Build up everything locally (switchgear, firewalls, SAN, etc.) and change my topology at my primary site to emulate the 2nd site.

Here is the way it was running while I worked out the kinks.

image

All replication traffic occurs over TCP port 3260.  Both locations had to have accommodations for this.  I also had to ensure I could manage the array living offsite.  Testing this out with the modified infrastructure at my primary site allowed me to verify traffic was flowing correctly.
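A quick way to confirm that port 3260 is actually reachable through the firewalls is a plain TCP connection test from a machine on the source SAN segment.  Here is a minimal sketch using Python's standard library.  The function name is my own, and the address in the usage comment is a documentation placeholder, not a real array.

```python
import socket

def can_reach(host, port=3260, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This only proves the port is open end to end (routing and firewall
    rules); it says nothing about whether replication itself is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False

# Usage (placeholder address; substitute the target group's IP address):
#   can_reach("192.0.2.10")
```

A False result here points at routing or firewall rules rather than at the arrays, which narrows down where to look before touching the replication configuration.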

The steps taken to get two SAN replication partners transitioned from a single network to two networks (onsite) were:

  1. Verify that all replication is running correctly when the two replication partners are in the same SAN Network
  2. You will need a way to split the feed from your ISP, so if you don’t have one already, place a temporary switch at the primary site on the outside of your existing firewall.  This will allow you to emulate the physical topology of the real site, while having the convenience of all of the equipment located at the primary site. 
  3. After the 2nd firewall (destined for the CoLo) is built and configured, place it on that temporary switch at the primary site.
  4. Place something (a spare computer perhaps) on the SAN segment of the 2nd firewall so you can test basic connectivity (to ensure routing is functioning, etc) between the two SAN networks. 
  5. Pause replication on both ends, take the target array and its switchgear offline. 
  6. Plug the target array’s Ethernet ports to the SAN switchgear for the second site, then change the IP addressing of the array/group so that it’s running under the correct net block.
  7. Re-enable replication and run test replicas.  Starting out with the Group Manager.  Then to ASM/VE, then onto ASM/ME.

It would be crazy not to take one step at a time on this, as you learn a little on each step, and can identify issues more easily.  Step 3 introduced the most problems, because traffic has to traverse routers that also are secure gateways.  Not only does one have to consider a couple of firewalls, you now run into other considerations that may be undocumented.  For instance:

  • ASM/VE replication occurs courtesy of vCenter.  But ASM/ME replication is configured inside the VM.  Sure, it’s obvious, but so obvious it’s easy to overlook.  That means any topology changes will require adjustments in each VM that utilize guest attached volumes.  You will need to re-run the “Remote Setup Wizard” to adjust the IP address of the target group that you will be replicating to.
  • ASM/ME also uses a VSS control channel to talk with the array.  If you changed the target array’s group and interface IP addresses, you will probably need to adjust what IP range will be allowed for VSS control.
  • Not so fast though.  VM’s that use guest iSCSI initiated volumes typically have those iSCSI dedicated virtual network cards set with no default gateway.  You never want to enter more than one default gateway in this sort of situation.  The proper way to do this will be to add a persistent static route.  This needs to be done before you run the Remote Setup Wizard above.  Fortunately the method to do this hasn’t changed for at least a decade.  Just type in:

route -p add [destinationnetwork] mask [subnetmask] [gateway] metric [metric]

  • Certain kinds of traffic that pass almost without a trace across a layer 2 segment show up right away when being pushed through very sophisticated firewalls whose default stance is deny all unless explicitly allowed.  Fortunately, Dell puts out a nice document on their EqualLogic arrays.
  • If possible, it will be easiest to configure your firewalls with route relationships between the source SAN and the target SAN.  It may complicate your rulesets (NAT relationships are a little more intelligent when it comes to rulesets in TMG), but it simplifies how each node is seeing each other.  This is not to say that NAT won’t work, but it might introduce some issues that wouldn’t be documented.
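For the persistent static route mentioned above, it is easy to mistype a network address or subnet mask when repeating the command across several VM’s.  Below is a small sketch that builds the command text from a CIDR block, so the mask is derived rather than typed.  The helper name and the example addresses are hypothetical, and it only produces the string; it does not run anything.

```python
import ipaddress

def persistent_route_command(destination_cidr, gateway, metric=1):
    """Build a Windows 'route -p add' command line for the given network.

    strict=True makes ip_network() reject input such as 10.20.30.5/24,
    where host bits are set and the network address is probably a typo.
    """
    network = ipaddress.ip_network(destination_cidr, strict=True)
    gw = ipaddress.ip_address(gateway)
    return (
        f"route -p add {network.network_address} "
        f"mask {network.netmask} {gw} metric {metric}"
    )

# Made-up example: remote SAN network 10.20.30.0/24 via a local gateway.
print(persistent_route_command("10.20.30.0/24", "10.10.30.1"))
# prints: route -p add 10.20.30.0 mask 255.255.255.0 10.10.30.1 metric 1
```

The resulting line can then be pasted into an elevated command prompt on each VM that has guest attached volumes.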

Step 7 exposed an unexpected issue; terribly slow replicas.  Slow even though it wasn’t even going across a WAN link.  We’re talking VERY slow, as in 1/300th the speed I was expecting.  The good news is that this problem had nothing to do with the EqualLogic arrays.  It was an upstream switch that I was using to split my feed from my ISP.  The temporary switch was not negotiating correctly, and causing packet fragmentation.  Once that switch was replaced, all was good.

The other strange issue was that even though replication was running great in this test environment, I was getting errors with VSS.  ASM/ME at startup would indicate “No control volume detected.”  Even though replicas were running, the replicas couldn’t be accessed, used, or managed in any way.  After a significant amount of experimentation, I eventually opened up a case with Dell Support.  Running out of time to troubleshoot, I decided to move the equipment offsite so that I could meet my deadline.  Well, when I came back to the office, VSS control magically worked.  I suspect that the array simply needed to be restarted after I had changed the IP addressing assigned to it. 

My CoLo facility is an impressive site.  Located in the Westin Building in Seattle, it is also where the Seattle Internet Exchange (SIX) is located.  Some might think of it as another insignificant building in Seattle’s skyline, but it plays an important part in efficient peering for major Service Providers.  Much of the building has been converted from a hotel to a top tier, highly secure datacenter and a location in which ISP’s get to bridge over to other ISP’s without hitting the backbone.  Dedicated water and power supplies, full facility fail-over, and elevator shafts that have been remodeled to provide nothing but risers for all of the cabling.  Having a CoLo facility that is also an Internet Exchange Point for your ISP is a nice combination.

Since I emulated the offsite topology internally, I was able to simply plug in the equipment, and turn it on, with the confidence that it will work.  It did. 

My early measurements on my feed to the CoLo are quite good.  Since the replication times include buildup and teardown of the sessions, one might get a more accurate measurement on sustained throughput on larger replicas.  The early numbers show that my 30Mbps circuit is translating to replication rates that range in the neighborhood of 10 to 12GB per hour (205MB per min, or 3.4MB per sec.).  If multiple jobs are running at the same time, the rate will be affected by the other replication jobs, but the overall throughput appears to be about the same.  Also affecting speeds will be other traffic coming to and from our site.
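To put those numbers in context, here is the arithmetic comparing the observed rate against what the circuit could carry in theory.  It assumes decimal units (1GB = 1000MB) and ignores protocol overhead, so treat it as a rough sanity check rather than a precise measurement.

```python
def mbps_to_mb_per_sec(megabits_per_sec):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return megabits_per_sec / 8

circuit_mbps = 30            # circuit speed from the post
observed_mb_per_min = 205    # observed replication rate from the post

ceiling_mb_per_sec = mbps_to_mb_per_sec(circuit_mbps)
ceiling_gb_per_hour = ceiling_mb_per_sec * 3600 / 1000
observed_mb_per_sec = observed_mb_per_min / 60
observed_gb_per_hour = observed_mb_per_min * 60 / 1000
utilization = observed_mb_per_sec / ceiling_mb_per_sec

print(f"circuit ceiling: {ceiling_mb_per_sec:.2f} MB/s, {ceiling_gb_per_hour:.1f} GB/hour")
print(f"observed:        {observed_mb_per_sec:.2f} MB/s, {observed_gb_per_hour:.1f} GB/hour")
print(f"utilization:     {utilization:.0%}")
```

By this math the circuit tops out at 3.75MB per sec, or 13.5GB per hour, so the observed rate is roughly 91% of the theoretical ceiling.  (Using 1024MB per GB instead, 205MB per min works out to about 12.0GB per hour.)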

There is still a bit of work to do.  I will monitor the resources, and tweak the scheduling to minimize the overlap on the replication jobs.  In past posts, I’ve mentioned that I’ve been considering the idea of separating the guest OS swap files from the VM’s, in an effort to reduce the replication size.  Apparently I’m not the only one thinking about this, as I stumbled upon this article.  It’s interesting, but a nice amount of work.  Not sure if I want to go down that road yet.

I hope this series helped someone with their plans to deploy replication.  Not only was it fun, but it is a relief to know that my data, and the VM’s that serve up that data, are being automatically replicated to an offsite location.
