Saturday, September 03, 2011

F5 BIGIP LTM Reboot Script

In an effort to ensure the best performance and stability of our two BIGIP LTM 6400 Load Balancers, I have created a script to synchronise and reboot the units regularly.

This script runs a series of checks before rebooting the unit.
  1. Check Active/Standby state based upon the output of bigpipe failover show 
  2. Check Peer status (up/down) - based upon the result of ping -c 1 -w 5 peer ('peer' is the hostname of the peer BIGIP) 
  3. Check the uptime to see when the unit was last started; if it is under a given period, don't reboot 
  4. Check configuration synchronisation status based upon the output of bigpipe config sync show 
If the configuration is not in sync, it will attempt to synchronise the configuration using bigpipe config sync all and check the synchronisation status again. If the configuration is still not in sync, it will exit without rebooting the unit.

Each check/task writes its output to STDOUT and to syslog (facility: local0.notice, tag: BIGIP-ADMIN-SCRIPT). A result file (/tmp/reboot-cron-job-result) is also written; it is left in place until the next run and is e-mailed to 'user@domain.tld' (change this to suit your environment).

The same reboot.sh script is used on each unit.

There is still some tidying up to do - such as using a lockfile, and better error handling with 'set -e', 'set -u' and traps.

/home/admin/reboot.sh
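For illustration, here is a minimal sketch of the logic described above - not the original script. It assumes the v9/v10 bigpipe CLI; the output patterns matched by grep, the uptime threshold and the mail(1) invocation are placeholders to adapt to your environment.

#!/bin/bash
# Minimal sketch of reboot.sh - hypothetical reconstruction, not the original.

PEER="peer"                              # hostname of the peer BIGIP
MAILTO="user@domain.tld"                 # change to suit your environment
RESULT="/tmp/reboot-cron-job-result"
MIN_UPTIME_DAYS=2                        # placeholder minimum uptime before rebooting
TAG="BIGIP-ADMIN-SCRIPT"

log() {
    echo "$1" | tee -a "$RESULT"
    logger -p local0.notice -t "$TAG" "$1"
}

: > "$RESULT"

# 1. Record the Active/Standby state
log "Failover state: $(bigpipe failover show)"

# 2. Make sure the peer is reachable before taking this unit down
if ! ping -c 1 -w 5 "$PEER" > /dev/null 2>&1; then
    log "Peer $PEER is unreachable - aborting reboot."
    exit 1
fi

# 3. Skip the reboot if the unit was started recently
updays=$(awk '{ print int($1 / 86400) }' /proc/uptime)
if [ "$updays" -lt "$MIN_UPTIME_DAYS" ]; then
    log "Uptime of ${updays}d is below the threshold - skipping reboot."
    exit 0
fi

# 4. Verify config sync, attempting one resync if required
if ! bigpipe config sync show | grep -qi "in sync"; then
    log "Configuration not in sync - running 'bigpipe config sync all'."
    bigpipe config sync all
    if ! bigpipe config sync show | grep -qi "in sync"; then
        log "Configuration still not in sync - aborting reboot."
        exit 1
    fi
fi

log "All checks passed - rebooting."
mail -s "BIGIP reboot script result" "$MAILTO" < "$RESULT"
reboot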


I run this script from a cron job that fires at the start of our weekly maintenance window. You can also use it as a safe way to force a failover and reboot.


Monday, July 04, 2011

F5 BIGIP LTM Maintenance Page Update for v10

The folks at F5 devcentral have kindly provided a number of 'Maintenance Page' examples that allow you to host a page directly from the BIGIP LTM and display it automatically when all pool members go off-line. The example I used is http://devcentral.f5.com/wiki/default.aspx/iRules/LTMMaintenancePage.html (login required, registration is free).

However, there are a few changes required to get it working with the latest version of TMOS (v10).

Follow the instructions provided in the aforementioned link and change them as follows:

Create iRule Data Groups with the following information:

maint_index_html_class

General Properties
Name: maint_index_html_class
Partition: Common
Type: (External File)

Records
Path/Filename: /var/class/maint.index.html.class
File Contents: String
Key/Value Pair Selector: :=
Access Mode: Read/Write

The file will need to look like the following (add "index.html" := to the beginning of the existing example):
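The original example content is not reproduced here; illustratively, with placeholder HTML standing in for the devcentral page, the file looks something like this:

"index.html" := "<html><head><title>Maintenance</title></head><body><h1>Down for maintenance</h1><img src='/logo.png'/></body></html>",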



maint_index_logo_class

General Properties
Name: maint_index_logo_class
Partition: Common
Type: (External File)

Records
Path/Filename: /var/class/maint.logo.png.class
File Contents: String
Key/Value Pair Selector: :=
Access Mode: Read/Write

The file will need to look like the following (add "logo.png" := to the beginning of the existing example):
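Again the example content is not reproduced here. The value is the base64-encoded PNG (the iRule b64decodes it before serving), so illustratively:

"logo.png" := "iVBORw0KGgoAAAANSUhEUg...rest-of-the-base64-encoded-PNG...",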



generic_irule_maintenance_page
  • Replace [lindex $::maint_index_html_class 0] with [class element -value 0 maint_index_html_class] 
  • Replace [b64decode [lindex $::maint_logo_png_class 0]] with [b64decode [class element -value 0 maint_index_logo_class]] (both substitutions are shown in the sketch below) 
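For reference, here is a sketch of how the updated iRule might look with both substitutions in place. It follows the general shape of the devcentral example rather than reproducing it exactly:

when HTTP_REQUEST {
    # Serve the maintenance page only when the pool has no active members
    if { [active_members [LB::server pool]] < 1 } {
        if { [HTTP::uri] ends_with "logo.png" } {
            HTTP::respond 200 content [b64decode [class element -value 0 maint_index_logo_class]] "Content-Type" "image/png"
        } else {
            HTTP::respond 200 content [class element -value 0 maint_index_html_class] "Content-Type" "text/html"
        }
    }
}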

Tuesday, May 31, 2011

F5 BIGIP and Blackboard Collaboration Server

Blackboard Collaboration Server is a separate, optional, web server that provides virtual classroom and chat tools. As part of the university’s Blackboard application upgrade I have been asked to develop a way to add resilience to the collaboration server side of the application where possible.

The brief is to provide failover only, because the collaboration server is not “load balancing aware”: it assumes it is hosted on a single host. To provide rudimentary failover capability I have set up a method that switches all sessions to another host should the active host fail. Clients then stay on the new host until it fails, and only then are all sessions switched to the other. The key word here is ‘all’, because it’s important to keep all sessions on the same host.

From a user’s perspective: in the event of an active host outage they will lose connectivity, but will be able to log back in straight away and continue until such time as the alternative host fails. This prevents them from being switched over only to be kicked again when the prior host is restored, and also ensures that ALL sessions are sent to a single host rather than spread across multiple hosts, so everyone is in the same chatrooms.

My first idea was to adapt BIGIP’s Priority Group capability; however, this presented the same problem in that I could not ‘stick’ the clients to a server. As soon as a server of the same or higher priority was restored, the sessions would be sent to the new host, effectively splitting the chat rooms. Load balancing also takes place across member servers of the same priority.

So I did a bit of digging around and discovered a method of using an iRule to ‘stick’ sessions based upon an arbitrary value - in this case I used the TCP port number.

The iRule is as follows:
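The original listing has not survived here, but from the description below it amounts to just a few lines (a reconstruction, not the original):

when CLIENT_ACCEPTED {
    # Persist on the virtual server's port number so that every
    # client maps to the same persistence record, and thus the same host.
    persist uie [TCP::local_port]
}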



CLIENT_ACCEPTED is an event that is triggered when a connection has been established between a client device and the BIGIP.

‘persist uie’ is where I am manipulating the connection persistence, in this case via the Universal Inspection Engine. Here I am simply setting an integer key - it can be any number, but I have chosen the connecting TCP port number ([TCP::local_port]). Since every client connection to the virtual server produces the same key, this fixes session persistence to a single host, preventing load balancing.

The following BIGIP configuration has been tested as working by a business systems analyst using a combination of application logs, BIGIP statistics and packet captures. He confirmed which traffic was being sent on which ports: port 8010 carries the majority of user-generated traffic that must be kept on a single host, while port 8443 transports application-specific information but carries nothing user-generated and therefore does not require persistence.

The aforementioned iRule is referenced by a ‘Universal Persistence’ profile as follows:
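The profile settings are not reproduced here; they would be along these lines (the names are placeholders):

General Properties
Name: collab_universal_persist
Persistence Type: Universal

Configuration
iRule: collab_port_persist
Timeout: 3600 seconds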



And then reference that Universal Persistence profile from a Performance Layer 4 type Virtual Server like so:
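Again a sketch with placeholder names and a documentation address standing in for the real VIP:

General Properties
Name: vs_collab_8010
Destination: 192.0.2.10:8010
Type: Performance (Layer 4)

Resources
Default Pool: pool_collab_8010
Default Persistence Profile: collab_universal_persist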



Another Virtual Server is required for HTTPS traffic; however, this does not require any special configuration and is set up as a typical HTTP type Virtual Server, e.g.:
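A sketch along the same lines - note that no persistence profile is attached, since the port 8443 traffic does not require it:

General Properties
Name: vs_collab_8443
Destination: 192.0.2.10:8443
Type: Standard

Resources
Default Pool: pool_collab_8443
Default Persistence Profile: (None)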


The above configuration refers to a six member/node pool. Each member runs both the general Blackboard application and the Collaboration Service. We have yet to load test the combination of the application and collaboration services and how they influence how the BIGIP balances the load across the members - I am considering ‘Observed (node)’ as opposed to the current ‘Observed (member)’ method, since the same nodes are used in multiple pools. At some stage I would also like to look at using Dynamic Ratio, if it can play nicely with persistent connections.

References:
http://devcentral.f5.com/wiki/default.aspx/iRules/CLIENT_ACCEPTED.html
http://devcentral.f5.com/wiki/default.aspx/iRules/persist.html
http://support.f5.com/kb/en-us/archived_products/big-ip/manuals/product/bigip4_5admin/BIGip_uie.html

Also take note of:
http://support.f5.com/kb/en-us/solutions/public/4000/100/sol4166.html

Saturday, May 28, 2011

Windows Wireless Clients and the X6148V-GE-TX Ethernet Switching Module

Burnt hard by a bug that exists in a place that makes plenty of sense when you find it but not so much when you’re looking at the symptoms.

I was tasked with establishing an EduRoam presence at a University. Since there was already a suitable wireless infrastructure in place all I needed to do was build a FreeRADIUS server, hook it into the EduRoam federated RADIUS and point the two Cisco 4404 controllers dressed as a WiSM (Wireless Services Module) at it so they authenticate EduRoam clients. Easy!

Getting FreeRADIUS communicating nicely with EduRoam was made more difficult than it needed to be: the configuration information provided by EduRoam was sketchy and inaccurate. It wasn’t until I decided to chuck it out and build the FreeRADIUS configuration from scratch that it worked. EduRoam have some strange ideas about what should be sent in the outer TLS tunnel... it’s the inner tunnel that’s important; the outer is just establishing an anonymous TLS connection to the local RADIUS server, which then passes the inner tunnel on to the client’s home campus RADIUS.

Okay, that was a bit tedious, but that should have been the hard part over with. Authentication was working nicely with the local LDAP directory (Novell eDirectory) and with other federated entities, tested with accounts from James Cook University, AARNET and the Australian Catholic University. All that remained was the simple task of setting up a WLAN on the WiSM and confirming that it worked with EduRoam, as I had been using my trusty Mikrotik RouterBoard RB433 for testing up to this point. I associated a laptop with the new WLAN, went to open Google, and was presented with a rather slow web experience that would basically stall on the first image that tried to load. Pings were fine, however, so end-to-end connectivity was all there.

Odd. Maybe I left something out/in, or perhaps the RADIUS was setting some kind of QoS value on the controllers that I wasn’t aware of. Checked all that out - nope, all good. Maybe it’s the laptop? Try a little netbook running Jolicloud - works fine. Okay, let’s check with another laptop - Win7 - fail! MacBook - works! A Windows wireless client + WiSM + EduRoam problem?? Hang on, let’s try the Intranet - works! Let’s try a proxy server - works! This is getting annoying, so it’s a Windows wireless client + WiSM + EduRoam + FWSM/NAT + Internet problem??

The next 8 months consisted of running every conceivable check on the data path between a Windows wireless client and the Internet. The Cisco TAC crawled over the WiSM - all good; the FWSM - hmm, old untrusted software, install another one! Test again - all good. Even the ASR - nope, all good.

So I figured that it must be something I was just not doing right. I blew away my test environment, which consisted of a C4402 wifi controller, C1131AG/C1142N LWAPs and the second FWSM running the latest software, and rebuilt it. However, when I did this I physically relocated all the kit (except the FWSM, of course) from the data centre to the foyer just outside. In doing so I disconnected the C4402 from the C6513 and plugged it into a C3750 I had set up to provide the link between the APs and the controller, plus the trunk back into the general network. This configuration worked!

(Diagram: the test environment at this stage.)
So what did introducing a C3750, or simply moving the kit elsewhere on the network, do to fix the issue? This made me think there was something suss going on with the chassis and/or its switching modules.

By now the TAC had grown tired of my pokes and prods, so I gave our Cisco account manager a nudge; the SR was escalated, and an e-mail CC’d to ‘Cisco Australia’ popped into my inbox from the Cisco switching team asking for a WebEx session so they could waterboard the 6513 chassis that housed the WiSM and FWSM.

The phone call started at 10am Monday morning and didn’t end until 3pm.

We worked through each stage of the data path again. Luckily they had the history of all the other tests I had done, so I didn’t have to repeat many of the captures. We narrowed it down to the X6148V-GE-TX switching module. This was the one element that all the different combinations I had tried shared in common: the C4402 test controller was connected to it, along with the link to the ASR/Internet. So I connected the C4402 to a port on the module (issue present, not working) and ran a capture, then moved the C4402 to a X6724-SFP module (no issue present, working) and ran another capture. The TAC guys then compared the two captures. It seems the X6148 was silently dropping packets - small ones, particularly ACKs from the client - on egress to the ASR/Internet.

Gentlemen, we had hit Cisco bug CSCeb67650:

WS-X6548-GE-TX & WS-X6148-GE-TX may drop frames on egress 
Packets destined out the WS-X6548-GE-TX or the WS-X6148-GE-TX that are less than 64 bytes will be dropped. This can occur when a device forwards a 60 byte packet and the 4 byte dot1q tag is added to create a valid 64 byte packet. When the tag is removed the packet is 60 bytes again. If the destination is out a port on the WS-X6548-GE-TX or the WS-X6148-GE-TX it will be dropped by the linecard....

WLC drop TCP ack from wireless client to wired

Symptom: Wireless client has problems loading certain web pages. Conditions: client connected to a wireless controller has problems loading web pages from certain web sites, specifically problems loading pictures. A wired packet capture shows the ACKs coming from the wireless client are being dropped on the controller. Workaround: None

Since there was no workaround the only option was to shift the ASR/Internet link from the X6148 to a X6724. Fixed!

I plan to remove the X6148V-GE-TX from the chassis anyway, along with a CSM. These are both ‘classic’ modules that don’t use “fabric switching” (2 x 20Gb dedicated) but instead an older shared “bus” method (32Gb shared), causing the chassis as a whole to run below its potential. However, if X61xx modules were all I had, I would be in a pickle.


Note:

Wondering why this only affected Windows clients? So am I.

ACKs aren't all the same 'size', going by comparisons between pcaps I've grabbed from public repositories. However, ACK frames during an HTTP transfer all seem to be 60 bytes long no matter the OS.

I think it could be related to differences between the Slow Start/Congestion Avoidance algorithms. The ACKs are probably being dropped no matter which OS is sending them; some OSs might just be better at recovering. Something to test - although this problem shows indiscriminate dropping of 60 byte frames, so how can they recover??

I haven't been able to find a decent comparison between *nix/BSD/MacOS and Win* TCP stacks. It would be an interesting test to get a Linux box running the same algorithms as a Windows box. When I pull the X6148 out I'll toss it into the test 6509 and hang a test webserver off of it.

Sunday, August 08, 2010

Zimbra Part II

I mentioned a while ago that I would be rolling out a Zimbra mail server. It's been a hard slog, but I think I've got it together enough to roll it out into production.

I had some grief with bad sectors appearing on the system disk I used; they showed up in the swap partition. When bad sectors appear there, applications using that virtual memory show behaviour akin to faulty memory. So while I was migrating e-mail from the old mail server to the Zimbra server, Zimbra would use a bit of swap for various things, come across these bad sectors, and crash.

Having pinpointed the fault to bad sectors (using dmesg to see the disk errors), I went about imaging the old disk onto a new one. The system disk isn't mirrored - Zimbra lives on a mirror, but the system disk stands alone; call it a compromise on costs if you like. However, the cheap onboard RAID controller either sets all SATA ports to RAID or none, so I have to set up single-disk stripes in order to add a single disk. This means contending with the obscure device mapping between the BIOS, the RAID BIOS and the Linux device mapper. Juggling all these around, I managed to get the new disk in and booting without failed mounts and whatnot.

After all this I decided to clear the user accounts and aliases and refresh them. The reason is that I had modified the zmprov script that converts the passwd file into a zmprov command list so that it includes UIDs and GIDs plus the Samba ID. I could have created another script that simply modified each user; however, I felt it better to run through the process of clearing and restoring the users again, just to be sure.
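As a rough illustration of the approach (not the actual script - the domain, the placeholder password and the attribute carrying the UID/GID are assumptions):

#!/bin/bash
# Sketch: convert /etc/passwd entries into a zmprov command list.
DOMAIN="example.com"
while IFS=: read -r user pass uid gid gecos home shell; do
    [ "$uid" -ge 1000 ] || continue          # skip system accounts
    echo "createAccount ${user}@${DOMAIN} CHANGEME" \
         "displayName '${gecos%%,*}'" \
         "zimbraNotes 'uid=${uid} gid=${gid}'"
done < /etc/passwd > accounts.zmp
# Apply with: zmprov < accounts.zmp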

I also updated Zimbra to the latest 6.0.7 release, and tested the shared calendaring and resource scheduling a bit further to make sure it fits the requirements - it all works quite well, and I like the different permission levels for managing resources.

I think it would be possible to make a Zimbra appliance of sorts so I wouldn't mind having a go at scripting the installation and packaging it up into a small power efficient server that can be easily used by small businesses or as departmental mail servers. Could be something to add to my consulting work on the side along with cheap and efficient network consulting.

Monday, July 26, 2010

One Of The Situations I Find Myself In

The other day at work, reception called and said that they had an Ian from Summerville High School on the phone wanting to know what kind of software packages were in use by the Zoo. Straight away I had my doubts, because we receive many calls from salespeople wanting to get a foot in the door, which is made easier by developing an understanding of our IT environment. Since it was still only a doubt, I decided to take the call.

Speaking to Ian on the phone, he said that he would like to know more about the database systems in use by the Zoo. He said he would be bringing in about 16 girls from Summerville High in Brisbane, and that part of their current education is about databases. It was strange, because I would consider the zoo to be the last place to go on an excursion to learn about databases. Unless he had the foresight to know that we use databases to track our animals?

I compromised - I told him that he could call me and I would go out and talk about how the zoo uses various software packages with a database back-end.

The day came and I didn't receive a call at 9am as he had said I would. I put it down to a failed sales guy and went on with my usual tasks. However, at about midday I received a call from him, and it turned out that he was there, but with only six girls. So I had to give an impromptu talk about database systems in the Zoo environment to six high school girls. It's not something I had expected to do while working as an IT guy in a zoo.

I think I taught them something. I haven't given a talk to a group about IT topics for some time, so I forgot to do things like gauge their existing knowledge or try to draw more feedback from them in the form of questions and revision. It did remind me how much I like talking about the subject to others, and how much I miss the training side of what I do.

It was but a small break from the mundane.

Monday, May 24, 2010

Zimbra, once more

I remember playing around with the Zimbra Collaboration Suite when it first came into public existence some time ago. I was working for a different company back then and was looking at it from an ISP's perspective: it was good, but wasn't exactly there yet, although development was well underway in that regard. I played around with it a bit and stuck it on the 'neat tech to check out later' pile.

Later on, after a change of jobs, I looked at it again, this time from a medium-enterprise perspective. This time I was looking specifically at the per-user licensing for use of the Outlook Connector. The costs were okay, but limited testing in our environment proved it to be a bit hit and miss, although I attribute a large portion of the blame to the lack of any formal directory service or structure.

Now that I'm presented with a rapid upgrade requirement to save our e-mail services (due to the shortsightedness of management), I'm having to jump straight into rolling out Zimbra with limited testing. So I blew away an idle Win2k3 SBS server, installed Ubuntu 8.04 LTS Server and tossed on Zimbra 6.0.6.1. I did do some testing beforehand on a Xen VM, just to make sure it would install and operate okay before I wasted a good 2k3 install.

Currently I'm impressed with how far ZCS has come along. There's plenty of documentation available in the wiki and forums, and the 'zmprov' provisioning utility is working wonders for shifting user accounts over from the Sendmail/Dovecot/PAM setup on the old mail server. I am using imapsync to copy the 143GB of email over, thanks to the handy scripts provided by the ZCS community.
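For the curious, the per-mailbox imapsync invocation is along these lines (hostnames and credential handling are placeholders; the community scripts wrap something like this in a loop over all users):

imapsync --host1 oldmail.example.com --user1 jsmith --password1 'oldsecret' \
         --host2 zimbra.example.com  --user2 jsmith --password2 'newsecret' \
         --syncinternaldates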

One thing I like in particular is the ability to dump the {crypt} passwords straight from the shadow file into Zimbra's LDAP - no need to have everyone change their passwords. A change is recommended, though, and I will get them to do so after I'm satisfied that Zimbra is working okay.
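A sketch of the idea, assuming zmprov modifyAccount and the LDAP userPassword attribute (account and domain names are placeholders):

# Pull the crypt hash for one user from /etc/shadow (run as root)
user="jsmith"
hash=$(awk -F: -v u="$user" '$1 == u { print $2 }' /etc/shadow)
# Load the hash into Zimbra's LDAP unchanged - the user keeps their old password
zmprov modifyAccount "${user}@example.com" userPassword "{crypt}${hash}"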

I will update this post with a run down on the scripts I used with any modifications I made.