Friday, November 09, 2007

hype versus reality

Like every engineer, I have to admit, upfront, that I have a limited tolerance for spin-meistering and marketing terminology. I find it boring, repetitive, tiring and annoying. It's also totally predictable - after you identify the marketing speakwords, aka buzzwords, it's really annoying to see them over-used again and again and again and again - you get the picture. Most engineers can relate to this. The flip-side is that a certain amount of buzzword repetition actually works! OK - while I don't pretend to understand this phenomenon, I'll buy into it, based on anecdotal evidence. But... there is a point where hype and buzzword usage crosses my personal line-in-the-sand. I guess every technocrat has their own line-in-the-sand.

That line, in my case, is where hype and buzzwords go from being hype that really can't be verified and validated, to being totally bogus and unbelievable. In fact, I'll go further and state that, in some extreme cases, it can be plain dumb and/or possibly dishonest. Such is the case with Marc Hamilton's blog entry (blogs.sun.com/marchamilton/entry/busy_weekend), which states that the Indiana project preview had seen 100,000 downloads in less than 72 hours. Why did I think that this number was just plain wrong? Because I had an email exchange with Jesse Silver of Sun, in which he asked me if I could provide download numbers for the Project Indiana Developer Preview (filename in-preview.iso) that we were mirroring on www.genunix.org, and he told me to expect "big numbers". So I asked him, "what do you mean by big numbers?" and he said that they had seen over 100,000 downloads from dlc.sun.com. I was curious - because my initial reaction was that I didn't (personally) feel that there was this level of interest in Indiana - particularly since the marketing team had not released it under the widely expected name of Project Indiana, but instead had chosen to rename/re-brand it as the OpenSolaris Developer Preview - which no-one, including me, really expected. Well, after taking a look at genunix.org's numbers - we had shipped 690 copies at that time (Mon Nov 5 11:52:54 PST 2007) - I just didn't see that level of interest. After a quick back-of-the-napkin calculation, I knew those numbers were just plain wrong - and I advised Jesse that I felt they were flawed. Why? Well, for the answer, take a look at the following email I sent to Marc Hamilton three days later in the week, after I noticed his blog entry (referred to in a post to one of the OpenSolaris mailing lists, which prompted me to read it):

--------- begin Marc Hamilton email -----------
Date: Thu, 8 Nov 2007 10:22:09 -0600 (CST)
From: Al Hopper
To: Marc Hamilton
Subject: 100k download - hard to believe

Hi Marc,

I saw your recent blog[1] and the number you quoted for Project Indiana downloads (100,000) does not look reasonable to me. A "back-of-the-napkin" calculation reveals that for 100,000 downloads of 660226048 bytes per iso image, delivered over 72 hours, you'd be pushing over 2Gbits/Sec to the 'net. From a quick test of dlc.sun.com, it looks like you've got a 5Mbit/Sec cap on your connection (my best tech guess).

I had an earlier email "conversation" with Jesse Silver - where he asked me for our genunix.org stats[2] and quoted his download numbers. I expressed scepticism that his numbers were accurate - suggesting that they may have counted the number of download transactions from the http access logs, rather than accumulating a count of the bytes transferred per in-preview.iso transaction and dividing the result by 660226048 (the size of the iso image). As of about 1 hour ago, we've shipped 808 copies of in-preview.iso.

I would suggest that you update your blog ASAP.
Comments welcome.

[1] http://blogs.sun.com/marchamilton/entry/busy_weekend
[2] we are providing downloads of the in-preview.iso file

--------- end Marc Hamilton email -----------
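
For the curious, here's that back-of-the-napkin number reproduced with bc(1), using the iso size and the 72-hour window quoted above - the result is in Gbits/Sec:
echo 'scale=2; 100000 * 660226048 * 8 / (72 * 3600) / 10^9' | bc
It prints 2.03 - in other words, a sustained 2+ Gbits/Sec feed for three solid days, from a link that looked capped at roughly 5 Mbits/Sec.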

I didn't receive any feedback from Marc 28+ hours later - hence this blog entry.
Should inaccurate hype be allowed to go unchallenged? What do you think?

PS: screen capture of Marc Hamilton's blog as of Fri Nov 9th 19:28 Pacific

Sunday, September 30, 2007

Setup ZFS boot for Build 72

This cheat sheet will use a very simple and minimal PXE boot to help you set up a machine with ZFS boot and SXCE build 72 (and later). In all, from scratch, you should be able to complete this entire process in about one hour! We make the following assumptions:

  • the install server is on the install network at 192.168.80.18
  • the install server is using a ZFS based filesystem with a pool called tanku. The user's home directory is also in this pool, at /tanku/home/al
  • the target machine has ethernet address: 00:e0:81:2f:e1:4f
  • there are no other DHCP servers active on the install network
Verify that your ethernet interface supports PXE boot. Most systems do - except for low-end ethernet cards that don't have an option ROM. Determine the ethernet address of the interface you'll be using for PXE boot. Make a note of this address.
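
If the target box already runs Solaris, a quick way to read the address is shown below; note that ifconfig only prints the ether line when run as root, and only for interfaces that are already plumbed. Otherwise, simply note the address displayed by the PXE ROM banner at power-on.
# as root on the target machine - lists the MAC address of each plumbed interface
ifconfig -a | grep ether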

Download Lori Alt's and Dave Miner's ZFS boot tools:
wget http://www.opensolaris.org/os/community/install/files/zfsboot-kit-20060418.i386.tar.bz2

Yes - the filename says 20060418, but the date should be 20070418. Unzip and untar the kit - in this case it'll end up in /tanku/home/al/zfsboot/20070418 (aka ~al/zfsboot/20070418):
cd
mkdir zfsboot
cd zfsboot
bunzip2 -c zfsboot-kit-20060418.i386.tar.bz2 | tar xvf -
Notice that the directory name has been changed to 20070418. Find and read the README file. But don't spend too much time studying it. This cheat sheet will tell you what to do.

On the install server, set up a ZFS-bootable netinstall image for b72:
mkdir /mnt72
chown root:sys /mnt72
chmod 755 /mnt72
# FYI only: /solimages is an NFS mount
lofiadm -a /solimages/sol-nv-b72-x86-dvd.iso
This assumes that lofiadm returned "/dev/lofi/2" (lofiadm -a prints the name of the device it created)
mount -F hsfs -o ro /dev/lofi/2 /mnt72
zfs create tanku/b72installzfs
zfs set sharenfs='ro,anon=0' tanku/b72installzfs
cd /mnt72/Solaris_11/Tools
./setup_install_server /tanku/b72installzfs
cd /tanku/home/al/zfsboot/20070418
The next step takes around 13 minutes (why?)
ptime ./patch_image_for_zfsboot /tanku/b72installzfs
Remove the DVD image mount and cleanup
umount /mnt72
lofiadm -d /dev/lofi/2
Verify that you can mount /tanku/b72installzfs on another machine as a quick test. Better to check this now than to troubleshoot it later. Use a mount command similar to:
mount -F nfs -o ro,vers=3,proto=tcp 192.168.80.18:/tanku/b72installzfs /mnt
Now cd to the Tools subdirectory in the prepared ZFS boot area - in this case /tanku/b72installzfs:
cd /tanku/b72installzfs/Solaris_11/Tools
Generate the target client files:
./add_install_client -d -e 00:e0:81:2f:e1:4f -s 192.168.80.18:/tanku/b72installzfs i86pc
You'll see instructions to add the client macros, something like:
If not already configured, enable PXE boot by creating
a macro named 0100E0812FE14F with:
Boot server IP (BootSrvA) : 192.168.80.18
Boot file (BootFile) : 0100E0812FE14F
Using the screen-by-screen guide at http://www.sun.com/bigadmin/features/articles/jumpstart_x86_x64.jsp
starting at step 5, entitled "Configure and Run the DHCP Server", set up the DHCP server and add the required two macros. NB: Ignore everything up to step 5. You don't need any of it!

At step 5.n ("Type the number of IP addresses and click Next"), you should consider adding more than two addresses, in case something else on this network (unexpectedly) requests a DHCP lease.
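
If you'd rather skip the GUI for the macro step, a dhtadm sketch along the following lines should be equivalent - treat it as an untested shorthand rather than part of the tested procedure; the BootSrvA/BootFile values come straight from the add_install_client output above:
# add the client macro to the dhcptab, then ask in.dhcpd to reread it
dhtadm -A -m 0100E0812FE14F -d ':BootSrvA=192.168.80.18:BootFile=0100E0812FE14F:'
dhtadm -g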

Now add the two macros, using the name 0100E0812FE14F. Note well: the macro must have the correct name. Then verify that the tftp-based files are available. Again - a quick test now will save you a bunch of troubleshooting time down the road.
df | grep tftp
It should look something *like* this:
/tanku/b72installzfs/boot    260129046 3564877 256564169     2%    /tftpboot/I86PC.Solaris_11-2
Test that the tftp files can be successfully retrieved via tftp:
$ cd /tmp
$ tftp 192.168.80.18
tftp> get 0100E0812FE14F
Received 134028 bytes in 0.0 seconds
tftp> quit
Don't forget to clean up:
rm /tmp/0100E0812FE14F
Enable FTP on your boot server to allow snagging the zfs boot profile file:
svcadm enable ftp
Change your password before you dare use FTP, and remember to use a disposable password - because it can be sniffed on the LAN. After you're finished using FTP, restore your original password.

Now enable PXE boot in the target system's BIOS.
Boot the target system.
During the early phases of booting, press F12 ASAP.

You should see the machine contact the DHCP server and start downloading the required files within a couple of seconds.

NB: verify that the ethernet address displayed by the PXE code is the one you expected and is associated with the physical interface in use. Some machines pick the ethernet port that will be used for PXE boot for you - you simply don't have a choice. Newer BIOSes allow you to enable PXE separately for each supported interface. Expect to see a GRUB prompt for the release you're installing (i.e., b72).

There is a known bug with build 72 that you might encounter when the target machine contacts the DHCP server. If you see something similar to:
Alarm Clock

ERROR: Unable to configure the network interface
exiting to shell
Then you've hit bug 6598201. The workaround is simply to enter ^D (control-D) in the terminal, and the process will continue as if nothing had happened.

Select 4 (Console install) - it's the least likely to cause you issues. If you're using bge0 as the PXE boot interface, ensure that you leave the bge0 interface enabled for networking ("[x] bge0") - otherwise you won't be able to "see" the install server. Fill in the minimum required config details, and take the first Exit option as soon as you see one. Now you should be looking at a command-line prompt.

The following assumes you've set up a profile file called (simply) profile.zfs on your boot server. See the samples below.

At the prompt:
cd /tmp
ftp 192.168.80.18
user:
password:
(use the dummy login/password you set up earlier)
get profile.zfs
quit
Now load the system with pfinstall:
pfinstall /tmp/profile.zfs
The system should begin loading Solaris within a couple of seconds.

Sample ZFS boot profile #1 (simple).
You may wish to change the cluster type to SUNWCXall (see the next sample).
install_type initial_install
cluster SUNWCall
filesys c1t0d0s1 auto swap
pool mypool free / mirror c1t0d0s0 c2t1d0s0
dataset mypool/be1 auto /
dataset mypool/be1/usr auto /usr
dataset mypool/be1/opt auto /opt
dataset mypool/be1/var auto /var
dataset mypool/be1/export auto /export

Sample ZFS boot profile #2 (more complex).

Note the subtle change in the cluster name in this sample. We will load all the available locales by using the geo keyword. This will almost double the required install disk space. Instead of the default C system locale, we'll make the system default en_US.UTF-8.
install_type initial_install
cluster SUNWCXall
system_locale en_US.UTF-8
geo N_Africa
geo C_America
geo N_America
geo S_America
geo Asia
geo Ausi
geo C_Europe
geo E_Europe
geo N_Europe
geo S_Europe
geo W_Europe
geo M_East
filesys c1t0d0s1 auto swap
pool tanks free / mirror c1t0d0s0 c2t0d0s0
dataset tanks/be1 auto /
dataset tanks/be1/usr auto /usr
dataset tanks/be1/opt auto /opt
dataset tanks/be1/var auto /var
dataset tanks/be1/export auto /export
With cluster SUNWCXall and no additional geo regions, you should be ready to reboot in 7 to 10 minutes. Now reboot the machine gracefully:
init 6
That's it! Your machine should now reboot successfully.
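A quick way to convince yourself that the box really did come up on a ZFS root (the pool/dataset names here follow sample #1 - adjust to suit):
zpool status mypool
zfs list -r mypool
df -k /
The root filesystem should show up as mypool/be1.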
Enjoy!

PS: Don't forget to change back your password and disable FTP on the install server. If you're going to reboot the install server, remember to remove the /etc/vfstab entry for /tftpboot - or the machine will not boot cleanly to run-level 3.
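
For reference, the lofs entry that add_install_client creates should look something like the line below (my best guess, based on the df output shown earlier) - remove it or comment it out. The FTP service is disabled the same way it was enabled:
# in /etc/vfstab, remove (or comment out) the line added by add_install_client:
# /tanku/b72installzfs/boot  -  /tftpboot/I86PC.Solaris_11-2  lofs  -  yes  ro
svcadm disable ftp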

Friday, June 17, 2005

Genunix.Org is Alive



genunix.org equipment with the N2120 staged for the camera only


Take a look at GenUnix.Org. There's not much content there now, beyond a mirror of the OpenSolaris launch files and some video from the first OpenSolaris User Group meeting, but that'll change in the future. Cyril Plisko has an operational Subversion (SVN) source repository hosted at the site.

How genunix.org got started (Part 1 of 2)

Early in May, I got the idea to host an OpenSolaris Community/Mirror site. The first step was to leave a message for Paul Vixie of Internet Systems Consortium (ISC) - because I knew that they were already hosting kernel.org and a bunch of other successful Open Source projects. I wanted to add OpenSolaris to that list.

Within a week I had been contacted by Peter Losher and we got an OK to proceed. I could hardly believe it - access to a clean one gigabit connection to the internet with the rackspace, power, cooling and bandwidth sponsored by ISC.

Next I needed to scrounge up some equipment. We (at Logical Approach) decided to sponsor the site with a maxxed out V20Z: two 146 gigabyte drives, 8 gigabytes of memory and two AMD 252 (2.6GHz) Opteron processors. This would ensure that the site would go online and indicate our commitment to the project. However, I was reluctant to bring up the site to support the upcoming launch of OpenSolaris with just one server. I wanted high performance .... but also realized that high reliability and high availability were primary requirements.

So I put together a generic technical spec - generic in that it described the basic architectural building blocks of the site, but did not specify vendor-specific part numbers or detailed configuration. The spec also broke down the equipment into two procurement phases, called a Starter System Configuration and an Enhanced System Configuration. This would allow the site to go online with the starter config and, later, to be expanded to the enhanced config. Here is what the top level generic spec looked like:

Starter System Configuration Overview
  1. Server Load Balancer (aka Application Switch) standalone appliance with:
  • 12 * gigabit ethernet ports configured:
    - 2 * optical ports to connect to the ISC infrastructure
    - 10 * copper UTP ports to connect to the web servers
  • 2 * A/C power supplies
  2. Four 1U dual AMD Opteron based rackmount servers, each configured with:
  • 2 * AMD Opteron 252 (2.6GHz) CPUs
  • 8Gb RAM
  • 2 * 146Gb U320 SCSI disk drives
  • 2 * built-in copper gigabit ethernet ports
  • 1 * dual-port gigabit ethernet expansion card
Enhanced System Configuration Overview
  1. One Fibre Channel (FC) SAN disk subsystem configured with:
  • 12 * 146Gb Fibre Channel 3.5" disk drives
  • 2 * RAID controllers with 1GB cache each and battery backup
  • 4 * 2Gb/Sec FC host ports
  • 2 * A/C power supplies
  2. Four Fibre Channel host adapters:
  • PCI 64-bit low profile form factor
  • 2Gb/Sec optical LC connectors
  • 2m optical cable
As you can tell, the reliability/availability comes from using a Server Load Balancer (SLB), aka Application Switch, to load balance incoming requests across multiple backend servers. The load balancer issues periodic health checks and, assuming all 4 servers are healthy, requests are distributed to the available servers in the defined pool according to the selected load balancing algorithm. The real beauty of this approach is that you can also do scheduled maintenance on any of the servers by "telling" the SLB to take a particular server out of the available pool. You wait until all active sessions expire on the server, then disconnect it. Now you are free to upgrade or repair it. Let's assume you're upgrading the Operating System. After you've completed the upgrade, you have plenty of time to test exhaustively, because the other servers in the pool are serving your client requests. When you're satisfied that the upgraded server is ready for production, simply tell the SLB to put it back into the pool. Your user community experiences no impact and is completely unaware that you've just upgraded a server.
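
To make that maintenance flow concrete, here is its general shape in pseudo-CLI form - the command names are purely hypothetical (every SLB vendor spells them differently, and this is not the N2120 syntax); the sequence is the point:
# hypothetical SLB commands - illustrative only
slb server disable web3     # stop sending new sessions to web3
slb server sessions web3    # poll until the active session count drains to zero
                            # ... upgrade, repair and test web3 at your leisure ...
slb server enable web3      # put web3 back into the pool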

This architecture is also cost effective - because you can treat each server as a throwaway server. I don't mean this literally. Each server can have a single power supply, a single SCSI bus, or non-mirrored disks - because if it fails, it will have little impact on the service you're providing. This is in stark contrast to using high end (read: expensive) servers with multiple power supplies, multiple disk subsystem busses and mirrored disk drives.

Next the generic spec was translated into a detailed vendor specific specification, including a parts list. Of course I preferred that Sun would provide hardware sponsorship - so there was a little Sun bias in the original generic spec. For the servers, I really wanted to use the Sun V20Z - it's an awesome server based on the AMD Opteron processor and runs Solaris based applications with impressive speed and efficiency.

I ran the spec by the other members of the CAB as a sanity check. No feedback = good news. Next I presented it to Jim Grisanzio and Stephen Harpster. Initially I got a No - for various reasons. Then Simon Phipps (also a CAB member) told me to forward the proposal to John Fowler.

In the meantime, I was busy upgrading Logical's V20Z with the required new CPUs, expanded memory capacity and a couple of 146Gb disk drives. Unfortunately the new CPUs were not compatible with the existing motherboard or Voltage Regulator Modules (VRMs). The V20Z uses a separate VRM for the CPU and memory. The Sun 252 processor upgrade kits came with the required VRMs - so that was not an issue. But the included documentation indicated the requirement for a revision K2.5 motherboard, or, in Sun's terminology, the Super FRU Chassis assembly, where FRU means Field Replaceable Unit. Since this was a Sun supplied upgrade, I called Sun's tech support and explained the issue. In less than an hour I had a case number and was told that a replacement motherboard would be dispatched.

It takes about one hour of careful work to strip your existing motherboard and "transplant" the parts [1] to the replacement, and then about 10 minutes to install the new CPU, heatsink, and CPU and memory VRMs. It helps if you are comfortable working on PC hardware - if not, I'd recommend that you find someone who is. One (big) advantage of the updated motherboard is (IMHO) quieter speed-controlled fans and support for DDR400 memory parts (with the upgraded CPU).

On June 1 an email arrived with the news I had been awaiting. Bill Channel now had my request, via John Fowler, for Hardware sponsorship and he was ready to get started on making this happen! :)

The hardware was scheduled for delivery on Monday June 6th.

Note [1]: CDROM/floppy assembly, SCSI backplane, PCI risers, Power Supply, SCSI backplane cable assemblies, daughter board with keyboard/mouse connectors, memory, disk drive(s).


Continued in Part II.

How Genunix.Org got started (part 2 of 2)













Peter Losher, Al Hopper

Ben Rockwood

It's 16 hours before the public launch of OpenSolaris as I write this paragraph and I'm getting really excited, but I'm also really tired. I've been working furiously to try to get a community-run OpenSolaris site online in time to support the launch. The actual hardware didn't arrive at Logical Approach until late afternoon on Monday the 6th of June. The site hardware consists of 4 Sun V20Z servers in a maxxed out configuration - two 146 gigabyte drives, 8 gigabytes of memory and two AMD 252 (2.6GHz) Opteron processors each. Three of the V20Zs and an N2120 Application Switch (aka Server Load Balancer (SLB)) were sponsored by Sun; the 4th V20Z was contributed by our company, Logical Approach.

Now if I didn't have a day job at Logical, having this hardware arrive on Monday afternoon, with a scheduled install in an internet co-location facility 1,700 miles away on the Friday of the same week, would be doable. A push, yes, but doable. But, unfortunately, we have our customers to look after and that week was pretty busy around here - aside from the ongoing Community Advisory Board (CAB) activity and trying to keep up with the OpenSolaris Pilot program and mailing lists, which were becoming increasingly noisy as we approached launch.

We put off the planned Friday hardware install in Palo Alto until Saturday, shipped out 3 of the machines for overnight delivery on Thursday, and then shipped out the 4th V20Z and the Application Switch on Friday for Saturday (next day) AM delivery. You don't want to know what our FedEx bill looked like!

But not everything went smoothly. In fact, we got stymied by the Application Switch configuration process. Now, you may already know this, but load balancers, as a class of tech toys, are complex devices. That is just the nature of the beast. And, unfortunately, moving from one load balancer to another is like moving from one country to a foreign one: almost everything you learned previously is instantly obsoleted - including the language. The terminology changes completely. The menu system changes; the order of configuration setup steps changes. In short, you may even feel that your previous experience (with Foundry Networks ServerIron and Alteon Websystems (now subsumed by Nortel)) is more of a curse than a blessing. Again, that's the nature of the beast.

So I raised my hand for help (from Sun) on Friday around midday (Central Time). Yep .... that's a great time to ask for help! And finding the right person at Sun can be daunting, especially within such a large organization. It turns out that the N2120 Application Switch is a product Sun acquired when they bought Nauticus Networks. It also happens that the right person to help configure the switch was off getting married. How inconsiderate of him (just kidding)!

So we shipped the switch, in a state of less than digital nirvana, overnight to Palo Alto, and I was on the first flight on Saturday morning, departing from DFW (Dallas Fort-Worth) for San Francisco. The flight was great - the fun started after the plane landed. I had to check a roller-board type case, because it contained hand tools that would have been confiscated if I had tried to bring it onboard as carry-on baggage. It also contained a bunch of CAT-5 and CAT-6 ethernet cables - so it was likely to be given close scrutiny and hand checked.

After the flight landed at SFO, it took about 40 minutes for that bag to make it to the baggage area! Now you know why everyone uses carry-on baggage and all the storage space inside the cabin gets exhausted on most flights. Next up: Budget car rental. The first thing that was easy to see was that the entire car rental area at the airport was mobbed. I stood in line for about 30 minutes, got the paperwork done, and then the lady helping me had an argument with someone responsible for getting cleaned cars ready for pickup in the nearby parking garage. She slammed down the phone angrily. Another woman cut across me and spoke sternly to her, asking "Why do I have to wait for my car?" The answer - because it had to be moved and cleaned. The woman looked at me, said "I don't know why I have to wait...", and hurried away still muttering. I guess she was in a rush to continue waiting. I asked if it would help if I took a dirty car. "No" came the response, along with a look that told me to quit while I was still ahead. In the meantime I gave my ISC contact, Peter Losher, a heads-up message that I was going to be late to the CoLo. He had already begun his drive there from his office, after I told him I was in the car rental line. I was driving out of the car rental garage about 50 minutes after first getting in line! How is that for service! :(

So I got to the CoLo in record time. It's possible that I may have exceeded the legal speed limit on the drive there. Luckily, I didn't have a law enforcement officer confirm whether I did or did not - so I'll admit to the possibility only! :) I arrived at the CoLo close to 1:30 PM - a far cry from the 11:00 AM planned time. Here I met up with Ben Rockwood, the other person crazy enough to try to make the community site happen on such a tight deadline.

So what happened with the App switch, I hear you ask? Well, help didn't arrive in time and we shipped it at the last possible moment on Friday evening. Luckily, and thanks to FedEx, everything made it to the CoLo and was waiting for us when we arrived. We immediately got down to work. The racks were too shallow for this current generation of 1U gear - which Peter rightly says "grew in depth" to make up for what it lost in height! We had to mount the equipment on shelves, and we had to move some shelves to get ones deep enough for the cabinet we were assigned. When it was time to mount the App switch, we could not find any more deep shelves, so we had no option but to mount it on an available shallow shelf in a neighbouring rack. Before we did that, I prevailed on everyone present to mount a shallow bracket in the same rack as the V20Zs and temporarily mount the App switch, with a human holding up the other end, so that we could get a group shot of all the equipment front panels. In the pictures you see, the 3rd person is actually holding up the rear of the App switch so that the other 2 people (the person in front of the camera and the person behind the camera) can snap a photo with all the equipment front panels visible.

We also had a funny incident, where Ben, in his rush to get the job done, applied a mounting rail to the wrong side of a V20Z. This was before we figured out that we had to mount the servers on shelves. Since it was mounted backwards, its locking mechanism locked the rail in place and then became inaccessible. It proved very difficult to remove. This was one of those embarrassing moments that anyone would rather forget. Ben's ultimate solution, however, was ingenious. He removed the end tab from his tape measure and slid the tape down the narrow space between the rail and the computer case, until he was able to poke at the locking mechanism and release it. Later that day, after using the tape measure to measure rack spacing, the spring-loaded retraction mechanism gobbled up the tape and it disappeared inside the case forever - since it no longer had the tab which would normally prevent that from happening!

We ended up being practically thrown out of the CoLo by 4:00 PM - our work complete. This was unfortunate, because I had planned on spending considerable time at the CoLo, but Peter was supposed to be somewhere else at 3:00 PM and we had to leave.

The CoLo cage is a difficult environment to work in; it's noisy and you're constantly bombarded with hot and very dry air. We yelled at each other the entire time and I ended up dehydrated - because of the dry air and not drinking much water that day. We also missed a meal - and that took its toll on us all, especially since we'd all been working crazy hours that week. Peter was in the worst shape - he was still recovering from a really bad case of the flu.

Without the App switch, the original, carefully designed site architecture had to be discarded. We pretty much had to redesign it on the fly - so that if App switch config help arrived on Monday, we would still be able to make use of its load balancing capabilities. So some of the server ports were connected to the App switch and others were connected to Peter's new HP 2824 gigabit switch. The IP addressing plan was obsoleted too - the App switch has NAT (Network Address Translation) facilities, and without it, we had no NAT capability. There were other features provided by the App switch which were also part of the design, which I won't go into. We barely had enough time to configure routeable addresses on the SP (Service Processor) management ports and set them up. That was to be our point of entry into the system to make the other changes that were required without the App switch - since the servers had been configured per the original design (with the App switch present) while they were on the bench.

On Day 2 (Sunday) of the install I prevailed on Peter to let me work in the CoLo, but had to promise him it would only be an hour, maximum. I made the best use of the hour, doing a lot of tidying-up work that should have been completed the day before. We also discussed the ISC network topology and peering and made plans to have ISC host our DNS records, at least initially. We also got the App switch console wired into a terminal server - but setting up remote access to the terminal server, and hence access to the App switch console port, would be deferred until later that day. Populating the DNS data would also be left until Peter was in a much more hospitable work environment - his office.

Both Ben and I assumed we would have as many hours as we needed at the CoLo - but that was not to be, because the policies in place demanded that we be chaperoned. This was a big factor in the issues that plagued us later. We were simply too rushed and didn't have time to check, then double-check, our work. After leaving the CoLo I jumped on an earlier flight and ended up back in the DFW area around 11:00 PM. Perfect. On the way home from the airport, I know I exceeded the speed limit. I had a police officer catch up with me about a mile after he first saw the car, and tell me my exact speed at the time: 82 MPH. The following morning I started the work of finishing up the machine configs and making the changes mandated by the lack of the App switch. I hit a "minor" (yeah, right!) problem. I could connect to the V20Z SP (Service Processor), but I could not see a console login! I could not access the Operating System. Initially I took this in stride. I had only played with the SP facility on the V20Z previously - it was just a case of reading the manual and figuring it out, or so I thought.

Meantime there were other fires to extinguish. I was still seeing DNS errors - our new domain, genunix.org, was not resolvable. I also didn't know how to access the App switch console port remotely. A couple of calls/emails to Peter and he assured me that all would be well with DNS within 30 minutes and that we'd be able to get to the App console port soon. In the meantime I had an email from a gentleman at Sun offering support for the App switch. He emailed very early that morning (around 7:00 AM), but we didn't have console access to the App switch until around noon. In the meanwhile I had sent him a scaled-back App switch specification and details of our (modified) addressing scheme. I told him that I was looking to gain access to the servers that were already (physically) connected to the App switch, but had been unable to figure out how to put the required switch ports in the correct VLAN and enable them. His first issue was not being able to resolve the name of the unix box that ultimately gains us console access to the switch serial port. Then, after he got the IP address, he could not connect to it (no route to host) from his office. Obviously their office internet access (DNS and routing) was totally foobarred. Last I heard he was working the issue (fire up a GPRS cell modem on his laptop, or go home and use his home DSL), but it turns out he was unable to do either. So I'd been hacking on the App switch since about 4:00 PM (CDT) while Ben Rockwood took a fresh look at resolving the "no console access via the SP" issue - using any resources he could find online. We've probably hit an SP bug, in that the factory config (BIOS, SP code, OS, etc.) won't allow console access out of the box.

On the App switch, I got to the point where I could ping one of the private interfaces on the servers, but the switch does not appear to have an SSH client or a telnet client. The App switch specialist was able to confirm that it does not have SSH client capability; I came to the realization that it didn't have a telnet client all on my own, after I had achieved the ability to ping a server and went looking for the telnet command in the menu system. It does, however, have an SSH daemon and a telnet daemon. How strange. So by 6:00 PM Central, after 2 hours (wasted) hacking on the App switch config, I realized that access to the servers via the App switch had reached a dead end.

Meanwhile Ben was trying every trick he could think of to get around the SP-to-console login issue, making use of every resource he could find on the 'net. By about 8:15 PM or so this avenue was not yielding any results. So I talked with Ben and then reported the bad news to the OpenSolaris Pilot community and several people (via a CC list) that I had made promises to. It was bitterly disappointing.

Now it's 12:35 AM (CDT) on Tues the 14th: OpenSolaris Launch Day. Stories have already been posted and I got a call about 15 minutes ago from Ben, who says he has been successful in making arrangements to gain physical access to the CoLo. Pretty amazing. I was about to turn in ... but he'll need some help to test (now that's a novel idea, isn't it!) from the outside. So I'm working on this blog entry while waiting, and then we'll see what we can get done.

1:00 AM (CDT): I get a call from Ben and we're in business, thanks to his heroic efforts and the incredible co-operation of the ISC folks. So now we _start_ working on the machines.

4:15 AM (CDT) and I just emailed Cyril Plisko and let him know that the SVN repository zone and logins are ready for his use - as promised. He has his own zone on the machine, called svn, and the fully qualified hostname is svn.genunix.org. He just logged in and I'm checking with Ben R to see if there is anything he needs help with.

4:30 AM: I get some sleep while Ben continues to finish up the "starter" site.

8:15 AM (CDT) - one hour and 45 minutes before launch: I send an email to Derek Cicero telling him that we're ready for content. The subject line reads: "Rabbit emerges out of hat". After receiving a reply with a URL on where to get the content, I wake up Ben (by phone) and send him Derek's URL. Minutes later, there's content available on genunix.org! :)

9:00 AM (CDT): I get on the Sun launch conference call and help out where possible. It was a great opportunity to "live" the launch event. I continued to watch our site (www.genunix.org) and do some cleanup and further testing on it. The press releases go out, the file downloads begin to fly, and everything goes very, very smoothly.

10:00 AM (CDT), official launch time: looking at /. (slashdot) I see that OpenSolaris gets a comparatively easy introduction to the world. Nothing much of the dreaded Linux jihad emerges - thankfully.

Later in the morning I settle into my "Day Job" and interact a little on IRC and the OpenSolaris mailing list. It was your typical Monday. BTW: I hate Mondays!

So let's summarize what went wrong:
  • not setting realistic expectations and timelines
  • not anticipating air travel hassles
  • budgeting a 36 hour window of CoLo time; getting 3 1/2 hours
  • lack of testing
  • working a project while fatigued. Fatigue will allow you to make silly errors and not catch them. I know this from my pilot training.
  • delegating a simple task that looks like it can be done easily/quickly (DNS) and then failing to recognize that it's not a viable strategy.
  • using non-"mainstream" equipment that you're not familiar with.
  • assuming that if you can "talk" to the SP, you can get a console login.
  • and did I mention lack of testing?

And let's summarize what went right:
  • we pushed out 90 gigabytes of content on launch day.
  • we saw incredible cooperation from many, many exhausted individuals.
  • we saw Sun Nack, Ack and then deliver on hardware sponsorship in 18 days. That's pretty incredible for a company as large as Sun.
  • we got incredible co-operation from Peter and the folks at ISC.
In particular Peter worked really hard while still recovering from a bad dose of the 'flu and dealt with other disruptive events, like his laptop disk drive going on the blink and having to be replaced just before he went out of town on an important install.

There were many heroes in this tale. I've already mentioned some, and I apologize to those I have left out. PS: send me email if you want something included and I'll update this post.

Saturday, January 22, 2005

Linus must step aside



Have you noticed that company founders or leaders often give up their pivotal role in a company that they have founded or been instrumental in leading? They either step aside or are forced out. Why is this? Because, in order for the company to continue to make progress and grow, they need to step aside. The smart ones recognize this - the dumb ones..... oh well.

So let's examine this phenomenon. The people we are talking about usually share the following character traits: they are brilliant, very talented, visionary and very demanding to work for. These character traits are what make them different and allow them to create a company or product that others are incapable of. Those are the upsides. But there are corresponding downsides.

They are usually a royal pain in the ask to work with. Highly opinionated, very judgmental and apt to be very stubborn. They can also be inflexible. Again the smart ones recognize their weaknesses and surround themselves with other talented folk who help to balance out their personalities. It's not uncommon to find company partners with very different personalities and styles. And the dumb ones ... well no one can possibly be as smart or as talented as they are, and there's nothing wrong with their personalities anyway - so why even bother "playing well with others"!

So here's my point: Linus Torvalds must step aside and let Linux flourish. Linux has reached the personal limitations of Linus - its creator and mentor. It's currently limited by the mental boundaries and personality of its founder. Oh and yes - it appears that Linus does not recognize this problem and does not understand that he has to step aside.

Let's take a look at a serious limitation of Linux (the OS) which is a direct result of the limitations of Linus: the lack of a stable kernel API. According to Linus, having rigid APIs would limit the creativity of the kernel developers. Well ... yes it would, but it would also bring some discipline to the kernel code, and it would allow a driver developer to deliver a device driver that does not have to be re-written every time the APIs change. It would also stop hundreds of developers from constantly rewriting and retesting their code every time the APIs change. And it would force the kernel developers to think with their minds and not with their keyboards!

But is this doable? Can the kernel APIs remain stable and not stifle developer creativity? Answer: yes and yes. Look at Solaris 10 and the DTrace facility. Over 40,000 tracepoints in the kernel with negligible impact on performance, and yet the tens of thousands of lines of code that I've written, going back to Solaris 2.5 and earlier, still run on Solaris 10 without any changes! And the same code runs on SPARC and Solaris x86 - with just a simple recompile. Time is money - just think of the dollars saved by not having to constantly rewrite and retest Solaris based code.
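
If you have a Solaris 10 box handy, you can count the probes yourself - the exact number varies with the release and the kernel modules currently loaded, but it is of that order:
# list every available DTrace probe and count them (run as root)
dtrace -l | wc -l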

On the flipside Linux has one thing going for it that Solaris does not have - a vibrant and active volunteer "army" of developers. But that's about to change when OpenSolaris goes live later this year. I'm a member of the OpenSolaris Pilot program and it's interesting and exciting to be perusing the crown jewels of Sun ... Solaris source code. Just think of it; you're looking at the fruits of the labors of hundreds of man years of effort from some of the most talented developers on the planet. Awesome.

So step aside Linus - or be run over by the OpenSolaris juggernaut.