Recently in outage Category

network outage of 1 hour

There was a network outage this afternoon from 13:20 to 14:16 PST. The 216.218.223.64/26, 216.218.210.64/27, and 64.62.205.192/26 subnets were affected. The cause was a failed switch at LaFrance Internet Services, our colocation provider at Hurricane Electric; the switch has been replaced.
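If you are not sure whether your address falls inside one of those subnets, a quick check along these lines should tell you. This is only a sketch using Python's standard `ipaddress` module; the function name `was_affected` is made up for illustration.

```python
import ipaddress

# The three subnets named in the outage notice.
AFFECTED_SUBNETS = [
    ipaddress.ip_network("216.218.223.64/26"),
    ipaddress.ip_network("216.218.210.64/27"),
    ipaddress.ip_network("64.62.205.192/26"),
]


def was_affected(address: str) -> bool:
    """Return True if the address lies inside any affected subnet."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in AFFECTED_SUBNETS)


if __name__ == "__main__":
    # Example addresses, chosen only to exercise the check.
    for candidate in ("216.218.223.100", "216.218.210.96", "64.62.205.200"):
        print(candidate, was_affected(candidate))
```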

never again

will I schedule co-lo work on a Friday night. Bad idea. I'm tired from the work week, and I haven't had time to properly prepare and test everything. Downtime was a little over 3 hours. I'll figure out how that fits into the SLA in the morning.

But yeah. Also, I'm never again going to buy a server without hot-swap drives; this would have gone considerably faster with them.

And furthermore, I am going to abandon the more complex mirrored LVM setup I currently have in favor of using md to do the mirroring and using LVM on top of my MD devices. 
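For anyone curious what "LVM on top of md" looks like in practice, here is a rough sketch that only builds the commands and prints them, so nothing gets touched. The device names, the `/dev/md0` array name, and the `vg0` volume group name are placeholders for illustration, not the layout on any of our servers.

```python
def md_lvm_commands(disk_a, disk_b, md_device="/dev/md0", vg_name="vg0"):
    """Build (but do not run) the commands for an md mirror with LVM on top.

    All device and volume group names are hypothetical placeholders.
    """
    return [
        # 1. md does the mirroring: a two-device RAID 1 array.
        ["mdadm", "--create", md_device, "--level=1",
         "--raid-devices=2", disk_a, disk_b],
        # 2. The md array becomes a single LVM physical volume.
        ["pvcreate", md_device],
        # 3. The volume group sits on the array and never sees the raw disks.
        ["vgcreate", vg_name, md_device],
    ]


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    for argv in md_lvm_commands("/dev/sda1", "/dev/sdb1"):
        print(" ".join(argv))
```

The appeal is that LVM only ever deals with one device, and replacing a failed disk becomes purely an md operation.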


Downtime should be under 10 minutes. You are being moved onto the new server, which means that our new, low-priced plans are available. (I'm not done tweaking the prices/RAM; I will probably even out the curve a little, but the price per megabyte of RAM won't be going up.)

I will be removing two of our legacy servers to make room for our new server. (One of them doesn't have customers on it; only customers on hydra should be impacted by this.)

It has begun: hind is going down now. After that, hydra.

Late Friday night, I'm going to head down to the co-lo and replace the failed drive on Boar's mirror. If this task goes well, customers will notice nothing more than a short period of unreachability. The save/restore should put them back where they were without a reboot.

OK, starting now.

OK, the bad drive is a Samsung w/ serial s13Uj1nq108089.

Always mark serials before replacing drives... this much I have learned.


I'm moving all the prgmr.com servers at he.net into one rack. Irritating, I know, but it makes server management much easier, and I'll have more control over the network.

Unless I screw up the DHCP server (like I did last time), you should not notice anything more than being unreachable for 30 minutes. The server should automatically save and then restore your domain, with all programs running. If you get rebooted, I screwed something up.

rebooting hydra and lion

If I do it right, all you will notice is 5 minutes of inactivity, as Xen is configured to save all the DomUs to disk (much like how you can hibernate your laptop).

In the spirit of "only make each mistake once," we are installing serial consoles after the problem on boar the other day. We will also be upgrading the dom0 kernels to the latest CentOS kernel.

http://book.xen.prgmr.com/mediawiki/index.php/Serial_console

Coloma reboot.

We're going to need to reboot Coloma for maintenance and troubleshooting.  I'm targeting midnight tomorrow, PDT -- approximately 24 hours from now.

We expect only a few minutes of downtime.  After that you should be able to log in and restart your VM normally.  Unfortunately, the problem that's provoking the reboot seems to make console access impossible for now.

unplanned reboot of coloma

Coloma is my old i386-PAE box: dual Xeons in a Supermicro chassis. Kinda, well, old.

There were three problems. First, I let the userland Xen tools get out of sync with the kernel (an uncontrolled yum update is not a good thing). Second, I never tested the 'save domains on reboot' functionality on this old box. On the new servers, if I reboot the dom0, it does an 'xm save' on every running DomU and an 'xm restore' when it comes back up. Rather than seeing a reboot, the DomU owner might notice that the DomU was unavailable for 5-10 minutes, but everything that was running on it before would still be running; it would be like unplugging the ethernet cable for a while and plugging it back in. The third (and perhaps largest) problem was that I rebooted the server to deal with the first problem without scheduling it. That would have *maybe* been acceptable (but not good) on the new servers, but on the old ones it was a mistake.
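To make the save/restore idea concrete, here is a minimal sketch of the command sequence. It is not the script the servers actually run, and the save directory and the DomU names are assumptions for illustration. It only builds the `xm save` and `xm restore` command lines and prints them.

```python
from pathlib import PurePosixPath

# Hypothetical directory for the saved DomU memory images.
SAVE_DIR = PurePosixPath("/var/lib/xen/save")


def _guests(domains):
    """Drop the dom0 itself; only DomUs get saved and restored."""
    return [name for name in domains if name != "Domain-0"]


def save_commands(domains):
    """One 'xm save <domain> <file>' per running DomU, run before the reboot."""
    return [["xm", "save", name, str(SAVE_DIR / name)] for name in _guests(domains)]


def restore_commands(domains):
    """One 'xm restore <file>' per saved DomU, run after the dom0 is back up."""
    return [["xm", "restore", str(SAVE_DIR / name)] for name in _guests(domains)]


if __name__ == "__main__":
    running = ["Domain-0", "alice", "bob"]  # made-up domain names
    for argv in save_commands(running) + restore_commands(running):
        print(" ".join(argv))
```

The point of testing this ahead of time is that a save that fails on an old box turns a few minutes of unreachability into a real reboot for the customer.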

I'll schedule things better in the future.

The server is going through burn-in as we speak.

As I mention on the main page, we ran out of space the other day. We are putting in a new server, boar, and one of my ancient Catalyst switches with 'port monitor' (SPAN) capabilities, so I will be un-breaking the bridge on lion, and bandwidthD and my inward-facing IDS will both continue to function.

This will require us to physically reconfigure the network (just moving cables; if we don't screw it up, downtime should be less than 60 seconds, with no reboot or anything, just a few dropped packets).

There was a problem with my upstream provider on tahoe and coloma (where all current customers are), causing us to intermittently drop packets. Rippleweb.com doesn't have info on the outage on their website yet.
