Recently in outage Category

whetstone reboot

We needed to reboot whetstone this morning because of a soft lockup in the dom0 kernel:
BUG: soft lockup detected on CPU#0!

Call Trace:
 <IRQ> [<ffffffff8025894a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df6c>] timer_interrupt+0x3a8/0x402
 [<ffffffff80258c34>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80258d20>] __do_IRQ+0xa4/0x105
 [<ffffffff8020bd6c>] do_IRQ+0x44/0x4d
 [<ffffffff80351f4c>] evtchn_do_upcall+0x19e/0x256
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff8035d93e>] show_rd_sect+0x0/0x68
 [<ffffffff802ee0bc>] __read_lock_failed+0x8/0x14
 [<ffffffff803494de>] get_device+0x17/0x20
 [<ffffffff804024cd>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff8035d965>] show_rd_sect+0x27/0x68
 [<ffffffff802be588>] sysfs_read_file+0xa5/0x12c
 [<ffffffff8028031c>] vfs_read+0xcb/0x171
 [<ffffffff802806fb>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5

We have seen this before on some of our other dom0s, so we plan to eventually upgrade any dom0 that shows this problem to Xen 4. The downtime lasted 6 hours; users on whetstone will get a free month.

robe network driver

Robe had been down since yesterday because of the "too many iterations (6) in nv_nic_irq_rx" problem in the forcedeth Ethernet driver. The problem was unrelated to yesterday's DDoS attack. Users on robe will get an additional free month.

crock reboot again

Crock needed to be rebooted again because the dom0 kernel hung with the same error:
BUG: soft lockup detected on CPU#0!

Call Trace:
 <IRQ> [<ffffffff8025758a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df48>] timer_interrupt+0x3a0/0x3fa
 [<ffffffff80257874>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80257960>] __do_IRQ+0xa4/0x105
 [<ffffffff8020bd5c>] do_IRQ+0x44/0x4d
 [<ffffffff8034c980>] evtchn_do_upcall+0x19e/0x250
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff803581ea>] show_rd_sect+0x0/0x68
 [<ffffffff802ebbf9>] __read_lock_failed+0x5/0x14
 [<ffffffff80343f3e>] get_device+0x17/0x20
 [<ffffffff803fc3fd>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff80358211>] show_rd_sect+0x27/0x68
 [<ffffffff802bc351>] sysfs_read_file+0xa5/0x12e
 [<ffffffff8027e3f5>] vfs_read+0xcb/0x171
 [<ffffffff8027e7d4>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5

We now suspect a hardware problem, so we plan to move crock's disks into a new system that should be more stable.

Knife reboot

Knife stopped responding again (the other time was March 28) and I rebooted it from the hypervisor. We may need to move the disks to a new system.

knife reboot

Last night knife stopped responding, so this morning we rebooted it along with the domUs running on it. The downtime was 12 hours. On the serial console we couldn't log in to the dom0, and the dom0 kernel didn't respond to a break signal for the magic SysRq. The hypervisor did respond to its escape sequence (ctrl-a three times), so we were able to reboot the system from the hypervisor.

We don't know yet what caused the problem, but everyone on knife will get a free month. We are also planning to improve our monitoring with Nagios, and maybe add a pager system to notify us more effectively.
See http://book.xen.prgmr.com/mediawiki/index.php/Peth0:_too_many_iterations_%286%29_in_nv_nic_irq_rx. for details.

hydra rebooting shortly

We're trying to see if we can xm save the domUs as we did on lion (and unlike what we did on boar), but hydra is a pretty old box, so we might end up rebooting you.

[root@hydra /]# uptime
 18:56:26 up 410 days, 15:44,  2 users,  load average: 0.09, 0.29, 0.25

All servers will be rebooted (as lion was today) for some kernel upgrades and to consolidate all my he.net servers into one rack.
[root@lion ~]# uptime
 18:12:12 up 451 days, 17:42,  8 users,  load average: 0.01, 0.03, 0.00


As usual, if we don't screw it up, it will be 20 minutes of downtime and no reboot for you, thanks to xm save/restore.
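
For readers unfamiliar with the approach, here is a minimal sketch of the save/restore sequence, written as a dry-run planner in Python. It relies only on the classic `xm save <domain> <file>` and `xm restore <file>` commands; the domU names and the `/var/xen-save` directory are hypothetical, and this is an illustration rather than the exact procedure used on hydra or lion.

```python
import shlex


def plan_save_restore(domains, save_dir="/var/xen-save"):
    """Return (save_cmds, restore_cmds) as lists of argv lists.

    domains  -- names of running domains (the examples below are made up)
    save_dir -- hypothetical directory with room for the memory images
    """
    save_cmds, restore_cmds = [], []
    for name in domains:
        if name == "Domain-0":
            continue  # the dom0 is the thing being rebooted, not saved
        image = f"{save_dir}/{name}.save"
        save_cmds.append(["xm", "save", name, image])
        restore_cmds.append(["xm", "restore", image])
    return save_cmds, restore_cmds


if __name__ == "__main__":
    saves, restores = plan_save_restore(["Domain-0", "alice", "bob"])
    print("# before rebooting the dom0:")
    for cmd in saves:
        print(shlex.join(cmd))
    print("# after the dom0 is back up:")
    for cmd in restores:
        print(shlex.join(cmd))
```

The planner only prints commands, so it can be reviewed before anything is run on a real dom0. Each saved image is roughly the size of the domU's memory, so the save directory needs that much free space.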

prgmr.com network outage tonight

I screwed up: I gave two customers the same IP and MAC address, which took down the network. The problem should be fixed now.

As far as I can tell, we were down for about two hours.
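
A duplicate assignment like this can be caught with a simple uniqueness check before a new domU goes live. The sketch below is illustrative only: the `(customer, ip, mac)` tuple layout, the customer names, and the addresses are all made up, and it is not a description of the provisioning tooling actually in use.

```python
from collections import defaultdict


def find_duplicates(assignments):
    """Find IP or MAC addresses assigned to more than one customer.

    assignments -- iterable of (customer, ip, mac) tuples (hypothetical layout)
    Returns a dict mapping ("ip" | "mac", value) to the sorted customer names
    that share that value.
    """
    owners = defaultdict(set)
    for customer, ip, mac in assignments:
        owners[("ip", ip)].add(customer)
        # MAC addresses are case-insensitive, so normalise before comparing
        owners[("mac", mac.lower())].add(customer)
    return {key: sorted(names) for key, names in owners.items() if len(names) > 1}


if __name__ == "__main__":
    table = [
        ("alice", "192.0.2.10", "00:16:3E:00:00:01"),
        ("bob", "192.0.2.10", "00:16:3e:00:00:01"),
        ("carol", "192.0.2.11", "00:16:3e:00:00:02"),
    ]
    for (kind, value), names in find_duplicates(table).items():
        print(f"duplicate {kind} {value}: {', '.join(names)}")
```

Running a check like this whenever an address is handed out would flag the conflict before both domUs come up on the same network.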