whetstone reboot

We needed to reboot whetstone this morning because of this error:
BUG: soft lockup detected on CPU#0!

Call Trace:
 <IRQ> [<ffffffff8025894a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df6c>] timer_interrupt+0x3a8/0x402
 [<ffffffff80258c34>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80258d20>] __do_IRQ+0xa4/0x105
 [<ffffffff8020bd6c>] do_IRQ+0x44/0x4d
 [<ffffffff80351f4c>] evtchn_do_upcall+0x19e/0x256
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff8035d93e>] show_rd_sect+0x0/0x68
 [<ffffffff802ee0bc>] __read_lock_failed+0x8/0x14
 [<ffffffff803494de>] get_device+0x17/0x20
 [<ffffffff804024cd>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff8035d965>] show_rd_sect+0x27/0x68
 [<ffffffff802be588>] sysfs_read_file+0xa5/0x12c
 [<ffffffff8028031c>] vfs_read+0xcb/0x171
 [<ffffffff802806fb>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5

We have seen this before on some of our other dom0s, so we plan to upgrade any that show this problem to Xen 4 eventually. The downtime lasted 6 hours; users on whetstone will get a free month.

But no reboot. I'm rebuilding off a hot spare.

[lsc@branch ~]$ cat /proc/mdstat
Personalities : [raid1] 
md2 : active raid1 sdc2[0] sda2[1]
      478375424 blocks [2/2] [UU]
      
md1 : active raid1 sdf2[2] sdd2[3](F) sde2[4](F) sdb2[0]
      478375424 blocks [2/1] [U_]
      [=>...................]  recovery =  6.4% (30769792/478375424) finish=524.6min speed=14216K/sec
      
md0 : active raid1 sdf1[3] sdd1[4](F) sde1[5](F) sdc1[0] sdb1[1] sda1[2]
      10008384 blocks [4/4] [UUUU]
      
unused devices: <none>
[lsc@branch ~]$ 
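
For anyone who wants to watch a rebuild like this without eyeballing the output, here is a rough sketch of a parser for the mdstat text above. It is only an illustration, not what we run: the function name `parse_mdstat` and the dictionary layout are made up for this example, the regular expressions only cover the line shapes shown in this paste, and it assumes the block counts are in 1K units so they can be divided by the K/sec speed. It flags arrays with fewer active members than expected, lists members marked (F), and re-derives the rebuild time from the remaining blocks and the reported speed.

```python
import re

# Sample copied from the /proc/mdstat paste above.
SAMPLE = """\
Personalities : [raid1]
md2 : active raid1 sdc2[0] sda2[1]
      478375424 blocks [2/2] [UU]

md1 : active raid1 sdf2[2] sdd2[3](F) sde2[4](F) sdb2[0]
      478375424 blocks [2/1] [U_]
      [=>...................]  recovery =  6.4% (30769792/478375424) finish=524.6min speed=14216K/sec

md0 : active raid1 sdf1[3] sdd1[4](F) sde1[5](F) sdc1[0] sdb1[1] sda1[2]
      10008384 blocks [4/4] [UUUU]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.*)$")
BLOCKS_RE = re.compile(r"blocks.*\[(\d+)/(\d+)\]")
RECOVERY_RE = re.compile(
    r"recovery\s*=\s*([\d.]+)%\s+\((\d+)/(\d+)\)\s+"
    r"finish=([\d.]+)min\s+speed=(\d+)K/sec"
)


def parse_mdstat(text):
    """Return {array_name: info} for each md array found in the text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            name, state, level, members = m.groups()
            current = {
                "state": state,
                "level": level,
                # Members flagged (F) have been marked failed by md.
                "failed": [d.split("[")[0] for d in members.split()
                           if d.endswith("(F)")],
                "degraded": False,
                "recovery": None,
            }
            arrays[name] = current
            continue
        if current is None:
            continue
        m = BLOCKS_RE.search(line)
        if m:
            # [expected/active]: fewer active than expected means degraded.
            expected, active = int(m.group(1)), int(m.group(2))
            current["degraded"] = active < expected
        m = RECOVERY_RE.search(line)
        if m:
            done, total, speed = int(m.group(2)), int(m.group(3)), int(m.group(5))
            current["recovery"] = {
                "percent": float(m.group(1)),
                "reported_finish_minutes": float(m.group(4)),
                # Assumes blocks are 1K units, matching the K/sec speed.
                "eta_minutes": (total - done) / speed / 60,
            }
    return arrays


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE).items():
        print(name, info)
```

On a live box you would pass in the contents of /proc/mdstat instead of `SAMPLE`. The recomputed estimate will drift a little from the kernel's own `finish=` figure, since the kernel averages the speed differently.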

[root@stables ~]# xm create -c billing_e_test
Using config file "/etc/xen/billing_e_test".
Error: Creating domain failed: name=billing_e_test
[root@stables ~]# 

however, domains that are up, are up. I can log into one of mine on that box, and all appears well.

working on it.

everyone should be back up, you can reboot your domains now, too.

dish reboot

We just needed to reboot dish.prgmr.com because of this error:
BUG: soft lockup detected on CPU#0!

Call Trace:
 <IRQ> [<ffffffff8025894a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df6c>] timer_interrupt+0x3a8/0x402
 [<ffffffff80258c34>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80258d20>] __do_IRQ+0xa4/0x105
 [<ffffffff8020bd6c>] do_IRQ+0x44/0x4d
 [<ffffffff80351f4c>] evtchn_do_upcall+0x19e/0x256
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff8035d93e>] show_rd_sect+0x0/0x68
 [<ffffffff802ee0b9>] __read_lock_failed+0x5/0x14
 [<ffffffff803494de>] get_device+0x17/0x20
 [<ffffffff8040415d>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff8035d965>] show_rd_sect+0x27/0x68
 [<ffffffff802be588>] sysfs_read_file+0xa5/0x12c
 [<ffffffff8028031c>] vfs_read+0xcb/0x171
 [<ffffffff802806fb>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5


diagnostics ongoing.

hamper to be rebooted shortly


replacing more disk

It has come to my attention that prgmr.com does not have a written, publicly accessible privacy policy. Below, I have pasted a first draft. Please give me feedback. Note, I've been editing this draft in place... this is /not/ the final version, I'm just soliciting feedback.

prgmr.com will not release private customer data except in the following cases:

1. in order to comply with ARIN requirements for new IP blocks, we will release
   the name or business name to ARIN. we will be executing the ARIN non-disclosure
   agreement, which requires that ARIN keep your names secret except in the case
   of a court order [1]

2. We will comply with any valid court orders issued by courts that have 
   jurisdiction.

3. we use automated and manual processes to examine network traffic while looking for problems.  

4. we will never examine your disk without permission.   (we may ask you to let us examine your disk or to leave, but if you don't give us permission, we won't examine the disk without a court order.)

5. we may examine network traffic with both manual and automated processes.   the results of this examination won't be shared without a court order.  

6. we may log and examine your serial console while looking for system problems. 




If this document needs to be amended, I will do my best to minimize the impact
on customers, and I will email the address on file with a notice.  If customers
wish to quit a long term contract because of an amendment to this document, any
early termination fees will be waived, and the customer will be given a prorated 
refund based on time used.  
 



[1]https://www.arin.net/resources/agreements/nda.pdf 

Data retention is kind of a sticky thing. See, the longer I keep the data, the easier it is for me to spot trends and ongoing problems. But obviously, customers don't want me to keep shit around forever, and without a defined data retention policy, I think it's legally harder for me to tell law enforcement "we don't have that data" when they come knocking.

What if I had a clause that said "I give you access to all data I'm retaining about you at http://blah/customer" - it would be more work for me but it would allow me to have longer data retention (which is good for troubleshooting) without pissing off customers, especially if I add a 'delete this' button... but I don't know where that puts me legally.

of course, that is technically more difficult... but I could release a tool that others could use. (I'd tie the login to the email)

so, for the past few days I've been trying to get my initial IP allocation from ARIN. Here is what they say:

hostmaster@arin.net writes:

> Hello,
>
> Thank you for your reply.  This is close to what we needed but we
> still need you to provide the actual customer name for each IP
> assignment in the list provided please.
>

I called and asked if this was also policy when DSL providers asked for IP addresses that would be statically assigned, and they said this was true of all static IP addresses.

I explained that my existing policy prevents me from releasing personal information without a court order.

So, for now, I will be buying more IPs from my upstream. Until this is solved, we will not be giving people more than one IP per VPS.

If you are okay with me giving your full name to ARIN (under NDA) please email me. if 1024 of you are okay with that, my problem is solved. Please note, they aren't looking for email, postal address or anything else, just your name.

ugh. all users of hamper will get a month credit. heading down to work on it now.

edit: hamper is back up.

So, hamper had 4 drives (a stripe of mirrors) with a fifth, a spare. many months ago, one of the drives began failing. I removed that drive from the raid, and rebuilt onto the hot spare.

Earlier today, when I was dealing with the DoS, I thought I'd pull the drive and return it for warranty service. Bad idea. The computer seized up.

It appears that for 8 hours, writes didn't go through to the hard drives. I have reset the drive, and hamper appears functional again. the outage appears to be from 02:36:36 to 20:46:47 PST

Other than the 8 hours of no writes, it appears that there was no data loss. if you are on hamper and are still having problems, please let me know.

This has encouraged me to accelerate my long-talked-about backup plan.

we're currently seeing 160 megabits/sec on a 100 megabit pipe, so it's almost certainly a DoS of some sort. we're working on it, and hopefully will get it cleaned up faster than the problem at svtix. (It appears to be unrelated; a different customer.)