<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Xen hosting: Lessons from the Trenches</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/" />
    <link rel="self" type="application/atom+xml" href="http://wiki.xen.prgmr.com/xenophilia/atom.xml" />
    <id>tag:wiki.xen.prgmr.com,2008-03-02:/xenophilia/2</id>
    <updated>2010-07-26T19:49:26Z</updated>
    
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Open Source 4.1</generator>

<entry>
    <title>whetstone reboot</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/07/whetstone-reboot.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.214</id>

    <published>2010-07-26T19:40:23Z</published>
    <updated>2010-07-26T19:49:26Z</updated>

    <summary><![CDATA[We needed to reboot whetstone this morning because of BUG: soft lockup detected on CPU#0!Call Trace:&nbsp;&lt;IRQ&gt; [&lt;ffffffff8025894a&gt;] softlockup_tick+0xce/0xe0&nbsp;[&lt;ffffffff8020df6c&gt;] timer_interrupt+0x3a8/0x402&nbsp;[&lt;ffffffff80258c34&gt;] handle_IRQ_event+0x4e/0x96&nbsp;[&lt;ffffffff80258d20&gt;] __do_IRQ+0xa4/0x105&nbsp;[&lt;ffffffff8020bd6c&gt;] do_IRQ+0x44/0x4d&nbsp;[&lt;ffffffff80351f4c&gt;] evtchn_do_upcall+0x19e/0x256&nbsp;[&lt;ffffffff80209d8e&gt;] do_hypervisor_callback+0x1e/0x2c&nbsp;&lt;EOI&gt; [&lt;ffffffff8035d93e&gt;] show_rd_sect+0x0/0x68&nbsp;[&lt;ffffffff802ee0bc&gt;] __read_lock_failed+0x8/0x14&nbsp;[&lt;ffffffff803494de&gt;] get_device+0x17/0x20&nbsp;[&lt;ffffffff804024cd&gt;] .text.lock.spinlock+0x53/0x8a&nbsp;[&lt;ffffffff8035d965&gt;] show_rd_sect+0x27/0x68&nbsp;[&lt;ffffffff802be588&gt;] sysfs_read_file+0xa5/0x12c&nbsp;[&lt;ffffffff8028031c&gt;] vfs_read+0xcb/0x171&nbsp;[&lt;ffffffff802806fb&gt;] sys_read+0x45/0x6e&nbsp;[&lt;ffffffff802097b2&gt;] tracesys+0xab/0xb5We have seen this before on some...]]></summary>
    <author>
        <name>nick</name>
        <uri>http://www.schmalenberger.us/</uri>
    </author>
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[We needed to reboot whetstone this morning because of this error:<br />BUG: soft lockup detected on CPU#0!<br /><br />Call Trace:<br />&nbsp;&lt;IRQ&gt; [&lt;ffffffff8025894a&gt;] softlockup_tick+0xce/0xe0<br />&nbsp;[&lt;ffffffff8020df6c&gt;] timer_interrupt+0x3a8/0x402<br />&nbsp;[&lt;ffffffff80258c34&gt;] handle_IRQ_event+0x4e/0x96<br />&nbsp;[&lt;ffffffff80258d20&gt;] __do_IRQ+0xa4/0x105<br />&nbsp;[&lt;ffffffff8020bd6c&gt;] do_IRQ+0x44/0x4d<br />&nbsp;[&lt;ffffffff80351f4c&gt;] evtchn_do_upcall+0x19e/0x256<br />&nbsp;[&lt;ffffffff80209d8e&gt;] do_hypervisor_callback+0x1e/0x2c<br />&nbsp;&lt;EOI&gt; [&lt;ffffffff8035d93e&gt;] show_rd_sect+0x0/0x68<br />&nbsp;[&lt;ffffffff802ee0bc&gt;] __read_lock_failed+0x8/0x14<br />&nbsp;[&lt;ffffffff803494de&gt;] get_device+0x17/0x20<br />&nbsp;[&lt;ffffffff804024cd&gt;] .text.lock.spinlock+0x53/0x8a<br />&nbsp;[&lt;ffffffff8035d965&gt;] show_rd_sect+0x27/0x68<br />&nbsp;[&lt;ffffffff802be588&gt;] sysfs_read_file+0xa5/0x12c<br />&nbsp;[&lt;ffffffff8028031c&gt;] vfs_read+0xcb/0x171<br />&nbsp;[&lt;ffffffff802806fb&gt;] sys_read+0x45/0x6e<br />&nbsp;[&lt;ffffffff802097b2&gt;] tracesys+0xab/0xb5<br /><br />We have seen this before on some of our other dom0s, so we're planning to eventually upgrade any dom0 that has this problem to Xen 4. The downtime lasted 6 hours; users on whetstone will get a free month.<br /> ]]>
        
    </content>
</entry>

<entry>
    <title>rebuilding a disk on branch, expect some slowness</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/07/rebuilding-a-disk-on-branch-ex.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.213</id>

    <published>2010-07-17T09:56:23Z</published>
    <updated>2010-07-17T10:04:34Z</updated>

    <summary> But no reboot. I&apos;m rebuilding off a hot spare. [lsc@branch ~]$ cat /proc/mdstat Personalities : [raid1] md2 : active raid1 sdc2[0] sda2[1] 478375424 blocks [2/2] [UU] md1 : active raid1 sdf2[2] sdd2[3](F) sde2[4](F) sdb2[0] 478375424 blocks [2/1] [U_] [=&gt;...................]...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[ <p>
But no reboot.  I'm rebuilding off a hot spare.  
</p>
<pre>
[lsc@branch ~]$ cat /proc/mdstat
Personalities : [raid1] 
md2 : active raid1 sdc2[0] sda2[1]
      478375424 blocks [2/2] [UU]
      
md1 : active raid1 sdf2[2] sdd2[3](F) sde2[4](F) sdb2[0]
      478375424 blocks [2/1] [U_]
      [=>...................]  recovery =  6.4% (30769792/478375424) finish=524.6min speed=14216K/sec
      
md0 : active raid1 sdf1[3] sdd1[4](F) sde1[5](F) sdc1[0] sdb1[1] sda1[2]
      10008384 blocks [4/4] [UUUU]
      
unused devices: <none>
[lsc@branch ~]$ 

</pre>]]>
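<![CDATA[<p>
For anyone who wants to sanity-check the finish estimate in that recovery line, here is a minimal Python sketch. It assumes the pair in parentheses is 1K blocks done/total and that the speed is in K per second, which is how /proc/mdstat is usually read; the name recovery_eta_minutes is made up for this example.
</p>
<pre>
```python
import re

# Recovery line copied from the /proc/mdstat output above.
LINE = "[=>...................]  recovery =  6.4% (30769792/478375424) finish=524.6min speed=14216K/sec"


def recovery_eta_minutes(line):
    """Estimate minutes remaining from the done/total counts and the speed.

    Assumes the counts are 1K blocks and the speed is K/sec (an assumption
    about the mdstat format, not something stated in the post).
    """
    m = re.search(r"\((\d+)/(\d+)\).*speed=(\d+)K/sec", line)
    if m is None:
        return None  # not a recovery/resync progress line
    done, total, speed = (int(g) for g in m.groups())
    # remaining 1K blocks divided by K/sec gives seconds; convert to minutes
    return (total - done) / speed / 60


eta = recovery_eta_minutes(LINE)
print(f"about {eta:.1f} minutes remaining at the current speed")
```
</pre>
<p>
The result should land close to the kernel's own finish= figure. The speed moves around with load on the box, so treat either number as a rough guess.
</p>]]>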
        
    </content>
</entry>

<entry>
    <title>problem on stables... up domains are OK, down domains stay down.</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/07/problem-on-stables-up-domains.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.212</id>

    <published>2010-07-17T00:35:55Z</published>
    <updated>2010-07-17T04:45:41Z</updated>

    <summary> [root@stables ~]# xm create -c billing_e_test Using config file &quot;/etc/xen/billing_e_test&quot;. Error: Creating domain failed: name=billing_e_test [root@stables ~]# however, domains that are up, are up. I can log into one of mine on that box, and all appears well. working...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[ <pre>
[root@stables ~]# xm create -c billing_e_test
Using config file "/etc/xen/billing_e_test".
Error: Creating domain failed: name=billing_e_test
[root@stables ~]# 
</pre>
<p>
However, domains that are already up are staying up.  I can log into one of mine on that box, and all appears well.  
</p><p>
Working on it. </p>

<p>
Edit: everyone should be back up, and you can reboot your domains now, too. 
</p>
]]>
        
    </content>
</entry>

<entry>
    <title>dish reboot</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/07/dish-reboot.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.211</id>

    <published>2010-07-07T03:53:41Z</published>
    <updated>2010-07-07T07:28:25Z</updated>

    <summary><![CDATA[We just needed to reboot dish.prgmr.com because of this error:BUG: soft lockup detected on CPU#0!Call Trace:&nbsp;&lt;IRQ&gt; [&lt;ffffffff8025894a&gt;] softlockup_tick+0xce/0xe0&nbsp;[&lt;ffffffff8020df6c&gt;] timer_interrupt+0x3a8/0x402&nbsp;[&lt;ffffffff80258c34&gt;] handle_IRQ_event+0x4e/0x96&nbsp;[&lt;ffffffff80258d20&gt;] __do_IRQ+0xa4/0x105&nbsp;[&lt;ffffffff8020bd6c&gt;] do_IRQ+0x44/0x4d&nbsp;[&lt;ffffffff80351f4c&gt;] evtchn_do_upcall+0x19e/0x256&nbsp;[&lt;ffffffff80209d8e&gt;] do_hypervisor_callback+0x1e/0x2c&nbsp;&lt;EOI&gt; [&lt;ffffffff8035d93e&gt;] show_rd_sect+0x0/0x68&nbsp;[&lt;ffffffff802ee0b9&gt;] __read_lock_failed+0x5/0x14&nbsp;[&lt;ffffffff803494de&gt;] get_device+0x17/0x20&nbsp;[&lt;ffffffff8040415d&gt;] .text.lock.spinlock+0x53/0x8a&nbsp;[&lt;ffffffff8035d965&gt;] show_rd_sect+0x27/0x68&nbsp;[&lt;ffffffff802be588&gt;] sysfs_read_file+0xa5/0x12c&nbsp;[&lt;ffffffff8028031c&gt;] vfs_read+0xcb/0x171&nbsp;[&lt;ffffffff802806fb&gt;] sys_read+0x45/0x6e&nbsp;[&lt;ffffffff802097b2&gt;] tracesys+0xab/0xb5...]]></summary>
    <author>
        <name>nick</name>
        <uri>http://www.schmalenberger.us/</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[We just needed to reboot dish.prgmr.com because of this error:<br />BUG: soft lockup detected on CPU#0!<br /><br />Call Trace:<br />&nbsp;&lt;IRQ&gt; [&lt;ffffffff8025894a&gt;] softlockup_tick+0xce/0xe0<br />&nbsp;[&lt;ffffffff8020df6c&gt;] timer_interrupt+0x3a8/0x402<br />&nbsp;[&lt;ffffffff80258c34&gt;] handle_IRQ_event+0x4e/0x96<br />&nbsp;[&lt;ffffffff80258d20&gt;] __do_IRQ+0xa4/0x105<br />&nbsp;[&lt;ffffffff8020bd6c&gt;] do_IRQ+0x44/0x4d<br />&nbsp;[&lt;ffffffff80351f4c&gt;] evtchn_do_upcall+0x19e/0x256<br />&nbsp;[&lt;ffffffff80209d8e&gt;] do_hypervisor_callback+0x1e/0x2c<br />&nbsp;&lt;EOI&gt; [&lt;ffffffff8035d93e&gt;] show_rd_sect+0x0/0x68<br />&nbsp;[&lt;ffffffff802ee0b9&gt;] __read_lock_failed+0x5/0x14<br />&nbsp;[&lt;ffffffff803494de&gt;] get_device+0x17/0x20<br />&nbsp;[&lt;ffffffff8040415d&gt;] .text.lock.spinlock+0x53/0x8a<br />&nbsp;[&lt;ffffffff8035d965&gt;] show_rd_sect+0x27/0x68<br />&nbsp;[&lt;ffffffff802be588&gt;] sysfs_read_file+0xa5/0x12c<br />&nbsp;[&lt;ffffffff8028031c&gt;] vfs_read+0xcb/0x171<br />&nbsp;[&lt;ffffffff802806fb&gt;] sys_read+0x45/0x6e<br />&nbsp;[&lt;ffffffff802097b2&gt;] tracesys+0xab/0xb5<br /><br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>knife froze up hard and was rebooted</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/07/knife-froze-up-hard-and-was-re.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.210</id>

    <published>2010-07-04T08:00:50Z</published>
    <updated>2010-07-04T08:01:19Z</updated>

    <summary> diagnostics ongoing....</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
         diagnostics ongoing.
        
    </content>
</entry>

<entry>
    <title>hamper to be rebooted shortly</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/06/hamper-to-be-rebooted-shortly.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.209</id>

    <published>2010-06-29T00:57:13Z</published>
    <updated>2010-06-29T00:58:38Z</updated>

    <summary> replacing more disk...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[<p>
replacing more disk
</p>
]]>
        
    </content>
</entry>

<entry>
    <title>Please comment on the first draft of the prgmr.com privacy policy </title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/06/please-comment-on-the-first-dr.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.208</id>

    <published>2010-06-26T00:05:33Z</published>
    <updated>2010-07-09T22:44:05Z</updated>

    <summary> It has come to my attention that prgmr.com does not have a written, publicly accessible privacy policy. Below, I have pasted a first draft. Please give me feedback. Note, I&apos;ve been editing this draft in place... this is /not/...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[ <p>It has come to my attention that prgmr.com does not have a written, publicly accessible privacy policy.     Below, I have pasted a first draft.   Please give me feedback.  Note, I've been editing this draft in place... this is /not/ the final version, I'm just soliciting feedback.
</p>
<p>

<pre>
prgmr.com will not release private customer data except in the following cases:

1. in order to comply with ARIN requirements for new IP blocks, we will release
   the name or business name to ARIN. we will be executing the ARIN non-disclosure
   agreement, which requires that ARIN keep your names secret except in the case
   of a court order [1]

2. We will comply with any valid court orders issued by courts that have 
   jurisdiction.

3. we use automated and manual processes to examine network traffic while looking for problems.  

4. we will never examine your disk without permission.   (we may ask you to let us examine your disk or to leave, but if you don't give us permission, we won't examine the disk without a court order.)

5. we may examine network traffic with both manual and automated processes.   the results of this examination won't be shared without a court order.  

6. we may log and examine your serial console while looking for system problems. 




If this document needs to be amended, I will do my best to minimize the impact
on customers, and I will email the address on file with a notice.  If customers
wish to quit a long term contract because of an amendment to this document, any
early termination fees will be waived, and the customer will be given a prorated 
refund based on time used.  
 



[1]https://www.arin.net/resources/agreements/nda.pdf 
</pre>

<p>
Data retention is kind of a sticky thing.  See, the longer I keep the data, the easier it is for me to spot trends and ongoing problems.   But obviously, customers don't want me to keep shit around forever, and without a defined data retention policy, I think it's legally harder for me to tell law enforcement "we don't have that data" when they come knocking.
</p>
<p>

What if I had a clause that said "I give you access to all data I'm
retaining about you at http://blah/customer"?  It would be
more work for me, but it would allow me to have longer data retention
(which is good for troubleshooting) without pissing off customers, especially if I add a 'delete this' button... but I don't know where that puts me legally.
</p>
<p>
Of course, that is technically more difficult... but I could release
a tool that others could use.  (I'd tie the login to the email.)
</p>]]>
        
    </content>
</entry>

<entry>
    <title>roadblock in getting an ARIN allocation</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/06/roadblock-in-getting-an-arin-a.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.207</id>

    <published>2010-06-24T21:12:59Z</published>
    <updated>2010-06-24T21:29:50Z</updated>

    <summary> so, for the past few days I&apos;ve been trying to get my initial IP allocation from ARIN. Here is what they say: hostmaster@arin.net writes: &gt; Hello, &gt; &gt; Thank you for your reply. This is close to what we...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[<p>
so, for the past few days I've been trying to get my initial IP allocation from ARIN.   Here is what they say:
</p>
<p>
<pre>
hostmaster@arin.net writes:

> Hello,
>
> Thank you for your reply.  This is close to what we needed but we
> still need you to provide the actual customer name for each IP
> assignment in the list provided please.
>
</pre>
</p>
<p>

I called and asked if this was also policy when DSL providers asked for IP addresses that would be statically assigned, and they said this was true of all static IP addresses. 
</p>
<p>
I explained that my existing policy prevents me from releasing personal
information without a court order.

</p>
<p>
So,  for now, I will be buying more IPs from my upstream.   Until this is solved, we will not be giving people more than one IP per VPS.  
</p>
<p>
If you are okay with me giving your full name to ARIN (under NDA)  please email me.    if 1024 of you are okay with that, my problem is solved.  Please note, they aren't looking for email, postal address or anything else, just your name.
</p>]]>
        
    </content>
</entry>

<entry>
    <title>I messed up hamper when I was at the co-lo</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/06/it-looks-like-i-messed-up-hamp.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.206</id>

    <published>2010-06-22T02:13:40Z</published>
    <updated>2010-06-22T06:39:02Z</updated>

    <summary> ugh. all users of hamper will get a month credit. heading down to work on it now. edit: hamper is back up. So, hamper had 4 drives (a stripe of mirrors) with a fifth, a spare. many months ago,...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[ ugh.  all users of hamper will get a month credit.    heading down to work on it now.  
<p>
edit:  hamper is back up.  
<p>
So, hamper had 4 drives (a stripe of mirrors)  with a fifth, a spare.   many months ago, one of the drives began failing.  I removed that drive from the raid, and rebuilt onto the hot spare.  
<p>
Earlier today, when I was dealing with the DoS, I thought I'd pull the drive and return it for warranty service.  Bad idea.    The computer seized up.   
<p>
It appears that  for 8 hours, writes didn't go through to the hard drives.      I have reset the drive, and hamper appears functional again.   the outage appears to be from 02:36:36  to 20:46:47  PST
<p>
Other than the 8 hours of no writes, it appears that there was no data loss.    if you are on hamper and are still having problems, please let me know.  

<p>
This has encouraged me to accelerate my long-talked-about backup plan.  ]]>
        
    </content>
</entry>

<entry>
    <title>DoS in the he.net Fremont location</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/06/dos-in-the-henet-freemont-loca.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.205</id>

    <published>2010-06-21T20:38:36Z</published>
    <updated>2010-06-21T21:35:20Z</updated>

    <summary>we&apos;re currently seeing 160 megabits/sec on a 100m pipe, so it&apos;s almost certainly a DoS of some sort. we&apos;re working on it, and hopefully will get it cleaned up faster than the problem at svtix. (It appears to be unrelated;...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        we&apos;re currently seeing 160 megabits/sec on a 100m pipe, so it&apos;s almost certainly a DoS of some sort.   we&apos;re working on it, and hopefully will get it cleaned up faster than the problem at svtix.  (It appears to be unrelated;  a different customer.)
        
    </content>
</entry>

<entry>
    <title>robe network driver</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/06/robe-network-driver.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.204</id>

    <published>2010-06-21T12:55:25Z</published>
    <updated>2010-06-21T13:09:39Z</updated>

    <summary> Robe was down since yesterday with the too many iterations (6) in nv nic irq rx problem in the forcedeth ethernet driver. The problem was unrelated to the DDoS attack yesterday. Users on robe will get an additional free...</summary>
    <author>
        <name>nick</name>
        <uri>http://www.schmalenberger.us/</uri>
    </author>
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[ Robe was down since yesterday with <a href="http://book.xen.prgmr.com/mediawiki/index.php/Peth0:_too_many_iterations_%286%29_in_nv_nic_irq_rx.">the too many iterations (6) in nv nic irq rx problem</a> in the forcedeth ethernet driver. The problem was unrelated to the DDoS attack yesterday. Users on robe will get an additional free month.<br /><div><br /></div>]]>
        
    </content>
</entry>

<entry>
    <title>network problems at SVTIX</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/06/network-problems-at-svtix.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.203</id>

    <published>2010-06-20T14:20:37Z</published>
    <updated>2010-06-20T15:01:02Z</updated>

    <summary> We&apos;ve been working on it for 8 hours. will update when we know more. Edit: looks like it /may/ be a DDoS. I mean, there /is/ a DDoS of one of my customers, and that /may/ be why the...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[<p> We've been working on it for 8 hours.   will update when we know more.  
</p>
<p>
Edit: looks like it /may/ be a DDoS.   I mean, there /is/ a DDoS of one of my customers, and that /may/ be why the router can't stand up.    I asked my upstream to blackhole the target IP (I hate finishing the job for the attacker, but at the moment it's my only choice.)    If that fixes it, we should be up within half an hour.  If not, then I will probably need to replace my router, a process that will probably take closer to 5 hours.  
</p>
<p>
Yes, in fact it was a DDoS.   Null routing the target at my upstream solved the problem.
</p>
<p>
I do want to make a personal apology for taking more than 8 hours to figure this out (and it was nick who deserves credit for finally figuring it out, not me).  I can explain some of it by the fact that the problem happened about the time I normally go to sleep, and the symptoms /looked/ a lot like the MAC address conflict I had quite some time ago.  But still, I should have figured this out in half an hour.  There's really no excuse.
</p>
<p>
As per policy, all affected customers will get 1 month credit.   (I probably won't get the credits sent out for a few days, but you /will/ get them.)  This will be painful, but not fatal. 
</p>
]]>
        
    </content>
</entry>

<entry>
    <title>sorry about the early past-due notices</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/06/sorry-about-the-early-pastdue.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.202</id>

    <published>2010-06-18T04:50:21Z</published>
    <updated>2010-06-18T05:00:13Z</updated>

    <summary><![CDATA[There was a miscommunication, and past due notices were sent out to everyone more than 10 days late (most notably, this caught my customers who are paying twice a month because they have two accounts billed on different days.)&nbsp; when...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[There was a miscommunication, and past due notices were sent out to everyone more than 10 days late (most notably, this caught my customers who are paying twice a month because they have two accounts billed on different days.)&nbsp; when they should have only been sent to new people who were 10 days past due.&nbsp; (Obviously, I'm a little quicker to shut you off if you've never given me any money.)&nbsp; <br /><br />Anyhow, please accept my apology, and don't worry, we won't shut off existing customers for at least 30 days.&nbsp; <br /><br />edit:&nbsp; Please note;&nbsp; nobody was cut off.&nbsp; we just sent out the warning emails, and some people were understandably worried that their domain might get shut down. <br /> ]]>
        
    </content>
</entry>

<entry>
    <title>on logging serial consoles.</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/06/on-logging-serial-consoles.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.201</id>

    <published>2010-06-15T20:40:04Z</published>
    <updated>2010-06-15T21:26:57Z</updated>

    <summary>So every now and again a customer will complain of a crashing domain. Occasionally, it is an early sign of a hardware problem that I need to deal with, so I don&apos;t want to just ignore it. Now, the problem...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
        <category term="Business" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="new features" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="security" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="privacy" label="privacy" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="serialconsole" label="serial console" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="support" label="support" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[<p>So every now and again a customer will complain of a crashing domain.    Occasionally, it is an early sign of a <a href="http://wiki.xen.prgmr.com/xenophilia/2009/06/see-this-is-why-i-dont-assume.html">hardware problem</a> that I need to deal with,  so I don't want to just ignore it.
</p><p>
Now, the problem is that like a physical server, once the domain has rebooted, most of the information about why it crashed is gone.   (and what little is left is in /var/log on the guest, and as a general rule we don't like mucking around in the guest.  that's your business, not ours.)  
</p><p>
Now, on a physical server, we solve this by using a logging serial console.  (I recommend <a href="http://opengear.com/">opengear</a> if you have the money, and a used cyclades if you don't.   The 'buddy system'  (making one server the console server for the next, then the next server the console server for the first) usually requires adding usb serial dongles, but is cheaper still for installations with only a few servers.   I personally like the IOgear brand usb -> serial dongles Fry's has.) 
</p><p>
I can turn on debug logging in xenconsoled and that will log the console for all domains to a file (one file for each domain)  then I can use those logs to troubleshoot the problem.   The thing is,  apparently some people have <a href="http://www.kuro5hin.org/story/2009/7/16/14112/0851">privacy concerns</a>  with this, so I haven't done it yet.
</p><p>
Now, personally, I don't think serial consoles are that sensitive.  I mean, it's common to leave terminals in data centers where passers-by can see the output.   The logs will let me see what program is crashing, which may be sensitive, and depending on how you have the thing configured, I can see when people log in and log out.
</p><p>
So, I have several options.   
<ol>
<li>I could leave it as is, and continue to go back and forth and guess when someone asks me why something crashed after a reboot</li>
<li>I can log all consoles and delete the data once a week or once a month</li>
<li> I can apply a <a href="http://lists.xensource.com/archives/html/xen-devel/2010-05/msg01036.html">patch</a> to log some people's consoles and not others, and let the user decide</li>
</ol>
</p><p>
Obviously, option 2 makes my life a /whole lot/ easier.  Option 3 is better than option 1, but it still means maintaining an out-of-tree xenconsoled (or pushing the patch upstream).  
</p>]]>
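<![CDATA[<p>
The cleanup half of option 2 could be as small as a cron job. Below is a minimal Python sketch, assuming the console logs land as one file per domain in a single directory. The directory path and the function names are hypothetical, and none of this is part of xenconsoled itself.
</p>
<pre>
```python
import os
import time


def expired_logs(log_dir, max_age_days, now=None):
    """Return paths of regular files in log_dir last modified more than
    max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    old = []
    for entry in os.scandir(log_dir):
        if entry.is_file() and cutoff > entry.stat().st_mtime:
            old.append(entry.path)
    return sorted(old)


def purge(log_dir, max_age_days):
    """Delete the expired logs and return the list of removed paths."""
    removed = expired_logs(log_dir, max_age_days)
    for path in removed:
        os.remove(path)
    return removed


if __name__ == "__main__":
    # Hypothetical location; point this at wherever the console logs end up.
    console_dir = "/var/log/xen/console"
    if os.path.isdir(console_dir):
        for path in purge(console_dir, 7):
            print("removed", path)
```
</pre>
<p>
Going by modification time means a domain that is still writing to its console keeps its log, so a real version would probably rotate the files first and then purge the rotated copies.
</p>]]>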
        
    </content>
</entry>

<entry>
    <title>crock reboot again</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/05/crock-reboot-again.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.200</id>

    <published>2010-05-29T20:21:29Z</published>
    <updated>2010-05-29T20:44:18Z</updated>

    <summary><![CDATA[Crock needed to reboot again because the dom0 kernel hung, with the same error:BUG: soft lockup detected on CPU#0!Call Trace:&nbsp;&lt;IRQ&gt; [&lt;ffffffff8025758a&gt;] softlockup_tick+0xce/0xe0&nbsp;[&lt;ffffffff8020df48&gt;] timer_interrupt+0x3a0/0x3fa&nbsp;[&lt;ffffffff80257874&gt;] handle_IRQ_event+0x4e/0x96&nbsp;[&lt;ffffffff80257960&gt;] __do_IRQ+0xa4/0x105&nbsp;[&lt;ffffffff8020bd5c&gt;] do_IRQ+0x44/0x4d&nbsp;[&lt;ffffffff8034c980&gt;] evtchn_do_upcall+0x19e/0x250&nbsp;[&lt;ffffffff80209d8e&gt;] do_hypervisor_callback+0x1e/0x2c&nbsp;&lt;EOI&gt; [&lt;ffffffff803581ea&gt;] show_rd_sect+0x0/0x68&nbsp;[&lt;ffffffff802ebbf9&gt;] __read_lock_failed+0x5/0x14&nbsp;[&lt;ffffffff80343f3e&gt;] get_device+0x17/0x20&nbsp;[&lt;ffffffff803fc3fd&gt;] .text.lock.spinlock+0x53/0x8a&nbsp;[&lt;ffffffff80358211&gt;] show_rd_sect+0x27/0x68&nbsp;[&lt;ffffffff802bc351&gt;] sysfs_read_file+0xa5/0x12e&nbsp;[&lt;ffffffff8027e3f5&gt;] vfs_read+0xcb/0x171&nbsp;[&lt;ffffffff8027e7d4&gt;] sys_read+0x45/0x6e&nbsp;[&lt;ffffffff802097b2&gt;] tracesys+0xab/0xb5So we're thinking...]]></summary>
    <author>
        <name>nick</name>
        <uri>http://www.schmalenberger.us/</uri>
    </author>
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[Crock needed to reboot again because the dom0 kernel hung, with the same error:<br />BUG: soft lockup detected on CPU#0!<br /><br />Call Trace:<br />&nbsp;&lt;IRQ&gt; [&lt;ffffffff8025758a&gt;] softlockup_tick+0xce/0xe0<br />&nbsp;[&lt;ffffffff8020df48&gt;] timer_interrupt+0x3a0/0x3fa<br />&nbsp;[&lt;ffffffff80257874&gt;] handle_IRQ_event+0x4e/0x96<br />&nbsp;[&lt;ffffffff80257960&gt;] __do_IRQ+0xa4/0x105<br />&nbsp;[&lt;ffffffff8020bd5c&gt;] do_IRQ+0x44/0x4d<br />&nbsp;[&lt;ffffffff8034c980&gt;] evtchn_do_upcall+0x19e/0x250<br />&nbsp;[&lt;ffffffff80209d8e&gt;] do_hypervisor_callback+0x1e/0x2c<br />&nbsp;&lt;EOI&gt; [&lt;ffffffff803581ea&gt;] show_rd_sect+0x0/0x68<br />&nbsp;[&lt;ffffffff802ebbf9&gt;] __read_lock_failed+0x5/0x14<br />&nbsp;[&lt;ffffffff80343f3e&gt;] get_device+0x17/0x20<br />&nbsp;[&lt;ffffffff803fc3fd&gt;] .text.lock.spinlock+0x53/0x8a<br />&nbsp;[&lt;ffffffff80358211&gt;] show_rd_sect+0x27/0x68<br />&nbsp;[&lt;ffffffff802bc351&gt;] sysfs_read_file+0xa5/0x12e<br />&nbsp;[&lt;ffffffff8027e3f5&gt;] vfs_read+0xcb/0x171<br />&nbsp;[&lt;ffffffff8027e7d4&gt;] sys_read+0x45/0x6e<br />&nbsp;[&lt;ffffffff802097b2&gt;] tracesys+0xab/0xb5<br /><br />So we're thinking this is a hardware problem and plan to put crock's disks into a new system that should be more stable.<br /> ]]>
        
    </content>
</entry>

</feed>
