<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>Xen hosting: Lessons from the Trenches</title>
        <link>http://wiki.xen.prgmr.com/xenophilia/</link>
        <description></description>
        <language>en-US</language>
        <copyright>Copyright 2010</copyright>
        <lastBuildDate>Mon, 26 Jul 2010 12:40:23 -0800</lastBuildDate>
        <generator>http://www.sixapart.com/movabletype/</generator>
        <docs>http://www.rssboard.org/rss-specification</docs>
        
        <item>
            <title>whetstone reboot</title>
            <description><![CDATA[We needed to reboot whetstone this morning because of <br />BUG: soft lockup detected on CPU#0!<br /><br />Call Trace:<br />&nbsp;&lt;IRQ&gt; [&lt;ffffffff8025894a&gt;] softlockup_tick+0xce/0xe0<br />&nbsp;[&lt;ffffffff8020df6c&gt;] timer_interrupt+0x3a8/0x402<br />&nbsp;[&lt;ffffffff80258c34&gt;] handle_IRQ_event+0x4e/0x96<br />&nbsp;[&lt;ffffffff80258d20&gt;] __do_IRQ+0xa4/0x105<br />&nbsp;[&lt;ffffffff8020bd6c&gt;] do_IRQ+0x44/0x4d<br />&nbsp;[&lt;ffffffff80351f4c&gt;] evtchn_do_upcall+0x19e/0x256<br />&nbsp;[&lt;ffffffff80209d8e&gt;] do_hypervisor_callback+0x1e/0x2c<br />&nbsp;&lt;EOI&gt; [&lt;ffffffff8035d93e&gt;] show_rd_sect+0x0/0x68<br />&nbsp;[&lt;ffffffff802ee0bc&gt;] __read_lock_failed+0x8/0x14<br />&nbsp;[&lt;ffffffff803494de&gt;] get_device+0x17/0x20<br />&nbsp;[&lt;ffffffff804024cd&gt;] .text.lock.spinlock+0x53/0x8a<br />&nbsp;[&lt;ffffffff8035d965&gt;] show_rd_sect+0x27/0x68<br />&nbsp;[&lt;ffffffff802be588&gt;] sysfs_read_file+0xa5/0x12c<br />&nbsp;[&lt;ffffffff8028031c&gt;] vfs_read+0xcb/0x171<br />&nbsp;[&lt;ffffffff802806fb&gt;] sys_read+0x45/0x6e<br />&nbsp;[&lt;ffffffff802097b2&gt;] tracesys+0xab/0xb5<br /><br />We have seen this before on some of our other dom0s so we're planning to upgrade them eventually to xen 4 if they have this problem. The downtime lasted 6 hours, users on whetstone will get a free month.<br /> ]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/07/whetstone-reboot.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/07/whetstone-reboot.html</guid>
            
                <category domain="http://www.sixapart.com/ns/types#category">outage</category>
            
            
            <pubDate>Mon, 26 Jul 2010 12:40:23 -0800</pubDate>
        </item>
        
        <item>
            <title>rebuilding a disk on branch, expect some slowness</title>
            <description><![CDATA[ <p>
But no reboot.  I'm rebuilding off a hot spare.  
</p>
<pre>
[lsc@branch ~]$ cat /proc/mdstat
Personalities : [raid1] 
md2 : active raid1 sdc2[0] sda2[1]
      478375424 blocks [2/2] [UU]
      
md1 : active raid1 sdf2[2] sdd2[3](F) sde2[4](F) sdb2[0]
      478375424 blocks [2/1] [U_]
      [=>...................]  recovery =  6.4% (30769792/478375424) finish=524.6min speed=14216K/sec
      
md0 : active raid1 sdf1[3] sdd1[4](F) sde1[5](F) sdc1[0] sdb1[1] sda1[2]
      10008384 blocks [4/4] [UUUU]
      
unused devices: <none>
[lsc@branch ~]$ 
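</pre>
<p>
As a rough cross-check on that finish estimate, here is a minimal Python sketch (not something taken from branch itself) that redoes the arithmetic from the md1 recovery line above. It assumes the block counts are in 1K blocks and the speed is in K/sec, so the units cancel; the function name is made up for illustration. The result should land close to the finish= figure the kernel printed.
</p>
<pre>
```python
import re

# Recovery line copied from the md1 entry in the mdstat output above.
RECOVERY_LINE = (
    "[=>...................]  recovery =  6.4% "
    "(30769792/478375424) finish=524.6min speed=14216K/sec"
)


def rebuild_estimate(line):
    """Return (percent_done, minutes_left) parsed from an mdstat recovery line."""
    match = re.search(r"\((\d+)/(\d+)\).*speed=(\d+)K/sec", line)
    if match is None:
        raise ValueError("not a recovery line")
    done, total, speed = (int(group) for group in match.groups())
    percent_done = 100.0 * done / total
    # Assumption: blocks are 1K and speed is K/sec, so blocks / speed = seconds.
    minutes_left = (total - done) / speed / 60.0
    return percent_done, minutes_left


percent_done, minutes_left = rebuild_estimate(RECOVERY_LINE)
print("%.1f%% done, about %.0f minutes left" % (percent_done, minutes_left))
```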

</pre>]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/07/rebuilding-a-disk-on-branch-ex.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/07/rebuilding-a-disk-on-branch-ex.html</guid>
            
            
            <pubDate>Sat, 17 Jul 2010 02:56:23 -0800</pubDate>
        </item>
        
        <item>
            <title>problem on stables... up domains are OK, down domains stay down.</title>
            <description><![CDATA[ <pre>
[root@stables ~]# xm create -c billing_e_test
Using config file "/etc/xen/billing_e_test".
Error: Creating domain failed: name=billing_e_test
[root@stables ~]# 
</pre>
<p>
however, domains that are up, are up.  I can log into one of mine on that box, and all appears well.  
</p><p>
working on it. </p>

<p>
everyone should be back up, you can reboot your domains now, too. 
</p>
]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/07/problem-on-stables-up-domains.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/07/problem-on-stables-up-domains.html</guid>
            
            
            <pubDate>Fri, 16 Jul 2010 17:35:55 -0800</pubDate>
        </item>
        
        <item>
            <title>dish reboot</title>
            <description><![CDATA[We just needed to reboot dish.prgmr.com because of this error:<br />BUG: soft lockup detected on CPU#0!<br /><br />Call Trace:<br />&nbsp;&lt;IRQ&gt; [&lt;ffffffff8025894a&gt;] softlockup_tick+0xce/0xe0<br />&nbsp;[&lt;ffffffff8020df6c&gt;] timer_interrupt+0x3a8/0x402<br />&nbsp;[&lt;ffffffff80258c34&gt;] handle_IRQ_event+0x4e/0x96<br />&nbsp;[&lt;ffffffff80258d20&gt;] __do_IRQ+0xa4/0x105<br />&nbsp;[&lt;ffffffff8020bd6c&gt;] do_IRQ+0x44/0x4d<br />&nbsp;[&lt;ffffffff80351f4c&gt;] evtchn_do_upcall+0x19e/0x256<br />&nbsp;[&lt;ffffffff80209d8e&gt;] do_hypervisor_callback+0x1e/0x2c<br />&nbsp;&lt;EOI&gt; [&lt;ffffffff8035d93e&gt;] show_rd_sect+0x0/0x68<br />&nbsp;[&lt;ffffffff802ee0b9&gt;] __read_lock_failed+0x5/0x14<br />&nbsp;[&lt;ffffffff803494de&gt;] get_device+0x17/0x20<br />&nbsp;[&lt;ffffffff8040415d&gt;] .text.lock.spinlock+0x53/0x8a<br />&nbsp;[&lt;ffffffff8035d965&gt;] show_rd_sect+0x27/0x68<br />&nbsp;[&lt;ffffffff802be588&gt;] sysfs_read_file+0xa5/0x12c<br />&nbsp;[&lt;ffffffff8028031c&gt;] vfs_read+0xcb/0x171<br />&nbsp;[&lt;ffffffff802806fb&gt;] sys_read+0x45/0x6e<br />&nbsp;[&lt;ffffffff802097b2&gt;] tracesys+0xab/0xb5<br /><br /><br /> ]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/07/dish-reboot.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/07/dish-reboot.html</guid>
            
            
            <pubDate>Tue, 06 Jul 2010 20:53:41 -0800</pubDate>
        </item>
        
        <item>
            <title>knife froze up hard and was rebooted</title>
            <description> diagnostics ongoing.</description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/07/knife-froze-up-hard-and-was-re.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/07/knife-froze-up-hard-and-was-re.html</guid>
            
            
            <pubDate>Sun, 04 Jul 2010 01:00:50 -0800</pubDate>
        </item>
        
        <item>
            <title>hamper to be rebooted shortly</title>
            <description><![CDATA[<p>
replacing more disk
</p>
]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/06/hamper-to-be-rebooted-shortly.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/06/hamper-to-be-rebooted-shortly.html</guid>
            
            
            <pubDate>Mon, 28 Jun 2010 17:57:13 -0800</pubDate>
        </item>
        
        <item>
            <title>Please comment on the first draft of the prgmr.com privacy policy </title>
            <description><![CDATA[ <p>It has come to my attention that prgmr.com does not have a written, publicly accessible privacy policy.     Below, I have pasted a first draft.   Please give me feedback.  Note, I've been editing this draft in place... this is /not/ the final version, I'm just soliciting feedback.
</p>
<p>

<pre>
prgmr.com will not release private customer data except in the following cases:

1. in order to comply with ARIN requirements for new IP blocks, we will release
   the name or business name to ARIN. we will be executing the ARIN non-disclosure
   agreement, which requires that ARIN keep your names secret except in the case
   of a court order [1]

2. We will comply with any valid court orders issued by courts that have 
   jurisdiction.

3. we use automated and manual processes to examine network traffic while looking for problems.  

4. we will never examine your disk without permission.   (we may ask you to let us examine your disk or to leave, but if you don't give us permission, we won't examine the disk without a court order.)

5. we may examine network traffic with both manual and automated processes.   the results of this examination won't be shared without a court order.  

6. we may log and examine your serial console while looking for system problems. 




If this document needs to be amended, I will do my best to minimize the impact
on customers, and I will email the address on file with a notice.  If customers
wish to quit a long term contract because of an amendment to this document, any
early termination fees will be waived, and the customer will be given a prorated 
refund based on time used.  
 



[1]https://www.arin.net/resources/agreements/nda.pdf 
</pre>

<p>
Data retention is kind of a sticky thing.  See, the longer I keep the data, the easier it is for me to spot trends and ongoing problems.   But obviously, customers don't want me to keep shit around forever, and without a defined data retention policy, I think it's legally harder for me to tell law enforcement "we don't have that data" when they come knocking.
</p>
<p>

What if I had a clause that said "I give you access to all data I'm
retaining about you at  http://blah/customer" - it would be
more work for me but it would allow me to have longer data retention
(which is good for troubleshooting)  without pissing off customers, especially if I add a 'delete this' button... but I don't know where that puts me legally.
</p>
<p>
Of course, that is technically more difficult... but I could release
a tool that others could use.  (I'd tie the login to the email)
</p>]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/06/please-comment-on-the-first-dr.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/06/please-comment-on-the-first-dr.html</guid>
            
            
            <pubDate>Fri, 25 Jun 2010 17:05:33 -0800</pubDate>
        </item>
        
        <item>
            <title>roadblock in getting an ARIN allocation</title>
            <description><![CDATA[<p>
so, for the past few days I've been trying to get my initial IP allocation from ARIN.   Here is what they say:
</p>
<p>
<pre>
hostmaster@arin.net writes:

> Hello,
>
> Thank you for your reply.  This is close to what we needed but we
> still need you to provide the actual customer name for each IP
> assignment in the list provided please.
>
</pre>
</p>
<p>

I called and asked if this was also policy when DSL providers asked for IP addresses that would be statically assigned, and they said this was true of all static IP addresses. 
</p>
<p>
I explained that my existing policy prevents me from releasing personal
information without a court order.

</p>
<p>
So,  for now, I will be buying more IPs from my upstream.   Until this is solved, we will not be giving people more than one IP per VPS.  
</p>
<p>
If you are okay with me giving your full name to ARIN (under NDA)  please email me.    if 1024 of you are okay with that, my problem is solved.  Please note, they aren't looking for email, postal address or anything else, just your name.
</p>]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/06/roadblock-in-getting-an-arin-a.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/06/roadblock-in-getting-an-arin-a.html</guid>
            
            
            <pubDate>Thu, 24 Jun 2010 14:12:59 -0800</pubDate>
        </item>
        
        <item>
            <title> I messed up hamper when I was at the co-lo</title>
            <description><![CDATA[ ugh.  all users of hamper will get a month credit.    heading down to work on it now.  
<p>
edit:  hamper is back up.  
<p>
So, hamper had 4 drives (a stripe of mirrors)  with a fifth, a spare.   many months ago, one of the drives began failing.  I removed that drive from the raid, and rebuilt onto the hot spare.  
<p>
Earlier today, when I was dealing with the DoS, I thought I'd pull the drive and return it for warranty service.  Bad idea.    The computer seized up.   
<p>
It appears that  for 8 hours, writes didn't go through to the hard drives.      I have reset the drive, and hamper appears functional again.   the outage appears to be from 02:36:36  to 20:46:47  PST
<p>
Other than the 8 hours of no writes, it appears that there was no data loss.    if you are on hamper and are still having problems, please let me know.  

<p>
This has encouraged me to accelerate my long-talked-about backup plan.  ]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/06/it-looks-like-i-messed-up-hamp.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/06/it-looks-like-i-messed-up-hamp.html</guid>
            
            
            <pubDate>Mon, 21 Jun 2010 19:13:40 -0800</pubDate>
        </item>
        
        <item>
            <title>DoS in the he.net fremont location</title>
            <description>we&apos;re currently seeing 160 megabits/sec on a 100 megabit/sec pipe, so it&apos;s almost certainly a DoS of some sort.   we&apos;re working on it, and hopefully will get it cleaned up faster than the problem at svtix.  (It appears to be unrelated;  a different customer.)</description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/06/dos-in-the-henet-freemont-loca.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/06/dos-in-the-henet-freemont-loca.html</guid>
            
            
            <pubDate>Mon, 21 Jun 2010 13:38:36 -0800</pubDate>
        </item>
        
        <item>
            <title>robe network driver</title>
            <description><![CDATA[ Robe was down since yesterday with <a href="http://book.xen.prgmr.com/mediawiki/index.php/Peth0:_too_many_iterations_%286%29_in_nv_nic_irq_rx.">the too many iterations (6) in nv nic irq rx problem</a> in the forcedeth ethernet driver. The problem was unrelated to the DDoS attack yesterday. Users on robe will get an additional free month.<br /><div><br /></div>]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/06/robe-network-driver.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/06/robe-network-driver.html</guid>
            
                <category domain="http://www.sixapart.com/ns/types#category">outage</category>
            
            
            <pubDate>Mon, 21 Jun 2010 05:55:25 -0800</pubDate>
        </item>
        
        <item>
            <title>network problems at SVTIX</title>
            <description><![CDATA[<p> We've been working on it for 8 hours.   will update when we know more.  
</p>
<p>
Edit: looks like it /may/ be a DDoS.   I mean, there /is/ a DDoS of one of my customers, and that /may/ be why the router can't stand up.    I asked my upstream to blackhole the target IP (I hate finishing the job for the attacker, but at the moment it's my only choice.)    If that fixes it, we should be up within 1/2 hour.  If not, then I will probably need to replace my router, a process that will probably take closer to 5 hours.  
</p>
<p>
Yes, in fact it was a DDoS.   Null routing the target at my upstream solved the problem.
</p>
<p>
I do want to make a personal apology for taking more than 8 hours to figure this out (and it was Nick who deserves credit for finally figuring it out, not me)    - I can explain some of it by the fact that the problem happened about the time I normally go to sleep, and the symptoms /looked/ a lot like the MAC address conflict I had quite some time ago.  But still, I should have figured this out in half an hour.  There's really no excuse.
</p>
<p>
As per policy, all affected customers will get 1 month credit.   (I probably won't get the credits sent out for a few days, but you /will/ get them)   - this will be painful, but not fatal. 
</p>
]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/06/network-problems-at-svtix.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/06/network-problems-at-svtix.html</guid>
            
            
            <pubDate>Sun, 20 Jun 2010 07:20:37 -0800</pubDate>
        </item>
        
        <item>
            <title>sorry about the early past-due notices</title>
            <description><![CDATA[There was a miscommunication, and past due notices were sent out to everyone more than 10 days late (most notably, this caught my customers who are paying twice a month because they have two accounts billed on different days.)&nbsp; when they should have only been sent to new people who were 10 days past due.&nbsp; (Obviously, I'm a little quicker to shut you off if you've never given me any money.)&nbsp; <br /><br />Anyhow, please accept my apology, and don't worry, we won't shut off existing customers for at least 30 days.&nbsp; <br /><br />edit:&nbsp; Please note;&nbsp; nobody was cut off.&nbsp; we just sent out the warning emails, and some people were understandably worried that their domain might get shut down. <br /> ]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/06/sorry-about-the-early-pastdue.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/06/sorry-about-the-early-pastdue.html</guid>
            
            
            <pubDate>Thu, 17 Jun 2010 21:50:21 -0800</pubDate>
        </item>
        
        <item>
            <title>on logging serial consoles.</title>
            <description><![CDATA[<p>So every now and again a customer will complain of a crashing domain.    Occasionally, it is an early sign of a <a href="http://wiki.xen.prgmr.com/xenophilia/2009/06/see-this-is-why-i-dont-assume.html">hardware problem</a> that I need to deal with,  so I don't want to just ignore it.
</p><p>
Now, the problem is that like a physical server, once the domain has rebooted, most of the information about why it crashed is gone.   (and what little is left is in /var/log on the guest, and as a general rule we don't like mucking around in the guest.  that's your business, not ours.)  
</p><p>
Now, on a physical server, we solve this by using a logging serial console.  (I recommend <a href="http://opengear.com">opengear</a> if you have the money, and a used cyclades if you don't.   The 'buddy system'  (making one server the console server for the next, and the next server the console server for the first) usually requires adding USB serial dongles, but is cheaper still for installations with only a few servers.   I personally like the IOgear brand USB -> serial dongles Fry's has.)
</p><p>
I can turn on debug logging in xenconsoled and that will log the console for all domains to a file (one file for each domain)  then I can use those logs to troubleshoot the problem.   The thing is,  apparently some people have <a href="http://www.kuro5hin.org/story/2009/7/16/14112/0851">privacy concerns</a>  with this, so I haven't done it yet.
</p><p>
Now, personally, I don't think serial consoles are that sensitive.  I mean, it's common to leave terminals in data centers where passers by can see the output.   They will allow me to see what program is crashing, which may be sensitive, and depending on how you have the thing configured, I can see when people log in and log out.
</p><p>
So, I have several options.   
<ol>
<li>I could leave it as is, and continue to go back and forth and guess if someone asks me why something crashed after a reboot</li>
<li>I can log all consoles and delete the data once a week or once a month</li>
<li> I can apply a <a href="http://lists.xensource.com/archives/html/xen-devel/2010-05/msg01036.html">patch</a> to log some people's consoles and not others, and let the user decide</li>
</ol>
</p><p>
Obviously, option 2 makes my life a /whole lot/ easier.  Option 3 is better than option 1, but it still means maintaining an out of tree xenconsoled (or pushing it upstream)  
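</p><p>
For option 2, the cleanup side could be as simple as a cron job along these lines. This is only a sketch: the function name, the retention period, and the log directory are assumptions for illustration, since where xenconsoled writes its per-domain files depends on how it is built and configured.
</p>
<pre>
```python
import os
import time


def purge_old_console_logs(log_dir, max_age_days, now=None):
    """Delete regular files in log_dir last modified more than max_age_days ago.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        # Only touch plain files; leave subdirectories alone.
        if os.path.isfile(path) and cutoff > os.path.getmtime(path):
            os.remove(path)
            removed.append(name)
    return sorted(removed)


# Hypothetical weekly run; the directory below is an assumption, not a
# path confirmed anywhere in this post:
#   purge_old_console_logs("/var/log/xen/console", 7)
```
</pre>
<p>
Note that this goes by last-modified time, so a domain that is still writing to its console keeps its whole file; deleting only the old part of a live log would need rotation on top of this.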
</p>]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/06/on-logging-serial-consoles.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/06/on-logging-serial-consoles.html</guid>
            
                <category domain="http://www.sixapart.com/ns/types#category">Business</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">new features</category>
            
                <category domain="http://www.sixapart.com/ns/types#category">security</category>
            
            
                <category domain="http://www.sixapart.com/ns/types#tag">privacy</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">serial console</category>
            
                <category domain="http://www.sixapart.com/ns/types#tag">support</category>
            
            <pubDate>Tue, 15 Jun 2010 13:40:04 -0800</pubDate>
        </item>
        
        <item>
            <title>crock reboot again</title>
            <description><![CDATA[Crock needed to reboot again because the dom0 kernel hung, with the same error:<br />BUG: soft lockup detected on CPU#0!<br /><br />Call Trace:<br />&nbsp;&lt;IRQ&gt; [&lt;ffffffff8025758a&gt;] softlockup_tick+0xce/0xe0<br />&nbsp;[&lt;ffffffff8020df48&gt;] timer_interrupt+0x3a0/0x3fa<br />&nbsp;[&lt;ffffffff80257874&gt;] handle_IRQ_event+0x4e/0x96<br />&nbsp;[&lt;ffffffff80257960&gt;] __do_IRQ+0xa4/0x105<br />&nbsp;[&lt;ffffffff8020bd5c&gt;] do_IRQ+0x44/0x4d<br />&nbsp;[&lt;ffffffff8034c980&gt;] evtchn_do_upcall+0x19e/0x250<br />&nbsp;[&lt;ffffffff80209d8e&gt;] do_hypervisor_callback+0x1e/0x2c<br />&nbsp;&lt;EOI&gt; [&lt;ffffffff803581ea&gt;] show_rd_sect+0x0/0x68<br />&nbsp;[&lt;ffffffff802ebbf9&gt;] __read_lock_failed+0x5/0x14<br />&nbsp;[&lt;ffffffff80343f3e&gt;] get_device+0x17/0x20<br />&nbsp;[&lt;ffffffff803fc3fd&gt;] .text.lock.spinlock+0x53/0x8a<br />&nbsp;[&lt;ffffffff80358211&gt;] show_rd_sect+0x27/0x68<br />&nbsp;[&lt;ffffffff802bc351&gt;] sysfs_read_file+0xa5/0x12e<br />&nbsp;[&lt;ffffffff8027e3f5&gt;] vfs_read+0xcb/0x171<br />&nbsp;[&lt;ffffffff8027e7d4&gt;] sys_read+0x45/0x6e<br />&nbsp;[&lt;ffffffff802097b2&gt;] tracesys+0xab/0xb5<br /><br />So we're thinking this is a hardware problem and plan to put crock's disks into a new system that should be more stable.<br /> ]]></description>
            <link>http://wiki.xen.prgmr.com/xenophilia/2010/05/crock-reboot-again.html</link>
            <guid>http://wiki.xen.prgmr.com/xenophilia/2010/05/crock-reboot-again.html</guid>
            
                <category domain="http://www.sixapart.com/ns/types#category">outage</category>
            
            
            <pubDate>Sat, 29 May 2010 13:21:29 -0800</pubDate>
        </item>
        
    </channel>
</rss>
