<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Xen hosting: Lessons from the Trenches</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/" />
    <link rel="self" type="application/atom+xml" href="http://wiki.xen.prgmr.com/xenophilia/atom.xml" />
    <id>tag:wiki.xen.prgmr.com,2008-03-02:/xenophilia/2</id>
    <updated>2010-02-07T21:32:57Z</updated>
    
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Open Source 4.1</generator>

<entry>
    <title>sloooow disk I/O on horn due to bad disk</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/02/sloooow-disk-io-on-horn-due-to.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.184</id>

    <published>2010-02-07T21:32:32Z</published>
    <updated>2010-02-07T21:32:57Z</updated>

    <summary><![CDATA[I'm heading out to fix it right now.&nbsp;...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[I'm heading out to fix it right now.&nbsp; ]]>
        
    </content>
</entry>

<entry>
    <title>IPv6 router upgrade at SVTIX</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/01/ipv6-router-upgrade-at-svtix.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.183</id>

    <published>2010-01-31T20:33:42Z</published>
    <updated>2010-01-31T20:35:26Z</updated>

    <summary><![CDATA[(there are only a few of you on it) &nbsp; we're rebooting our experimental IPv6 router for testing... it shouldn't be more than a few minutes downtime for IPv6.&nbsp;...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[(there are only a few of you on it) &nbsp; <br /><br />we're rebooting our experimental IPv6 router for testing... it shouldn't be more than a few minutes downtime for IPv6.&nbsp; <br /> ]]>
        
    </content>
</entry>

<entry>
    <title>partial network outage last night</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2010/01/partial-network-outage-last-ni.html" />
    <id>tag:wiki.xen.prgmr.com,2010:/xenophilia//2.182</id>

    <published>2010-01-04T01:39:22Z</published>
    <updated>2010-01-04T01:43:22Z</updated>

    <summary><![CDATA[my provider tells me there was an intermittent network outage at my Fremont he.net location (my reseller, not he.net)&nbsp; from 11pm to 1am PST. &nbsp;...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[my provider tells me there was an intermittent network outage at my Fremont he.net location (my reseller, not he.net)&nbsp; from 11pm to 1am PST. &nbsp; ]]>
        
    </content>
</entry>

<entry>
    <title>stables and birds going down for update and move</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/12/stables-and-birds-going-down-f.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.181</id>

    <published>2009-12-31T06:04:55Z</published>
    <updated>2009-12-31T06:07:46Z</updated>

    <summary>they are in one of our supermicro 2 in 1u units...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        they are in one of our supermicro 2 in 1u units 
        
    </content>
</entry>

<entry>
    <title>hydra rebooting shortly</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/12/hydra-rebooting-shortly.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.180</id>

    <published>2009-12-31T02:55:00Z</published>
    <updated>2009-12-31T03:10:31Z</updated>

    <summary><![CDATA[we're trying to see if we can xm save like we did on lion, unlike we did on boar, but it's a pretty old box, so we might be rebooting you.&nbsp; [root@hydra /]# uptime&nbsp;18:56:26 up 410 days, 15:44,&nbsp; 2 users,&nbsp;...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
        <category term="hosting status" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[we're trying to see if we can xm save like we did on lion, unlike we did on boar, but it's a pretty old box, so we might be rebooting you.&nbsp; <br /><br />[root@hydra /]# uptime<br />&nbsp;18:56:26 up 410 days, 15:44,&nbsp; 2 users,&nbsp; load average: 0.09, 0.29, 0.25<br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>boar.prgmr.com going down shortly</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/12/boarprgmrcom-going-down-shortl.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.179</id>

    <published>2009-12-31T00:24:57Z</published>
    <updated>2009-12-31T00:27:10Z</updated>

    <summary><![CDATA[just like lion, except that there are fewer customers on boar. [root@boar ~]# uptime&nbsp;16:24:55 up 410 days, 15:32,&nbsp; 3 users,&nbsp; load average: 0.00, 0.01, 0.00 [root@boar ~]# xm list |wc -l 18 starting upgrade now (service won't be impacted until we start the...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[just like lion, except that there are fewer customers on boar.<br /><br />[root@boar ~]# uptime<br />&nbsp;16:24:55 up 410 days, 15:32,&nbsp; 3 users,&nbsp; load average: 0.00, 0.01, 0.00<br /><br />[root@boar ~]# xm list |wc -l<br />18<br /><br /><br />starting upgrade now (service won't be impacted until we start the reboot)<br /><br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>(short) downtime on hydra, stables, birds and boar - tomorrow</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/12/short-downtime-on-hydra-stable.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.178</id>

    <published>2009-12-30T07:15:13Z</published>
    <updated>2009-12-30T07:18:29Z</updated>

    <summary><![CDATA[all servers will be rebooted (as lion was today)&nbsp; for some kernel upgrades, and to consolidate all my he.net servers to one rack. &nbsp;...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
        <category term="hosting status" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[all servers will be rebooted (as lion was today)&nbsp; for some kernel upgrades, and to consolidate all my he.net servers to one rack. &nbsp; ]]>
        
    </content>
</entry>

<entry>
    <title>~20 min network outage today</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/12/20-min-network-outage-today.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.177</id>

    <published>2009-12-30T04:25:23Z</published>
    <updated>2009-12-30T04:27:20Z</updated>

    <summary><![CDATA[from 11:58 to 12:19 PST&nbsp; -&nbsp; We suspect upstream network trouble as the cause.&nbsp;...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[from 11:58 to 12:19 PST&nbsp; -&nbsp; We suspect upstream network trouble as the cause.&nbsp; ]]>
        
    </content>
</entry>

<entry>
    <title>lion rebooting for kernel refresh and a move</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/12/lion-rebooting-for-kernel-refr.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.176</id>

    <published>2009-12-30T02:13:28Z</published>
    <updated>2009-12-30T02:15:13Z</updated>

    <summary><![CDATA[[root@lion ~]# uptime&nbsp;18:12:12 up 451 days, 17:42,&nbsp; 8 users,&nbsp; load average: 0.01, 0.03, 0.00 as usual, if we don't screw it up it will be 20 minutes downtime and no reboot for you, due to xm save/restore...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
        <category term="hardware" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="hosting status" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[[root@lion ~]# uptime<br />&nbsp;18:12:12 up 451 days, 17:42,&nbsp; 8 users,&nbsp; load average: 0.01, 0.03, 0.00<br /><br /><br />as usual, if we don't screw it up it will be 20 minutes downtime and no reboot for you, due to xm save/restore<br /><br /> ]]>
        
    </content>
</entry>

<entry>
    <title>prgmr.com network outage tonight</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/12/prgmrcom-network-outage-tonigh.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.175</id>

    <published>2009-12-09T13:21:47Z</published>
    <updated>2009-12-09T13:24:32Z</updated>

    <summary><![CDATA[ &nbsp; I screwed up. I gave two customers the same IP and MAC address. this took down the network. the problem should be fixed now. as far as I can tell we were down about two hours.&nbsp;...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="humanerror" label="human error" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[<span class="status-body">      
              <span class="actions"><div>&nbsp;      <br /></div></span>
        <span class="entry-content">I screwed up.  I gave two customers the same IP and MAC address.  this took down the network. the problem should be fixed now.<br /><br />as far as I can tell we were down about two hours.&nbsp; <br /><br /><br /><br /></span></span> ]]>
        
    </content>
</entry>

<entry>
    <title>rebooting cerberus to replace a bad drive.  Ugh.  never again sata_nv</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/12/rebooting-cerberus-to-replace.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.174</id>

    <published>2009-12-03T06:59:46Z</published>
    <updated>2009-12-03T07:04:42Z</updated>

    <summary><![CDATA[yeah.&nbsp; so sata_nv doesn't hot swap on the 2.6.18 xen kernel that comes with xen 3.3.&nbsp; ugh.&nbsp; if nothing goes too badly, you won't notice the reboot;&nbsp; your host will only be unreachable for a few minutes, you will not actually...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[yeah.&nbsp; so sata_nv doesn't hot swap on the 2.6.18 xen kernel that comes with xen 3.3.&nbsp; ugh.&nbsp; <br /><br />if nothing goes too badly, you won't notice the reboot;&nbsp; your host will only be unreachable for a few minutes; you will not actually experience a reboot.&nbsp; <br /> ]]>
        
    </content>
</entry>

<entry>
    <title>IPv6 RA blocked in Fremont</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/11/ipv6-ra-blocked-in-fremont.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.172</id>

    <published>2009-11-23T22:50:24Z</published>
    <updated>2009-11-23T23:06:12Z</updated>

    <summary>So last week at the he.net data center in Fremont, one of our provider&apos;s switches rebooted, and since then we haven&apos;t been receiving any IPv6 router advertisements on our LAN. We contacted them immediately, but it still hasn&apos;t been fixed. Meanwhile,...</summary>
    <author>
        <name>nick</name>
        <uri>http://www.schmalenberger.us/</uri>
    </author>
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        So last week at the he.net data center in Fremont, one of our provider&apos;s switches rebooted, and since then we haven&apos;t been receiving any IPv6 router advertisements on our LAN. We contacted them immediately, but it still hasn&apos;t been fixed. Meanwhile, the workaround is to manually add a static default route and the address that autoconfiguration would have assigned if the RAs were being received. For example, on Linux, the commands would be ip -6 route add default via 2001:470:1:41::1 dev eth0 for the default route and ip addr add 2001:470:1:41:(replace with host portion of your ipv6 address)/64 dev eth0 for the address. 
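        If you do not know what your autoconfigured address would have been, here is a minimal sketch of one way to work it out. It assumes your host builds its address from the interface MAC using modified EUI-64 (the common default for Linux autoconfiguration at the time, but not if privacy extensions are on). The function names and the example MAC are made up for illustration; only the prefix and gateway come from the commands above.

```python
import ipaddress


def eui64_address(prefix, mac):
    """Build the address autoconfiguration would derive from a MAC (modified EUI-64)."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC like 00:16:3e:12:34:56")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC to get a 64-bit interface ID.
    interface_id = bytes(octets[:3] + [0xFF, 0xFE] + octets[3:])
    network = ipaddress.IPv6Network(prefix)
    return ipaddress.IPv6Address(
        int(network.network_address) | int.from_bytes(interface_id, "big")
    )


def workaround_commands(prefix, gateway, mac, dev="eth0"):
    """Return the two ip commands from the post, filled in for one host."""
    network = ipaddress.IPv6Network(prefix)
    address = eui64_address(prefix, mac)
    return [
        f"ip -6 route add default via {gateway} dev {dev}",
        f"ip addr add {address}/{network.prefixlen} dev {dev}",
    ]


# The MAC below is a made-up example; substitute the MAC of your own interface.
for command in workaround_commands(
    "2001:470:1:41::/64", "2001:470:1:41::1", "00:16:3e:12:34:56"
):
    print(command)
```

        Check the printed address against what your host had before the RAs stopped; if it does not match, your host was probably not using EUI-64 and you should use the old address instead.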
        
    </content>
</entry>

<entry>
    <title>do pci-express post diagnostic cards exist?</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/11/do-pciexpress-post-diagnostic.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.171</id>

    <published>2009-11-06T04:39:58Z</published>
    <updated>2009-11-06T04:44:14Z</updated>

    <summary> maybe I&apos;m just lazy and blind, but I can&apos;t find any post diagnostic cards that work with modern pci-express only servers. for those of you who are not janitors (or who work exclusively with tyan boards, which have an...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
        <category term="hardware" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="computerjanitordiaries" label="computer janitor diaries" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="hardware" label="hardware" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="postdiagnosticcard" label="post diagnostic card" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[ <p>maybe I'm just lazy and blind, but I can't find any post diagnostic cards that work with modern pci-express only servers.
</p>
<p>
for those of you who are not janitors (or who work exclusively with tyan boards, which have an awesome post diagnostic LED on-board)  a post diagnostic card reads I/O port 0080h, and prints out a 2 digit hex number on an LED display, which you can look up in the back of your motherboard manual to figure out what the hell is going on when your server is so goddamn loud you can't hear the goddamn beep codes.  I had one that was PCI, and see many that are PCI and ISA, but I see none that would fit in my PCI-e only server.  And I've got a supermicro that won't boot.  One side of a 2 in 1u, too. 
</p>
<p>
Example of an ISA/PCI post diagnostic card:  
http://www.elstonsystems.com/prod/pc_post_diagnostic_card.html
</p>
<p>
I need one of those, only I need it to plug into usb or serial or pci-express or some port my supermicro 2 in 1u servers have.   
</p>
<p>
Meanwhile, I went ahead and overpaid for another stand-alone supermicro motherboard and hooked it into one of my old 3U chassis, so I should have capacity sometime later this week. 
</p> ]]>
        
    </content>
</entry>

<entry>
    <title>Xen domains not starting up.  </title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/10/xen-domains-not-starting-up.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.170</id>

    <published>2009-10-19T03:12:34Z</published>
    <updated>2009-10-19T11:09:15Z</updated>

    <summary><![CDATA[http://book.xen.prgmr.com/mediawiki/index.php/Vif_doesnt_go_away_when_shutting_down also: http://lists.xensource.com/archives/html/xen-devel/2009-10/msg00873.html worst-case I will reboot the server with a known-good xen kernel.&nbsp; (I was using 3.4.2-rc;&nbsp; I will downgrade to 3.4.1.)&nbsp; One way or another, the problem will be solved tonight. Users who have not shut down are thus far unaffected.&nbsp; I...]]></summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
        <category term="hosting status" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[http://book.xen.prgmr.com/mediawiki/index.php/Vif_doesnt_go_away_when_shutting_down<br /><br />also: http://lists.xensource.com/archives/html/xen-devel/2009-10/msg00873.html<br /><br />worst-case I will reboot the server with a known-good xen kernel.&nbsp; (I was using 3.4.2-rc;&nbsp; I will downgrade to 3.4.1.)&nbsp; <br /><br />One way or another, the problem will be solved tonight.<br /><br />Users who have not shut down are thus far unaffected.&nbsp; I will try to reboot without&nbsp; interfering much with their operation. <br /><br />I should have reported this the other night.<br /><br /><br />Update:&nbsp; rebooting branch (2:56 PST on Oct 19th.&nbsp; )<br />update:&nbsp; branch is up and everything is confirmed good (4:08 PST on Oct 19th)<br /><br />&nbsp;<br /> ]]>
        
    </content>
</entry>

<entry>
    <title>emergency network maintenance (also, upgrading to gigabit)</title>
    <link rel="alternate" type="text/html" href="http://wiki.xen.prgmr.com/xenophilia/2009/10/emergency-network-maintainince.html" />
    <id>tag:wiki.xen.prgmr.com,2009:/xenophilia//2.169</id>

    <published>2009-10-15T02:11:05Z</published>
    <updated>2009-10-15T05:49:41Z</updated>

    <summary>we&apos;ve been having some mysterious packet loss issues that look a lot like we are oversubscribing a 50Mbps connection, but we&apos;re not. we have a 100Mbps commit on a gig port. Our upstream believes the problem to be with our...</summary>
    <author>
        <name>luke</name>
        <uri>http://prgmr.com</uri>
    </author>
    
        <category term="hosting status" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="outage" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="troubleshooting" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-US" xml:base="http://wiki.xen.prgmr.com/xenophilia/">
        <![CDATA[we've been having some mysterious packet loss issues that look a lot like we are oversubscribing a 50Mbps connection, but we're not.  we have a 100Mbps commit on a gig port.  Our upstream believes the problem to be with our router, so I've been working on it.    anyhow, I found this in my foundry:
<pre>
BR-charon#show ip traffic
IP Statistics
  1916350241 received, 52550054 sent, 585142657 forwarded
  629624 filtered, 67 fragmented, 78 reassembled, 2033812 bad header
  14173 no route, 0 unknown proto, 0 no buffer, 632845943 other errors
</pre>
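<p>
a rough way to read those counters: the sketch below (Python on a workstation, nothing that runs on the router) parses the output above and compares "other errors" against "received".  the function names are ours, and treating that ratio as a health signal is an assumption, since we don't know exactly what the foundry counts as an "other error".
</p>
<pre>
```python
import re

# Output pasted from the router, as shown above.
SHOW_IP_TRAFFIC = """\
IP Statistics
  1916350241 received, 52550054 sent, 585142657 forwarded
  629624 filtered, 67 fragmented, 78 reassembled, 2033812 bad header
  14173 no route, 0 unknown proto, 0 no buffer, 632845943 other errors
"""


def parse_counters(text):
    """Turn each 'COUNT label' pair in the output into a dict entry."""
    pairs = re.findall(r"(\d+) ([a-z ]+?)(?:,|$)", text, flags=re.MULTILINE)
    return {label: int(count) for count, label in pairs}


def error_ratio(counters, key="other errors"):
    """Fraction of received packets that ended up in the given counter."""
    return counters[key] / counters["received"]


counters = parse_counters(SHOW_IP_TRAFFIC)
print(f"other errors / received = {error_ratio(counters):.1%}")
```
</pre>
<p>
on the numbers above, "other errors" works out to roughly a third of everything received, which is why the router looked like a good suspect.
</p>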
<p>
so we're swapping it out with a procurve 2824 with firmware that was fresh this year (downright modern!)   
</p>
<p>
anyhow, there shouldn't be more than 120 seconds or so of downtime for anyone.  we're doing the move incrementally.  It should be done tonight.
</p>

Update:  we are done.   ]]>
        
    </content>
</entry>

</feed>
