Tea Time – 11/29/2015 – The Elastic Internet

It is interesting that Google alone boasts more than 2.4 million servers (http://www.circleid.com/posts/20101021_googles_spending_spree_24_million_servers_and_counting/) and Microsoft had crossed the 1 million server mark by 2013 (http://www.extremetech.com/extreme/161772-microsoft-now-has-one-million-servers-less-than-google-but-more-than-amazon-says-ballmer).  Assuming that each server draws about 750 watts (averaged between compute and storage nodes), that means at least 3.4 million servers times 750 watts, or 2,550,000 kilowatts of power consumption.  On top of this, there is a requirement for cooling.  Suppose that the cooling is done via heat pumps (air conditioners).  With an energy efficiency ratio (EER) of 8 (https://www.e-education.psu.edu/egee102/node/2106) and a conversion of 3.41 BTU/hr per watt, an additional 1,086,937 kilowatts must be included to remove the heat of the servers.  That makes 3,636,937 kilowatts total to keep the servers online.  Coal energy is measured in kilowatt-hours per ton (https://www.eia.gov/tools/faqs/faq.cfm?id=667&t=2), so we convert to kWh by multiplying by the number of hours per year: 365.25 * 24 = 8,766 hours per year.
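For those who want to check the arithmetic, here is a quick back-of-the-envelope sketch in Python using the same assumed figures (3.4 million servers, 750 watts each, an EER of 8, and 3.41 BTU/hr per watt):

```python
# Back-of-the-envelope power estimate using the assumptions above:
# 3.4 million servers at ~750 W each, cooled by heat pumps with an EER of 8.
SERVERS = 3_400_000          # combined Google + Microsoft estimate (assumed)
WATTS_PER_SERVER = 750       # average draw per server (assumed)
BTU_PER_HR_PER_WATT = 3.41   # heat produced per watt of electrical load
EER = 8                      # BTU/hr of cooling per watt of cooling power

it_load_kw = SERVERS * WATTS_PER_SERVER / 1_000            # 2,550,000 kW
heat_btu_per_hr = it_load_kw * 1_000 * BTU_PER_HR_PER_WATT
cooling_kw = heat_btu_per_hr / EER / 1_000                 # about 1,086,937.5 kW
total_kw = it_load_kw + cooling_kw                         # about 3,636,937.5 kW

print(f"IT load:      {it_load_kw:,.0f} kW")
print(f"Cooling load: {cooling_kw:,.0f} kW")
print(f"Total:        {total_kw:,.0f} kW")
```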

From the above calculations, the total energy needed per year for Google and Microsoft is 31,881,394,125 kWh.  Using a conversion of 1,904 kWh per ton of coal, it thus takes 16,744,430 tons, or 15,222,209,091 kilograms, of coal per year to keep them running.  The density of coal depends on the type.  However, we can use an average between 641 and 929 kg per cubic meter (http://www.ask.com/science/bulk-density-coal-e55167b75b4deafc), or 785 kg per cubic meter, to figure out the size of the coal lump we will need.  Dividing the number of kilograms of coal required by the density gives us 19,391,349 cubic meters.  Taking the cube root of that gives us a cubic lump about 269 meters per side.  For those using American units, we multiply by 3.28 feet per meter to arrive at about 881 feet per side, or a little less than two and a half American football fields per side.
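Continuing the sketch, the same assumptions carry the annual energy through to the size of the coal cube (the roughly 909 kg per ton figure simply converts 2,000-pound tons at 2.2 pounds per kilogram):

```python
# Continuing the arithmetic: annual energy, then the size of the coal lump.
HOURS_PER_YEAR = 365.25 * 24          # 8,766 hours
KWH_PER_TON_COAL = 1904               # conversion used above
KG_PER_TON = 2000 / 2.2               # ~909 kg per 2,000 lb ton
COAL_DENSITY_KG_M3 = (641 + 929) / 2  # ~785 kg per cubic meter

total_kw = 3_636_937.5              # from the previous sketch
kwh_per_year = total_kw * HOURS_PER_YEAR          # ~31.9 billion kWh
tons_of_coal = kwh_per_year / KWH_PER_TON_COAL    # ~16.7 million tons
kg_of_coal = tons_of_coal * KG_PER_TON            # ~15.2 billion kg
volume_m3 = kg_of_coal / COAL_DENSITY_KG_M3       # ~19.4 million cubic meters
side_m = volume_m3 ** (1 / 3)                     # ~269 m per side
side_ft = side_m * 3.28                           # ~881 ft per side

print(f"{kwh_per_year:,.0f} kWh/yr -> {kg_of_coal:,.0f} kg of coal")
print(f"Cube of coal: {side_m:,.0f} m ({side_ft:,.0f} ft) per side")
```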

This is where elasticity comes into play.  An elastic server application uses software-driven automation to determine whether a server is heavily or lightly loaded and can take steps to remove power from lightly loaded servers.  Thus, depending on how loaded the servers are at a given time of day, the number of servers drawing power may be decreased, and with it the size of that lump of coal.  A really good application for such automation is streaming video.  The majority of the population sleeps at night and works during the day.  Thus, servers that provide video to a population can be assumed to have their heaviest load during the evening hours, when people are off work but not yet asleep.  The rest of the day would be a light load on those servers.  Assuming 8 hours of heavy load per day, this means a savings of 2/3, or roughly 10,148,139,394 kilograms (about 10.1 billion kilograms) of coal per year.  Now wouldn't that be fabulous?
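The savings estimate itself, under the optimistic assumption that lightly loaded servers can be powered off entirely for the other 16 hours of the day:

```python
# Sketch of the elasticity savings: if servers only need full power for the
# 8 busiest hours of the day, roughly 2/3 of the coal could be saved.
# (Assumes lightly loaded servers can be powered off entirely.)
kg_of_coal_per_year = 15_222_209_091
heavy_hours_per_day = 8
savings_fraction = 1 - heavy_hours_per_day / 24        # 2/3
kg_saved = kg_of_coal_per_year * savings_fraction      # ~10.1 billion kg
print(f"Coal saved per year: {kg_saved:,.0f} kg")
```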

Tea Time – 11/7/2015

It must have been 25 years ago when it dawned on me that, in the future, we would have fabulous products that didn't work all the time.  You see, I was working at a consumer electronics company, churning out circuits and code for gadgets at a feverish pace.  The rule was that you had to get the product to market fast and at a very low price.  I reflect on that now as I attempt to use my tablet computer and the application crashes.  You see, reliability is inversely proportional to complexity.

Think of the Space Shuttle.  Per https://www.nasa.gov/pdf/566250main_2011.07.05%20SHUTTLE%20ERA%20FACTS.pdf there were about two and a half million parts in the vehicle.  The parts in a space vehicle are classified as class ‘S’ parts (http://engineer.jpl.nasa.gov/practices/1203.pdf).  These are the highest reliability parts available.  Now, suppose each part had a reliability of 99.999%, that is, a failure rate of 0.001%.  Even then, on average about 25 of those 2.5 million parts may be failing at any given time; at a more modest 99.9% reliability, that number climbs to roughly 2,500.
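The expectation calculation itself is short; the per-part reliabilities below are purely assumed figures, not NASA data:

```python
# Expected number of simultaneously failing parts for a few assumed
# per-part reliabilities (illustrative figures only).
PARTS = 2_500_000                      # approximate Shuttle part count
for reliability in (0.99999, 0.999):
    expected = PARTS * (1 - reliability)
    print(f"reliability {reliability:.5%} -> ~{expected:,.0f} failing parts")
```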

There are ways to increase the reliability.  The usual first attempt is to install a redundant system.  However, this is often not done correctly.  Suppose you have two power supplies in your home computer.  If one fails, then the other will take over for it.  But they are plugged into the same outlet.  A common-mode failure of the power in the outlet will still cause the system to fail.
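Some made-up numbers show how badly the shared outlet undercuts the redundancy; the individual failure probabilities below are assumptions chosen only to illustrate the effect:

```python
# Illustrative numbers only: why a shared outlet undercuts redundant supplies.
p_supply = 0.01      # assumed chance a single power supply fails in a year
p_outlet = 0.005     # assumed chance the shared outlet/circuit fails in a year

# Ideal redundancy: both independent supplies must fail.
p_ideal = p_supply ** 2                                # 1 in 10,000

# With the common-mode path, the outlet alone can take both supplies down.
p_common = 1 - (1 - p_supply ** 2) * (1 - p_outlet)    # ~1 in 200

print(f"Independent supplies only: {p_ideal:.4%}")
print(f"With shared outlet:        {p_common:.4%}")
```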

Then there is cross-coupling.  Suppose you have two computers monitoring a device.  They also send a message to each other to make sure the other is alive.  They are both on different power systems and both have different mechanisms to observe the device they are monitoring.  If anything should fail, they are to send out a message to headquarters telling of the need to attend to the device.  Now suppose that the first computer has an electrical fault and it burns out its little electronic brain.  Along with its conflagration, the electrical fault travels down the communications connection between the two monitors and deals a similar fatal blow to the second computer.  Nothing is sent back to mission control and the critical part that is being monitored falls into a silent death.

Perhaps the end-all in reliability analysis is the Failure Mode, Effects, and Criticality Analysis (FMECA): http://rsdo.gsfc.nasa.gov/documents/Rapid-III-Documents/MAR-Reference/GSFC-FAP-322-208-FMEA-Draft.pdf.  This analysis is performed on spacecraft as well as on medical life-support devices.  The issue with the FMECA is that it costs a great deal if the system is not documented.  Therefore, if you are designing a system from scratch, it is best to perform an FMECA just after the first high-level design review and iterate the findings back into the requirements.
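To give a flavor of what such a worksheet looks like, here is a much-simplified sketch using the common severity-occurrence-detection risk priority number convention.  It is not the specific GSFC procedure linked above, and the components and ratings are invented purely for illustration:

```python
# A much-simplified FMEA-style worksheet: rank failure modes by a risk
# priority number (severity x occurrence x detection, each rated 1-10).
failure_modes = [
    # (component, failure mode, severity, occurrence, detection difficulty)
    ("power supply", "output short",             9, 3, 2),
    ("cooling fan",  "bearing seizure",          6, 5, 4),
    ("comm link",    "fault propagates to peer", 10, 2, 8),
]

# Highest risk priority number first.
ranked = sorted(failure_modes, key=lambda m: m[2] * m[3] * m[4], reverse=True)
for component, mode, sev, occ, det in ranked:
    print(f"{component:12s} {mode:25s} RPN={sev * occ * det}")
```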

Here is another fun thing to do.  Take a drawing of your system and then X out one of the components.  Was any type of alert sent to notify of the failure?  If not, then move up the chain and fail another component until something is noticed.  Does your system fail catastrophically?  What happens if you do notice the failure but the Mean Time To Repair (MTTR) is longer than the Mean Time Between Failures (MTBF) of the components?  This means that even with a backup system, if you don't repair the failed unit before the backup itself has a high probability of failing, the whole system may go down.
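A rough way to quantify that last point, assuming exponentially distributed failures (an assumption, not a law): the chance the backup fails before the primary is repaired is about 1 - exp(-MTTR/MTBF), which passes 50% once MTTR exceeds roughly 0.7 times MTBF.

```python
# Chance that the backup also fails before the primary is repaired,
# assuming exponentially distributed failures with the given MTBF.
import math

def p_backup_fails_during_repair(mtbf_hours: float, mttr_hours: float) -> float:
    return 1 - math.exp(-mttr_hours / mtbf_hours)

for mtbf, mttr in [(1000, 24), (1000, 500), (1000, 2000)]:
    p = p_backup_fails_during_repair(mtbf, mttr)
    print(f"MTBF={mtbf} h, MTTR={mttr} h -> P(both down) ~ {p:.1%}")
```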

Ok.  So what is all of this stuff worth?  If you are not in the technology industry, think of an expensive item you buy that is very complex.  Let's say, your laptop.  Yes.  You have just purchased a new laptop.  You should just start using it, right?  Wrong.  These things have limited warranties and are cranked out in the thousands per year.  The warranty is usually for one year.

Most failures occur within the first 100 hours of the life of a part (the so-called infant mortality period).  Thus, if you have been shipped a lemon laptop, it would behoove you to turn the thing on and configure it so that it does not drop into power-save mode and stays on (or buy a Wiebetech Mouse Jiggler: http://www.amazon.com/CRU-Inc-30200-0100-0011-WiebeTech-Jiggler/dp/B000O3S0PK).  It is also good if you can have it do something while it is powered up for those 100 hours, but just letting it run and ensuring its disk drive is spinning is good enough.  If it makes it through the first 100 hours, then it will most likely survive until it wears out.
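One way to picture the burn-in logic is a toy "lemon" model: assume a small fraction of units are defective and fail quickly, while the rest fail slowly.  All of the numbers below are invented for illustration:

```python
# Toy burn-in model: a small fraction of units are "lemons" with a short
# mean life; the rest have a long mean life. A 100-hour burn-in catches
# most of the lemons. All parameters are made up for illustration.
import math

LEMON_FRACTION = 0.03        # assumed share of defective units
LEMON_MEAN_LIFE = 40.0       # hours, assumed
GOOD_MEAN_LIFE = 50_000.0    # hours, assumed

def p_fail_by(t_hours: float) -> float:
    """Probability that a randomly chosen unit has failed by time t."""
    p_lemon = 1 - math.exp(-t_hours / LEMON_MEAN_LIFE)
    p_good = 1 - math.exp(-t_hours / GOOD_MEAN_LIFE)
    return LEMON_FRACTION * p_lemon + (1 - LEMON_FRACTION) * p_good

print(f"P(unit fails during a 100 h burn-in): {p_fail_by(100):.2%}")
print(f"Share of lemons caught by the burn-in: "
      f"{1 - math.exp(-100 / LEMON_MEAN_LIFE):.1%}")
```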

Another cup?