From: Dave Peterson
Subject: using heat gun to cause ecc memory errors  
Date: 2005-06-20 14:26

I thought I'd share my recent experiences using a heat gun to cause
ecc memory errors.  Perhaps others will find this to be a convenient
means of testing their code.  What follows is a detailed description
of the equipment and technique I am using.  To make my results easy
to reproduce, I will attempt to be as detailed as possible.

The memory that I am using is Corsair PC-2700 double-sided
registered ECC (512 MB per stick).  The heat gun is a "Master Heat
Gun" (model # HG 501) made by Master Appliance Corporation
(http://www.masterappliance.com/).  According to the manufacturer's
web site, the product has the following specifications:

    Temperature: 500-750 degrees F (260-400 degrees C)
    Volts: 120 AC (60 Hz)
    Current: 14 amps
    Power: 1680 watts

On the side of the device, there is a dial that may be turned to
control the width of a number of ventilation slots that regulate the
amount of airflow into the device.  Increasing the width of the slots
decreases the temperature of the air that blows from the nozzle.  I
open the slots as wide as possible to produce the minimum temperature.

The procedure I use is as follows:

    1.  Boot the machine and make sure the ecc or bluesmoke module for
        your chipset is loaded.  I like to shorten the ECC error
        polling interval to 1 msec, although this is not strictly
        necessary.

    2.  Execute the C program below.  As a command-line argument, feed
        it a number that is close to the amount of physical memory in
        your machine.  Running "top" should show that essentially all
        of physical memory is allocated, the C program is using roughly
        99% of a CPU, and little or no paging activity is occurring.

    3.  Adjust the dial on the heat gun so that the ventilation slots
        are opened as wide as possible and the air temperature is
        minimized.  Be sure not to forget this step!  I haven't tried
        it with the slots partially or fully closed.  Given the amount
        of heat that the gun can produce, I would be concerned that
        partially or fully closing the slots may be enough to melt
        something or start a fire.

        Be careful when using the heat gun.  The air that blows from
        the nozzle is hot enough to cause pain if you hold your hand
        in front of it for a few seconds.  The nozzle gets hot enough
        that you can burn yourself by touching it accidentally.

    4.  Turn on the heat gun and start blowing hot air onto the surface
        of one of the DIMMs.  I hold the nozzle roughly 2 1/2 inches
        from the surface of the DIMM and make a back and forth sweeping
        motion across the DIMM (approximately 2 seconds per sweep from
        one end of the DIMM to the other).  After doing this for
        approximately a minute and a half, I start seeing printk()
        messages on the console indicating single-bit ECC errors at a
        rate of somewhere between (one error every several seconds) and
        (several errors per second).  Once I start seeing ECC errors, I
        stop blowing hot air onto the DIMM, just to be safe and
        minimize the chances of damaging something.
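
For step 2, here is a rough sketch of one way to pick the SIZE argument
instead of guessing.  The helper names phys_mem_bytes() and
suggest_size() are hypothetical (they are not part of the program
below), the 90% figure in the usage note is only an assumption for
"close to the amount of physical memory", and _SC_PHYS_PAGES is a
glibc/Linux extension rather than strict POSIX.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: total physical RAM in bytes, or 0 if the
 * system will not report it.  _SC_PHYS_PAGES is a glibc/Linux
 * extension, so this is an assumption about the platform. */
unsigned long long phys_mem_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return 0;

    return (unsigned long long) pages * (unsigned long long) page_size;
}

/* Hypothetical helper: roughly `percent` percent of `total`, rounded
 * down to a multiple of sizeof(int).  Dividing before multiplying
 * avoids overflow at the cost of a little precision. */
unsigned long long suggest_size(unsigned long long total, unsigned percent)
{
    unsigned long long bytes;

    assert(percent <= 100);

    bytes = total / 100 * percent;
    return bytes - bytes % sizeof(int);
}
```

Printing suggest_size(phys_mem_bytes(), 90) with "%llu" gives a number
that can be passed as SIZE.  If "top" then shows paging activity, try a
smaller percentage.  The program below parses SIZE with strtol(), so
the value also has to fit in a long.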

I have found the above procedure to be a very reliable means of causing
single-bit ECC errors.  However, I must add the following disclaimer:

    Before you attempt to perform the above steps, please remember that
    you are doing so at your own risk.  If you aren't careful with the
    heat gun, you may burn yourself, damage your hardware, or
    perhaps even start a fire.  As an additional warning, I have
    absolutely no idea how repeated use of this technique may affect
    the performance or reliability of your hardware.  It seems
    reasonable to expect that repeated exposure to a heat gun may
    substantially shorten the lifetime of your DIMMs or your motherboard.

Dave


#include <stdio.h>
#include <stdlib.h>

void usage (void)
 { fprintf(stderr,
           "Usage: mem SIZE\n\n"
           "       SIZE: number of bytes to malloc()\n");
   exit(1);
 }

int main (int argc, char **argv)
 { long size;
   char *endptr;
   int *buf, sum;
   long i, n;   /* long, like size, so the element count cannot overflow an int */

   if ((argc != 2) || (argv[1][0] == '\0'))
      usage();

   size = strtol(argv[1], &endptr, 0);

   if (*endptr || (size < 0))
      usage();

   if ((buf = (int *) malloc(size)) == NULL)
    { fprintf(stderr, "malloc() failed\n");
      return 1;
    }

   n = size / sizeof(*buf);

   /* The code below is somewhat ad-hoc.  My goal is just to repeat
    * the following steps forever:
    *
    *     1.  Go through roughly all of physical memory, doing reads.
    *     2.  Go through roughly all of physical memory, doing writes.
    */

   for (i = 0; i < n; i++)
      buf[i] = i;

   for (sum = 0; ; )
    { for (i = 0; i < n; i++)
         sum += buf[i];

      for (i = 0; i < n; i++)
         buf[i] += i + sum;
    }

   return 0;
 }
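
As a complement to watching the console in step 4, a corrected-error
counter can be polled from a file.  The sketch below is not part of the
procedure above: read_counter() is a hypothetical helper, and the path
in the usage note assumes a kernel with the later EDAC sysfs interface,
which the ecc or bluesmoke module you are running may not provide.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one non-negative decimal counter from a
 * text file (for example an EDAC sysfs attribute).  Returns the value,
 * or -1 if the file cannot be opened or does not hold a number. */
long read_counter(const char *path)
{
    FILE *fp;
    long value;
    int matched;

    assert(path != NULL);

    if ((fp = fopen(path, "r")) == NULL)
        return -1;

    matched = fscanf(fp, "%ld", &value);
    fclose(fp);

    return (matched == 1 && value >= 0) ? value : -1;
}
```

On a system that exposes it, calling
read_counter("/sys/devices/system/edac/mc/mc0/ce_count") once before
heating the DIMM and again every second or so afterwards should show
the count climbing as single-bit errors are corrected.  A return value
of -1 means the file is not there and the console messages are the
thing to watch.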