From: Dave Peterson
Subject: using heat gun to cause ecc memory errors
Date: 2005-06-20 14:26

I thought I'd share my recent experiences using a heat gun to cause
ECC memory errors. Perhaps others will find this to be a convenient
means of testing their code.

What follows is a detailed description of the equipment and technique
I am using. To make my results easy to reproduce, I will attempt to be
as detailed as possible.

The memory that I am using is Corsair PC-2700 double-sided registered
ECC (512 MB per stick). The heat gun is a "Master Heat Gun" (model #
HG 501) made by Master Appliance Corporation
(http://www.masterappliance.com/). According to the manufacturer's web
site, the product has the following specifications:

    Temperature: 500-750 degrees F (260-400 degrees C)
    Volts:       120 AC (60 Hz)
    Current:     14 amps
    Power:       1680 watts

On the side of the device, there is a dial that may be turned to
control the width of a number of ventilation slots that regulate the
amount of airflow into the device. Increasing the width of the slots
decreases the temperature of the air that blows from the nozzle. I
open the slots as wide as possible to produce the minimum temperature.

The procedure I use is as follows:

1. Boot the machine and make sure the ecc or bluesmoke module for your
   chipset is loaded. I like to increase the ECC error polling
   frequency (to an interval of 1 msec), although this is not strictly
   necessary.

2. Execute the C program below. As a command-line argument, feed it a
   number that is close to the amount of physical memory in your
   machine. Running "top" should show that essentially all of physical
   memory is allocated, the C program is using roughly 99% of a CPU,
   and little or no paging activity is occurring.

3. Adjust the dial on the heat gun so that the ventilation slots are
   opened as wide as possible and the air temperature is minimized.
   Be sure not to forget this step! I haven't tried it with the slots
   partially or fully closed.
   Given the amount of heat that the gun can produce, I would be
   concerned that partially or fully closing the slots may be enough
   to melt something or start a fire.

   Be careful when using the heat gun. The air that blows from the
   nozzle is hot enough to cause pain if you hold your hand in front
   of it for a few seconds. The nozzle gets hot enough that you can
   burn yourself by touching it accidentally.

4. Turn on the heat gun and start blowing hot air onto the surface of
   one of the DIMMs. I hold the nozzle roughly 2 1/2 inches from the
   surface of the DIMM and make a back and forth sweeping motion
   across the DIMM (approximately 2 seconds per sweep from one end of
   the DIMM to the other). After doing this for approximately a minute
   and a half, I start seeing printk() messages on the console
   indicating single-bit ECC errors, at a rate somewhere between one
   error every several seconds and several errors per second.

   Once I start seeing ECC errors, I stop blowing hot air onto the
   DIMM, just to be safe and minimize the chances of damaging
   something.

I have found the above procedure to be a very reliable means of
causing single-bit ECC errors. However, I state the following
disclaimer:

    Before you attempt to perform the above steps, please remember
    that you are doing so at your own risk. If you aren't careful
    with the heat gun, you may burn yourself, damage your hardware,
    or perhaps even start a fire.

As an additional warning, I have absolutely no idea how repeated use
of this technique may affect the performance or reliability of your
hardware. It seems reasonable to expect that repeated exposure to a
heat gun may substantially shorten the lifetime of your DIMMs or your
motherboard.
Dave

#include <stdio.h>
#include <stdlib.h>

void usage (void)
{
    fprintf(stderr, "Usage: mem SIZE\n\n"
            "    SIZE: number of bytes to malloc()\n");
    exit(1);
}

int main (int argc, char **argv)
{
    long size;
    char *endptr;
    unsigned int *buf, sum;   /* unsigned, so the sums wrap on overflow */
    size_t i, n;

    if ((argc != 2) || (argv[1][0] == '\0'))
        usage();

    size = strtol(argv[1], &endptr, 0);

    if (*endptr || (size <= 0))
        usage();

    if ((buf = (unsigned int *) malloc(size)) == NULL) {
        fprintf(stderr, "malloc() failed\n");
        return 1;
    }

    n = size / sizeof(*buf);

    /* The code below is somewhat ad-hoc.  My goal is just to repeat
     * the following steps forever:
     *
     *     1.  Go through roughly all of physical memory, doing reads.
     *     2.  Go through roughly all of physical memory, doing writes.
     */

    /* Initial pass: touch every word so the pages are really allocated. */
    for (i = 0; i < n; i++)
        buf[i] = (unsigned int) i;

    for (sum = 0; ; ) {
        /* Read pass. */
        for (i = 0; i < n; i++)
            sum += buf[i];

        /* Write pass; depends on sum so the reads cannot be dropped. */
        for (i = 0; i < n; i++)
            buf[i] += (unsigned int) i + sum;
    }

    return 0;
}