However, Eric Landry informed me that this is not always the case. He showed that a straightforward implementation is 9% faster than the standard one on an API CS20D (dual Alpha, 833MHz) running NetBSD 1.6.1, compiled with gcc 2.95.3 at optimization level -O2. His implementation makes fewer memory accesses by avoiding unnecessary reloads.
Here is Eric's code, named mt19937ar-nrl.c (nrl for "no reload"). It adopts several optimizations, such as the clever technique "^(-(*p1 & 1) & MATRIX_A);". In my experiments using cygwin + gcc, however, this code is 8% slower than the standard code.
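
To see why the "-(*p1 & 1) & MATRIX_A" expression works, it may help to compare it with the mag01[] table lookup used in the standard mt19937ar.c. The following is only a minimal sketch, not Eric's actual code: the function names twist_mask_table and twist_mask_branchless are hypothetical, and a 32-bit unsigned type is assumed. Negating the low bit in unsigned arithmetic gives either 0 or an all-ones word, so ANDing with MATRIX_A selects either 0 or MATRIX_A without a branch or a memory load.

```c
#include <stdint.h>

#define MATRIX_A 0x9908b0dfUL /* constant vector a, as in mt19937ar.c */

/* Table form, as in the standard code: mag01[y & 1] is 0 or MATRIX_A. */
static uint32_t twist_mask_table(uint32_t y)
{
    static const uint32_t mag01[2] = { 0x0UL, MATRIX_A };
    return mag01[y & 1u];
}

/* Branchless form (hypothetical helper illustrating the quoted trick).
 * (y & 1) is 0 or 1; its unsigned negation is 0 or 0xFFFFFFFF,
 * which acts as a mask that keeps or clears MATRIX_A. */
static uint32_t twist_mask_branchless(uint32_t y)
{
    uint32_t all_ones_if_odd = (uint32_t)(0u - (y & 1u));
    return all_ones_if_odd & (uint32_t)MATRIX_A;
}
```

Both functions return the same value for every input, so the result can be XORed into the recurrence in the same way. Which form is faster depends on the compiler and the machine, as the two measurements above suggest.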