
Commit 15bfc7cc authored by rbabich

quda: changed where autotuned parameters are set so that initBlas() is only necessary before doing reductions.


git-svn-id: http://lattice.bu.edu/qcdalg/cuda/quda@613 be54200a-260c-0410-bdd7-ce6af2a381ab
parent 7ac1f1ee
@@ -16,7 +16,7 @@ BiCGstab are provided, with support for double, single, and half
 Software compatibility:
-The library has been tested under linux (CentOS 5.3 and Ubuntu 8.04)
+The library has been tested under Linux (CentOS 5.3 and Ubuntu 8.04)
 using release 2.3 of the CUDA toolkit.  There are known issues with
 releases 2.1 and 2.2, but 2.0 should work if one is forced to use an
 older version (for compatibility with an old driver, for example).
@@ -45,6 +45,24 @@ edit the first few lines to specify the CUDA install path, the
 platform (x86 or x86_64), and the GPU architecture (see "Hardware
 compatibility" above).  Then type 'make' to build the library.
 
+As an optional step, 'make tune' will invoke tests/blas_test to
+perform autotuning of the various BLAS-like functions needed by the
+inverters.  This involves testing many combinations of parameters
+(corresponding to different numbers of CUDA threads per block and
+blocks per grid for each kernel) and writing the optimal values to
+lib/blas_param.h.  The new values will take effect the next time the
+library is built.  Ideally, the autotuning should be performed on the
+machine where the library is to be used, since the optimal parameters
+will depend on the CUDA device and host hardware.
+
+In summary, for an optimized install, run
+
+    make && make tune && make
+
+By default, the autotuning is performed using CUDA device 0.  To
+select a different device number, set DEVICE in make.inc
+appropriately.
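For concreteness, the table that 'make tune' writes to lib/blas_param.h is a pair of static C arrays with one row per BLAS-like kernel and one column per precision; the excerpt below is copied from the regenerated file further down this page, and the half/single/double column ordering is inferred from blas_test's "Benchmarking %d bit precision" output, so treat it as an assumption. Because the table is compiled into the library, the final 'make' is what makes the new values take effect.

    static int blas_threads[22][3] = {   /* columns: half, single, double (assumed order) */
      {  64,   64,   64},  // Kernel  0: copyCuda
      {  64,  128,   64},  // Kernel  1: axpbyCuda
      /* ... one row per BLAS-like kernel, 22 in all ... */
    };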
 Using the library:
 
@@ -53,21 +71,16 @@ against lib/libquda.a, and study tests/invert_test.c for an example of
 the interface.  The various inverter options are enumerated in
 include/enum_quda.h.
 
-The lib/blas_quda.cu file contains all of the BLAS-like functions
-required for the inverters.  The threads per block and blocks per grid
-parameters are auto-tuned using the blas_test function in tests/, and
-the output stored in blas_param.h which is included here.  These
-optimal values may change as a function of the CUDA device and the
-host hardware, so re-running blas_test and copying over the output
-blas_param.h into lib/ and recompiling the blas library may provide
-extra performance.
 Known issues:
 
-* One of the stages of the build process requires over 5 GB of memory.
-  If too little memory is available, the compilation will either take
-  a very long time (given enough swap space) or fail completely.
+* When building for the 'sm_13' GPU architecture (which enables double
+  precision support), one of the stages in the build process requires
+  over 5 GB of memory.  If too little memory is available, the
+  compilation will either take a very long time (given enough swap
+  space) or fail completely.  In addition, the CUDA C compiler
+  requires over 1 GB of disk space in /tmp for the creation of
+  temporary files.
 
 * For compatibility with CUDA, on 32-bit platforms the library is compiled
   with the GCC option -malign-double.  This differs from the GCC default
@@ -88,7 +101,7 @@ M. A. Clark, R. Babich, K. Barros, R. Brower, and C. Rebbi, "Solving
 Lattice QCD systems of equations using mixed precision solvers on
 GPUs" (2009), arXiv:0911.3191 [hep-lat].
 
-Please also drop us a note so that we can inform you of updates and
+Please also drop us a note so that we may inform you of updates and
 bug-fixes.  The most recent public release will always be available
 online at http://lattice.bu.edu/quda/
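For readers who want the shape of the interface without opening the test, the call sequence exercised by tests/invert_test.c (and annotated by the comments added in this commit) reduces to the sketch below. It is illustrative only: the struct fields shown are a small subset of what a real application must set, the host-field allocation is elided, and the file name invert_sketch.c is made up.

    /* invert_sketch.c -- minimal sketch, not a drop-in program; see
     * tests/invert_test.c for the complete parameter setup. */
    #include <quda.h>
    
    int main(void)
    {
      int device = 0;                                /* CUDA device number */
    
      QudaGaugeParam gauge_param = newQudaGaugeParam();   /* obtain parameter structs */
      QudaInvertParam inv_param = newQudaInvertParam();
    
      gauge_param.X[0] = 24;                         /* lattice dimensions */
      gauge_param.X[1] = 24;
      gauge_param.X[2] = 24;
      gauge_param.X[3] = 48;
      /* ... set precisions, gauge order, inverter type, tolerance, etc. ... */
    
      void *gauge[4];                                /* host gauge field, one pointer per direction */
      void *spinorIn = NULL, *spinorOut = NULL;      /* host source and solution spinors */
      /* ... allocate and initialize the host fields here ... */
    
      initQuda(device);                              /* initialize the QUDA library */
      loadGaugeQuda((void*)gauge, &gauge_param);     /* copy the gauge field to the GPU */
      invertQuda(spinorOut, spinorIn, &inv_param);   /* solve for spinorOut */
      endQuda();                                     /* free GPU resources */
    
      return 0;
    }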
@@ -11,10 +11,9 @@ QUDA_HDRS = blas_quda.h clover_quda.h dslash_quda.h enum_quda.h gauge_quda.h \
 # files containing complex macros and other code fragments to be inlined,
 # found in lib/
-QUDA_INLN = blas_param.h check_params.h clover_def.h dslash_common.h \
-	dslash_def.h dslash_textures.h io_spinor.h read_clover.h \
-	read_gauge.h reduce_complex_core.h reduce_core.h \
-	reduce_triple_core.h
+QUDA_INLN = check_params.h clover_def.h dslash_common.h dslash_def.h \
+	dslash_textures.h io_spinor.h read_clover.h read_gauge.h \
+	reduce_complex_core.h reduce_core.h reduce_triple_core.h
 
 # files generated by the scripts in lib/generate/, found in lib/dslash_core/
 # (The current clover_core.h was edited by hand.)
@@ -41,6 +40,9 @@ clean:
 %.o: %.cpp $(HDRS)
 	$(CXX) $(CXXFLAGS) $< -c -o $@
 
+blas_quda.o: blas_quda.cu blas_param.h $(HDRS)
+	$(NVCC) $(NVCCFLAGS) $< -c -o $@
+
 %.o: %.cu $(HDRS) $(CORE)
 	$(NVCC) $(NVCCFLAGS) $< -c -o $@
......
-/*
-   Auto-tuned blas CUDA parameters, generated by blas_test
-*/
-
-// Kernel: copyCuda
-blas_threads[0][0] = 64;
-blas_blocks[0][0] = 2048;
 [... the old file continued in this per-assignment format for all 22 kernels at each of the 3 precisions; it is replaced in full by the tables below ...]
+//
+// Auto-tuned blas CUDA parameters, generated by blas_test
+//
+
+static int blas_threads[22][3] = {
+  {  64,   64,   64},  // Kernel  0: copyCuda
+  {  64,  128,   64},  // Kernel  1: axpbyCuda
+  {  64,  128,   64},  // Kernel  2: xpyCuda
+  {  64,  128,   64},  // Kernel  3: axpyCuda
+  {  64,  128,   64},  // Kernel  4: xpayCuda
+  {  64,  128,   64},  // Kernel  5: mxpyCuda
+  {  64,   64,   64},  // Kernel  6: axCuda
+  {  64,   64,   64},  // Kernel  7: caxpyCuda
+  {  64,   64,   64},  // Kernel  8: caxpbyCuda
+  {  64,   64,   64},  // Kernel  9: cxpaypbzCuda
+  {  64,  128,   64},  // Kernel 10: axpyZpbxCuda
+  {  64,   64,   64},  // Kernel 11: caxpbypzYmbwCuda
+  {  64,  128,  128},  // Kernel 12: sumCuda
+  {  64,  128,  128},  // Kernel 13: normCuda
+  {  64,  128,  128},  // Kernel 14: reDotProductCuda
+  {  64,  128,   64},  // Kernel 15: axpyNormCuda
+  {  64,  128,   64},  // Kernel 16: xmyNormCuda
+  {  64,  128,   64},  // Kernel 17: cDotProductCuda
+  {  64,   64,   64},  // Kernel 18: xpaycDotzyCuda
+  {  64,   64,   64},  // Kernel 19: cDotProductNormACuda
+  {  64,   64,   64},  // Kernel 20: cDotProductNormBCuda
+  {  64,   64,   64}   // Kernel 21: caxpbypzYmbwcDotProductWYNormYQuda
+};
+
+static int blas_blocks[22][3] = {
+  {2048, 1024,  128},  // Kernel  0: copyCuda
+  {2048,  128,  128},  // Kernel  1: axpbyCuda
+  {2048,  128,  128},  // Kernel  2: xpyCuda
+  {2048,  128,  128},  // Kernel  3: axpyCuda
+  {2048,  128,  128},  // Kernel  4: xpayCuda
+  {2048,  128,  128},  // Kernel  5: mxpyCuda
+  {2048,  128, 2048},  // Kernel  6: axCuda
+  {2048,  128, 2048},  // Kernel  7: caxpyCuda
+  {2048,  128, 2048},  // Kernel  8: caxpbyCuda
+  {2048,  128, 2048},  // Kernel  9: cxpaypbzCuda
+  { 512,  128,  128},  // Kernel 10: axpyZpbxCuda
+  {1024,  128,  128},  // Kernel 11: caxpbypzYmbwCuda
+  { 128, 1024,  128},  // Kernel 12: sumCuda
+  { 128, 1024,  128},  // Kernel 13: normCuda
+  {  64, 1024,  128},  // Kernel 14: reDotProductCuda
+  { 256, 1024,  128},  // Kernel 15: axpyNormCuda
+  { 512, 1024,  128},  // Kernel 16: xmyNormCuda
+  {  64,  512,  128},  // Kernel 17: cDotProductCuda
+  { 256,  128,  128},  // Kernel 18: xpaycDotzyCuda
+  {  64, 1024,  128},  // Kernel 19: cDotProductNormACuda
+  {  64, 1024,  128},  // Kernel 20: cDotProductNormBCuda
+  { 512,  128,  256}   // Kernel 21: caxpbypzYmbwcDotProductWYNormYQuda
+};
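Read the regenerated table above as blas_threads[kernel][precision] and blas_blocks[kernel][precision], with kernels numbered as in the comments and the precision index presumably running half, single, double (blas_test reports 16-, 32- and 64-bit benchmarks for prec = 0, 1, 2). A small illustrative lookup, assuming that ordering:

    /* Illustrative only; applications never read this table directly -- it is
     * consumed by setBlock() in lib/blas_quda.cu (shown below). */
    enum { PREC_HALF = 0, PREC_SINGLE = 1, PREC_DOUBLE = 2 };  /* assumed ordering */
    
    int threads = blas_threads[13][PREC_SINGLE];  /* Kernel 13: normCuda -> 128 threads per block */
    int blocks  = blas_blocks[13][PREC_SINGLE];   /* Kernel 13: normCuda -> at most 1024 blocks   */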
@@ -33,14 +33,12 @@ static QudaSumFloat3 *h_reduceFloat3=0;
 unsigned long long blas_quda_flops;
 unsigned long long blas_quda_bytes;
 
-// Number of threads used for each blas kernel
-static int blas_threads[3][22];
-
-// Number of thread blocks for each blas kernel
-static int blas_blocks[3][22];
-
 static dim3 blasBlock;
 static dim3 blasGrid;
 
+// generated by blas_test
+#include <blas_param.h>
+
 void initBlas(void)
 {
   if (!d_reduceFloat) {
@@ -78,10 +76,6 @@ void initBlas(void)
       errorQuda("Error allocating host reduction array");
     }
   }
-
-  // Output from blas_test
-#include <blas_param.h>
-
 }
 
 void endBlas(void)
@@ -104,11 +98,12 @@ void setBlasTuning(int tuning)
 
 void setBlasParam(int kernel, int prec, int threads, int blocks)
 {
-  blas_threads[prec][kernel] = threads;
-  blas_blocks[prec][kernel] = blocks;
+  blas_threads[kernel][prec] = threads;
+  blas_blocks[kernel][prec] = blocks;
 }
 
-void setBlock(int kernel, int length, QudaPrecision precision) {
+void setBlock(int kernel, int length, QudaPrecision precision)
+{
   int prec;
   switch(precision) {
   case QUDA_HALF_PRECISION:
@@ -122,8 +117,8 @@ void setBlock(int kernel, int length, QudaPrecision precision) {
     break;
   }
 
-  int blocks = min(blas_blocks[prec][kernel], max(length/blas_threads[prec][kernel], 1));
-  blasBlock.x = blas_threads[prec][kernel];
+  int blocks = min(blas_blocks[kernel][prec], max(length/blas_threads[kernel][prec], 1));
+  blasBlock.x = blas_threads[kernel][prec];
   blasBlock.y = 1;
   blasBlock.z = 1;
......
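A worked sketch of the launch-geometry clamp in setBlock() above: the tuned thread count is used as-is, while the tuned block count is capped so that no more blocks are launched than the vector length requires (blasGrid is presumably set from the result in lines outside this hunk). The helper and macro names below are illustrative, not part of the library.

    /* Standalone restatement of the computation in setBlock(); MIN/MAX and
     * launch_geometry are illustrative names, not QUDA functions. */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))
    #define MAX(a, b) ((a) > (b) ? (a) : (b))
    
    static void launch_geometry(int tuned_threads, int tuned_blocks, int length,
                                int *threads_out, int *blocks_out)
    {
      *threads_out = tuned_threads;                       /* threads per block, straight from the table */
      *blocks_out  = MIN(tuned_blocks,                    /* never exceed the tuned block count ...      */
                         MAX(length / tuned_threads, 1)); /* ... or more blocks than the length needs    */
    }

For example, the tuned pair (128 threads, 1024 blocks) for normCuda in single precision applied to a vector of hypothetical length 98304 yields 128 threads per block and min(1024, max(98304/128, 1)) = 768 blocks.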
@@ -7,6 +7,8 @@
 
 #include <test_util.h>
 
+#define Nkernels 22
+
 QudaPrecision cuda_prec;
 QudaPrecision other_prec; // Used for copy benchmark
 ParitySpinor x, y, z, w, v, p;
@@ -20,8 +22,8 @@ int gridSizes[] = {64, 128, 256, 512, 1024, 2048};
 
 int prec;
 
-void init() {
+void init()
+{
   int X[4];
 
   X[0] = 24;
@@ -63,7 +65,9 @@ void init() {
 
   setBlasTuning(1);
 }
 
-void end() {
+void end()
+{
   // release memory
   freeParitySpinor(p);
   freeParitySpinor(v);
@@ -73,6 +77,7 @@ void end() {
   freeParitySpinor(z);
 }
 
+
 double benchmark(int kernel) {
   double a, b;
@@ -195,14 +200,44 @@ double benchmark(int kernel) {
 }
 
-int main(int argc, char** argv) {
+void write(char *names[], int threads[][3], int blocks[][3])
+{
+  printf("\nWriting optimal parameters to blas_param.h\n");
+  FILE *fp = fopen("blas_param.h", "w");
+
+  fprintf(fp, "//\n// Auto-tuned blas CUDA parameters, generated by blas_test\n//\n\n");
+
+  fprintf(fp, "static int blas_threads[%d][3] = {\n", Nkernels);
+  for (int i=0; i<Nkernels; i++) {
+    fprintf(fp, " {%4d, %4d, %4d}%c // Kernel %2d: %s\n", threads[i][0], threads[i][1], threads[i][2],
+	    ((i == Nkernels-1) ? ' ' : ','), i, names[i]);
+  }
+  fprintf(fp, "};\n\n");
+
+  fprintf(fp, "static int blas_blocks[%d][3] = {\n", Nkernels);
+  for (int i=0; i<Nkernels; i++) {
+    fprintf(fp, " {%4d, %4d, %4d}%c // Kernel %2d: %s\n", blocks[i][0], blocks[i][1], blocks[i][2],
+	    ((i == Nkernels-1) ? ' ' : ','), i, names[i]);
+  }
+  fprintf(fp, "};\n");
+
+  fclose(fp);
+}
+
+int main(int argc, char** argv)
+{
   int dev = 0;
   if (argc == 2) dev = atoi(argv[1]);
   initQuda(dev);
 
+  int threads[Nkernels][3];
+  int blocks[Nkernels][3];
+
   int kernels[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21};
-  char names[][100] = {
+  char *names[] = {
     "copyCuda",
     "axpbyCuda",
     "xpyCuda",
@@ -227,22 +262,21 @@ int main(int argc, char** argv) {
     "caxpbypzYmbwcDotProductWYNormYQuda"
   };
 
-  FILE *blas_out = fopen("blas_param.h", "w");
-  fprintf(blas_out, "/*\n   Auto-tuned blas CUDA parameters, generated by blas_test\n*/\n");
-
-  for (prec = 0; prec<3; prec++) {
+  for (prec = 0; prec < 3; prec++) {
     init();
 
     printf("\nBenchmarking %d bit precision\n", (int)(pow(2.0,prec)*16));
 
-    for (int i = 0; i <= 21; i++) {
+    for (int i = 0; i < Nkernels; i++) {
       double gflops_max = 0.0;
       double gbytes_max = 0.0;
       int threads_max = 0;
       int blocks_max = 0;
-      for (int thread=0; thread<Nthreads; thread++) {
-	for (int grid=0; grid<Ngrids; grid++) {
+
+      for (int thread = 0; thread < Nthreads; thread++) {
+	for (int grid = 0; grid < Ngrids; grid++) {
 	  setBlasParam(i, prec, blockSizes[thread], gridSizes[grid]);
 
 	  // first do warmup run
@@ -269,27 +303,22 @@ int main(int argc, char** argv) {
 	    blocks_max = gridSizes[grid];
 	  }
 
-	  //printf("%d %d %-36s %f s, flops = %e, Gflops/s = %f, GiB/s = %f\n\n",
-	  //	 blockSizes[thread], gridSizes[grid], names[i], secs, flops, gflops, gbytes);
+	  // printf("%d %d %-36s %f s, flops = %e, Gflops/s = %f, GiB/s = %f\n\n",
+	  //        blockSizes[thread], gridSizes[grid], names[i], secs, flops, gflops, gbytes);
 	}
       }
 
-      if (threads_max == 0 || blocks_max == 0)
-	errorQuda("Autotuning failed for %s kernel", names[i]);
+      if (threads_max == 0) errorQuda("Autotuning failed for %s kernel", names[i]);
 
       printf("%-36s Performance maximum at %d threads per block, %d blocks per grid, Gflops/s = %f, GiB/s = %f\n",
 	     names[i], threads_max, blocks_max, gflops_max, gbytes_max);
 
-      fprintf(blas_out, "// Kernel: %s\n", names[i]);
-      fprintf(blas_out, "blas_threads[%d][%d] = %d;\n", prec, i, threads_max);
-      fprintf(blas_out, "blas_blocks[%d][%d] = %d;\n\n", prec, i, blocks_max);
+      threads[i][prec] = threads_max;
+      blocks[i][prec] = blocks_max;
     }
     end();
   }
 
-  fclose(blas_out);
+  write(names, threads, blocks);
 
   endQuda();
 }
......
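Because the tuning driver is split across several hunks above, here is a condensed restatement of its control flow. benchmark_gflops() is a hypothetical stand-in for the warmup-plus-timed-run logic of benchmark(); the other names are the ones declared in blas_test.cu and blas_quda.cu.

    /* Condensed sketch of the autotuning sweep; not the literal test code. */
    for (prec = 0; prec < 3; prec++) {              /* half, single, double */
      init();                                       /* allocate test spinors at this precision */
      for (int i = 0; i < Nkernels; i++) {          /* each BLAS-like kernel */
        double best = 0.0;
        for (int t = 0; t < Nthreads; t++) {        /* candidate threads per block */
          for (int g = 0; g < Ngrids; g++) {        /* candidate blocks per grid */
            setBlasParam(i, prec, blockSizes[t], gridSizes[g]);
            double gflops = benchmark_gflops(i);    /* hypothetical: warmup run, then timed run */
            if (gflops > best) {                    /* keep the fastest combination seen so far */
              best = gflops;
              threads[i][prec] = blockSizes[t];
              blocks[i][prec]  = gridSizes[g];
            }
          }
        }
      }
      end();                                        /* free the test spinors */
    }
    write(names, threads, blocks);                  /* emit the new lib/blas_param.h */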
@@ -31,9 +31,9 @@ void *spinorRef, *spinorRefEven, *spinorRefOdd;
 void *spinorGPU, *spinorGPUEven, *spinorGPUOdd;
 
 double kappa = 1.0;
-int ODD_BIT = 1;
-int DAGGER_BIT = 0;
-int TRANSFER = 0; // include transfer time in the benchmark?
+int parity = 1;   // even or odd? (0 = even, 1 = odd)
+int dagger = 0;   // apply Dslash or Dslash dagger?
+int transfer = 0; // include transfer time in the benchmark?
 
 void init() {
 
@@ -75,9 +75,9 @@ void init() {
   inv_param.sp_pad = 0;
   inv_param.cl_pad = 0;
 
-  /*gauge_param.ga_pad = 24*24*12;
-  inv_param.sp_pad = 24*24*12;
-  inv_param.cl_pad = 24*24*12;*/
+  // gauge_param.ga_pad = 24*24*12;
+  // inv_param.sp_pad = 24*24*12;
+  // inv_param.cl_pad = 24*24*12;
 
   if (test_type == 2) inv_param.dirac_order = QUDA_DIRAC_ORDER;
   else inv_param.dirac_order = QUDA_DIRAC_ORDER;
@@ -150,7 +150,7 @@ void init() {
 
   printf("Sending fields to GPU... "); fflush(stdout);
 
-  if (!TRANSFER) {
+  if (!transfer) {
 
     gauge_param.X[0] /= 2;
     tmp = allocateParitySpinor(gauge_param.X, inv_param.cuda_prec, inv_param.sp_pad);
@@ -180,7 +180,7 @@ void end() {
   free(spinorGPU);
   free(spinor);
   free(spinorRef);
-  if (!TRANSFER) {
+  if (!transfer) {
     freeSpinorField(cudaSpinorOut);
     freeSpinorField(cudaSpinor);
     freeParitySpinor(tmp);
@@ -198,31 +198,31 @@ double dslashCUDA() {
   for (int i = 0; i < LOOPS; i++) {
     switch (test_type) {
     case 0:
-      if (TRANSFER) {
-	dslashQuda(spinorOdd, spinorEven, &inv_param, ODD_BIT, DAGGER_BIT);
+      if (transfer) {
+	dslashQuda(spinorOdd, spinorEven, &inv_param, parity, dagger);
       } else if (!clover_yes) {
-	dslashCuda(cudaSpinor.odd, gauge, cudaSpinor.even, ODD_BIT, DAGGER_BIT);
+	dslashCuda(cudaSpinor.odd, gauge, cudaSpinor.even, parity, dagger);
       } else {
-	cloverDslashCuda(cudaSpinor.odd, gauge, cloverInv, cudaSpinor.even, ODD_BIT, DAGGER_BIT);
+	cloverDslashCuda(cudaSpinor.odd, gauge, cloverInv, cudaSpinor.even, parity, dagger);
       }
       break;
     case 1:
-      if (TRANSFER) {
-	MatPCQuda(spinorOdd, spinorEven, &inv_param, DAGGER_BIT);
+      if (transfer) {
+	MatPCQuda(spinorOdd, spinorEven, &inv_param, dagger);
       } else if (!clover_yes) {
-	MatPCCuda(cudaSpinor.odd, gauge, cudaSpinor.even, kappa, tmp, inv_param.matpc_type, DAGGER_BIT);
+	MatPCCuda(cudaSpinor.odd, gauge, cudaSpinor.even, kappa, tmp, inv_param.matpc_type, dagger);
       } else {
 	cloverMatPCCuda(cudaSpinor.odd, gauge, clover, cloverInv, cudaSpinor.even, kappa, tmp,
-			inv_param.matpc_type, DAGGER_BIT);
+			inv_param.matpc_type, dagger);
       }
       break;
     case 2:
-      if (TRANSFER) {
-	MatQuda(spinorGPU, spinor, &inv_param, DAGGER_BIT);
+      if (transfer) {
+	MatQuda(spinorGPU, spinor, &inv_param, dagger);
       } else if (!clover_yes) {
-	MatCuda(cudaSpinorOut, gauge, cudaSpinor, kappa, DAGGER_BIT);
+	MatCuda(cudaSpinorOut, gauge, cudaSpinor, kappa, dagger);
       } else {
-	cloverMatCuda(cudaSpinorOut, gauge, clover, cudaSpinor, kappa, tmp, DAGGER_BIT);
+	cloverMatCuda(cudaSpinorOut, gauge, clover, cudaSpinor, kappa, tmp, dagger);
      }
    }
  }
@@ -253,15 +253,15 @@ void dslashRef() {
   fflush(stdout);
   switch (test_type) {
   case 0:
-    dslash(spinorRef, hostGauge, spinorEven, ODD_BIT, DAGGER_BIT,
+    dslash(spinorRef, hostGauge, spinorEven, parity, dagger,
 	   inv_param.cpu_prec, gauge_param.cpu_prec);
     break;
   case 1:
-    matpc(spinorRef, hostGauge, spinorEven, kappa, inv_param.matpc_type, DAGGER_BIT,
+    matpc(spinorRef, hostGauge, spinorEven, kappa, inv_param.matpc_type, dagger,
 	  inv_param.cpu_prec, gauge_param.cpu_prec);
     break;
   case 2:
-    mat(spinorRef, hostGauge, spinor, kappa, DAGGER_BIT,
+    mat(spinorRef, hostGauge, spinor, kappa, dagger,
 	inv_param.cpu_prec, gauge_param.cpu_prec);
     break;
   default:
@@ -273,8 +273,8 @@ void dslashRef() {
 }
 
-void dslashTest() {
-
+int main(int argc, char **argv)
+{
   init();
 
   float spinorGiB = (float)Vh*spinorSiteSize*sizeof(inv_param.cpu_prec) / (1 << 30);
@@ -289,7 +289,7 @@ void dslashTest() {
 
     double secs = dslashCUDA();
 
-    if (!TRANSFER) {
+    if (!transfer) {
       if (test_type < 2)
 	retrieveParitySpinor(spinorOdd, cudaSpinor.odd, inv_param.cpu_prec, inv_param.dirac_order);
       else
@@ -319,7 +319,3 @@ void dslashTest() {
   }
   end();
 }
-
-int main(int argc, char **argv) {
-  dslashTest();
-}
@@ -3,17 +3,18 @@
 #include <time.h>
 #include <math.h>
 
-#include <quda.h>
 #include <test_util.h>
 #include <blas_reference.h>
 #include <dslash_reference.h>
 
+// in a typical application, quda.h is the only QUDA header required
+#include <quda.h>
+
 int main(int argc, char **argv)
 {
-  int device = 0;
-  void *gauge[4], *clover_inv;
+  // set QUDA parameters
+
+  int device = 0; // CUDA device number
 
   QudaGaugeParam gauge_param = newQudaGaugeParam();
   QudaInvertParam inv_param = newQudaInvertParam();
@@ -22,7 +23,6 @@ int main(int argc, char **argv)
   gauge_param.X[1] = 24;
   gauge_param.X[2] = 24;
   gauge_param.X[3] = 48;
-  setDims(gauge_param.X);
 
   gauge_param.anisotropy = 1.0;
   gauge_param.gauge_order = QUDA_QDP_GAUGE_ORDER;
@@ -72,9 +72,16 @@
   }
 
   inv_param.verbosity = QUDA_VERBOSE;
 
+  // Everything between here and the call to initQuda() is application-specific.
+
+  // set parameters for the reference Dslash, and prepare fields to be loaded
+  setDims(gauge_param.X);
+
   size_t gSize = (gauge_param.cpu_prec == QUDA_DOUBLE_PRECISION) ? sizeof(double) : sizeof(float);
   size_t sSize = (inv_param.cpu_prec == QUDA_DOUBLE_PRECISION) ? sizeof(double) : sizeof(float);
 
+  void *gauge[4], *clover_inv;
+
   for (int dir = 0; dir < 4; dir++) {
     gauge[dir] = malloc(V*gaugeSiteSize*gSize);
   }
@@ -98,12 +105,18 @@ int main(int argc, char **argv)
   int c0 = 0;
   construct_spinor_field(spinorIn, 1, i0, s0, c0, inv_param.cpu_prec);
 
-  double time0 = -((double)clock()); // Start the timer
+  double time0 = -((double)clock()); // start the timer
 
+  // initialize the QUDA library
   initQuda(device);
+
+  // load the gauge field
   loadGaugeQuda((void*)gauge, &gauge_param);
+
+  // load the clover term, if desired
   if (clover_yes) loadCloverQuda(NULL, clover_inv, &inv_param);
 
+  // perform the inversion
   invertQuda(spinorOut, spinorIn, &inv_param);
 
   time0 += clock(); // stop the timer
@@ -124,6 +137,7 @@
   double src2 = norm_2(spinorIn, V*spinorSiteSize, inv_param.cpu_prec);
   printf("Relative residual, requested = %g, actual = %g\n", inv_param.tol, sqrt(nrm2/src2));
 
+  // finalize the QUDA library
   endQuda();
 
   return 0;
......
@@ -4,7 +4,6 @@
 #include <quda_internal.h>
 #include <gauge_quda.h>
 #include <spinor_quda.h>
-#include <blas_quda.h>
 #include <util_quda.h>
 
 #include <test_util.h>
@@ -71,7 +70,6 @@ void init() {
 
   int dev = 0;
   cudaSetDevice(dev);
-  initBlas();
 
   param.X[0] /= 2;
   cudaFullSpinor = allocateSpinorField(param.X, param.cuda_prec, sp_pad);
......