So far I have implemented the scoring function evaluation on the GPU, which computes the current energy and the forces experienced by all movable atoms. It uses a warp-level reduction over all atoms built on the shuffle (SHFL) intrinsics. I also tested other ways of obtaining a speedup, including pinned host memory. In addition, I parallelized the computation over the ligands in the input file using a work queue with reader and writer threads, so both the CPU and GPU versions now process multiple ligands in parallel.
The rest of the computation still runs on the CPU. Because the minimization is iterative, data is repeatedly transferred back and forth between the CPU and GPU. This transfer is wasteful, and removing it is my next goal. Once the forces on all atoms are known, a series of steps decides how the atoms rotate and translate: converting the positions from Cartesian to internal coordinates, building a torsion tree representation of the molecule in which each node is a rigid subunit, calculating the net torsions of these rigid components, and performing a BFGS step with a line search to move the molecule. Rotations are performed with quaternions. The final coordinates from this process are used to compute a new score and new forces. I am currently working on moving the torsion tree generation and coordinate update to the GPU.
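The ligand-level parallelism described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not my actual implementation: the names Ligand, WorkQueue, minimize_placeholder and process_ligands are hypothetical, the placeholder function stands in for the real scoring and minimization, and the writer thread that outputs results is left out to keep the example short. A single reader thread fills the queue while a pool of workers drains it.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for one ligand read from the input file.
struct Ligand {
    int id;                      // assumed to be the ligand's index, 0..n-1
    std::vector<double> coords;  // flattened atom coordinates
};

// Thread-safe FIFO queue; close() signals that no more items will arrive.
template <typename T>
class WorkQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item is available. Returns false once the queue is
    // closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Placeholder for the real minimization: sum of squared coordinates.
double minimize_placeholder(const Ligand& lig) {
    double e = 0.0;
    for (double x : lig.coords) e += x * x;
    return e;
}

// One reader thread feeds the queue; n_workers threads consume from it.
// results[i] holds the value computed for the ligand with id i.
std::vector<double> process_ligands(const std::vector<Ligand>& ligands,
                                    unsigned n_workers) {
    if (n_workers == 0) n_workers = 1;
    WorkQueue<Ligand> queue;
    std::vector<double> results(ligands.size(), 0.0);

    std::thread reader([&] {
        for (const Ligand& lig : ligands) queue.push(lig);
        queue.close();
    });

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; ++w) {
        workers.emplace_back([&] {
            Ligand lig;
            while (queue.pop(lig)) {
                assert(lig.id >= 0 &&
                       static_cast<std::size_t>(lig.id) < results.size());
                // Each id is handled by exactly one worker, so the writes
                // go to distinct elements and need no extra locking.
                results[static_cast<std::size_t>(lig.id)] = minimize_placeholder(lig);
            }
        });
    }

    reader.join();
    for (std::thread& t : workers) t.join();
    return results;
}
```

Calling process_ligands(ligands, 4) returns one value per ligand, in input order. In a GPU version of this pattern, each worker would presumably launch the scoring kernel for its ligand instead of calling the placeholder.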
Performing the scoring function evaluation on the GPU gave a 3x speedup over the CPU implementation, and parallelizing over the ligands in the input file gave a speedup that is roughly linear in the number of CPUs. So far I can show figures of the cumulative speedup as each optimization was applied, as well as the speedup as a function of the number of ligand molecules. The takeaway from the latter is that the GPU is currently underutilized, which is not helped by the fact that my register usage still prevents more than one block from being scheduled per SM at a time. Ultimately I will have plots of the total speedup over the parallel CPU implementation for different types of input. I will also have a demo, since it is easy to visualize the minimization process: plenty of molecule visualization programs exist, so the program can write out a structure file at every iteration to make a short movie. I am not too worried about finishing at this point.