I describe work to produce a GPU version of an algorithm that takes a pair of molecules, one of which is mobile subject to certain physical constraints, and optimizes the spatial positions of the atoms of the mobile molecule by moving it into a local energy minimum. This process (the output of the program over several iterations) is shown in the gif above. The program I worked on was smina, available at http://smina.sf.net/, although at this time most changes are limited to the "experimental" version of smina, known as gnina. You can get the latest version relevant to this project from the jss_progress branch of gnina (changes are being gradually merged into the master branch as they are tested). Most of the work is in src/main and src/lib.
My research group cares about improving the performance of smina because we maintain a webserver that allows anyone to perform a similarity search through databases of existing compounds for ones that look like they might bind to a target (it's called pharmit if you want to demo it; might be best to start from one of the examples). smina runs on the backend, energy minimizing the results of a query. The results may be up to a million hits, and there may be many results to minimize at once. So minimizing the results as efficiently as possible is desirable.
We particularly want a GPU version of the minimization routine because we're using gnina to do other things that will be running on the GPU anyway. Even if it ends up not being cheaper to run the basic minimization routine on the GPU, we want to integrate the minimization routine with convolutional neural net work we're doing (currently scoring function training and using a trained neural net to generate possible drugs using a minimal scaffold) and also expect that some of the code will be reusable for conformational entropy calculations we will definitely need to run on the GPU. I focused on developing a mainly GPU version first, and a good final version would probably include both CPU and GPU parallelism that might be selected based on the application or size of the input file.
If you're searching for new drugs to target a specific molecule, it's often a good idea to start looking in a database of existing drugs. You might do a similarity search through that database, using features of a molecule that you already know binds to the molecule of interest. Then you will want to rank the resulting hits somehow, in a manner that is at least correlated with the free energy associated with the two molecules binding with each other - the free energy describes the strength of binding in this case.
You might use a knowledge-based scoring function to assess aspects of the interaction between the two molecules, since this is cheaper than trying to calculate the free energy. But there's a problem with directly calculating the score using the spatial orientation of the hits that were returned from your similarity search - they're aligned to your query, and may be completely nonphysical. They might actually have electrons in the two molecules directly on top of each other, or they might have parts of the hit molecule rotated so that large electron clouds are very close to each other.
The solution is to optimize the pose that was returned via the similarity search, walking it down to an energy minimum and returning the final score you will use to rank all of your hits against each other as possible drugs for your target. This process of optimizing a nonphysical pose by finding a local energy minimum is not limited in applicability to the "virtual screening" process I just described, but it is an important application of it that is useful in academia and industry alike. My research group has a program, smina, that performs single-threaded CPU scoring, minimization, and docking (a Monte Carlo method to find a minimum that's generally better than a local minimum generated from a randomly chosen starting pose). I focused on parallelizing the energy minimization parts of the program, modifying the algorithms as necessary to produce a performant solution. Terminology note - the mobile small molecule is often referred to as a 'ligand' and the molecule it's binding to is often a protein, or sometimes more specifically a receptor. I mostly avoid using these terms later, but in a few places it felt awkward to keep calling the small molecule the "mobile molecule" and I called it the ligand.
The minimization process works by computing the energy and gradient associated with the mobile molecule's current position. You can represent a conservative force (one that depends only on position) as the negative gradient of a scalar potential energy. This potential energy isn't exactly the free energy of a system, but it's a significant part of it, and in this case we want to find a minimum of an approximation of the potential energy, using the forces the mobile molecule's atoms experience as the guide (since they are the negative of the gradient of the energy).
The "scoring function" is an approximation of the potential energy, and it is a weighted sum of terms - electrostatic, van der Waals, etc. The atoms have "types" with associated parameters that are plugged into the scoring function, and there is a cutoff distance for considering an interaction relevant. This lets us represent the energy terms as a bunch of cubic splines. We can precompute these splines using all the possible pairs of atom types, leading to a simple way of looking up the correct set of splines given a pair of atoms and plugging in their current distance from each other to calculate the energy and forces resulting from their interaction. Using cubic splines lets us easily compute the forces and gradients, and guarantees that both of these things will be continuous everywhere, even at the cutoff distance (where a linear approximation's gradient would become discontinuous). Computing the energy and forces associated with the mobile molecule's current position comes down to computing a bunch of pairwise terms for the mobile molecule's atoms and all the other atoms in either molecule that are within the cutoff distance.
Since this N-body problem is constrained by the bonds holding the atoms in the mobile molecule together, there aren't actually 3N degrees of freedom. Even though the energies and forces are evaluated in Cartesian coordinates (because we're considering interactions between both molecules), when we figure out how to move the mobile molecule, allowing it to be flexible and experience torques around rotatable bonds, it makes sense to represent it using a reduced set of coordinates. In mechanics it's common to use this reduced set of "internal coordinates," which represent just the actual degrees of freedom the body has. These include the rigid translation and orientation of the molecule (the latter of which is represented with quaternions in the program) as well as motion around the rotatable bonds I described, which are called torsions. To figure out how to propagate forces and torques through the molecule and coordinate updates to a given part of the molecule with updates to the relevant torsions, the molecule is represented as a tree of rigid subunits. Atoms joined by bonds that they can't rotate around are treated as a single node in the tree, and connections in the tree represent bonds that can rotate. A resultant force is computed, starting at the leaves and working up the tree, computing a net force and net torques via the cross product. Each node except for the root updates the torsions array, and the root updates the position and orientation to reflect rigid translation and rotation. These updates form a representation of the gradient in internal coordinates as directional delta terms.
The gradient is then passed to a minimization algorithm (in this case, BFGS), where a step size in the direction of the gradient is chosen to update the atom positions in the mobile molecule. BFGS is a quasi-Newton method, so it builds up an approximate Hessian using the change in the gradient over successive iterations - this information about the second derivatives of the potential allows the step size to be chosen cleverly, for example by taking smaller steps if the gradient is changing rapidly. If the gradient is close to zero, we're at a minimum and can stop, returning the value of the energy at the new position. Otherwise we continue to iteratively calculate the current energies and the gradient, compute the gradient in the mobile molecule's internal coordinates, take varying step sizes in the direction of the gradient, and check for convergence. In either scenario we also need to return the updated atom positions in Cartesian coordinates, so the tree representing the mobile molecule is traversed again to produce this final output from the internal coordinate representation.
One challenge with this project is that the code base is pretty big - cloc is reporting around 100,000 lines of C++ code. There are also a lot of ways various parts of the pipeline could be performed and lots of room for performance tuning and algorithm development. The original code is also heavily object-oriented in places, particularly representing the torsion tree, and it uses a Boost quaternion library that I couldn't find a canned replacement for so I had to write a minimal version myself (not actually hard, though, since it only had to support a few operations).
One strategy for parallelizing the program is just to minimize multiple ligands in the input at a time. I parallelized over the mobile molecules in the input file; I did this via a work queue on the CPU. A single thread read in molecules from an input file and added objects to a queue to be minimized. A pool of workers popped from the queue, performed the minimization job, and stored their results in another queue. A single writer thread popped from the result queue and wrote the output to the file. Since we wanted to preserve the overall order of the molecules in the input and output files, the jobs were numbered and written out in order. This gave speedup that scaled reasonably with the number of CPUs (on the results shown below, the program was tested on a machine with 12 execution contexts). This version could in theory also launch GPU kernels in parallel by using the thread-local default stream, so it will be desirable going forward to incorporate both CPU and GPU parallelism into a single program for maximum performance benefits. Plots showing speedup against the CPU are showing speedup over the ligand-parallel CPU version, but it's notable that SIMD parallelism still hasn't been added to the CPU version.
The first part of the algorithm to be moved to the GPU was the scoring function evaluation. This part is embarrassingly parallel, with pairwise energies and forces being calculated for every pair of atoms between the two input molecules. A block of threads was launched per ligand atom, with each thread computing the energy and force terms associated with a single receptor atom and the ligand atom assigned to its block. However, it involves several reductions of energy and 3-dimensional force terms - both across all pairwise terms associated with a single atom and all atoms in the ligand. These were initially a significant bottleneck, so I replaced the initial reductions with a slightly modified version of the warp-based reductions described by Nvidia's Mark Harris here. Basically you do a series of pairwise additions with the offset of your partner reduced by a factor of two each round - i.e. first you add your value with that of the thread offset from you by 16 elements, then with the thread offset from you by 8 elements, etc., until thread 0 in the warp has the final reduced sum. Using a warp for this is nice because threads within a warp are inherently synchronous and they also have an intrinsic instruction called a shufl that lets them share each other's register values. With 1,024 elements spread across 32 warps you can do 1 round of warp shuffles, then have the thread 0 of each warp write to a shared memory buffer, do a single synchronization, and then have warp 0 do a read of the buffer and the final reduction. You have to do a little extra if you want to support non-power-of-two inputs, which I did in this case.
The reductions were the biggest bottleneck when profiling the naive GPU scoring function evaluation initially, and improving them gave ~2X speedup. I optimized a few other things at this stage, too. For one thing, I was passing around a bunch of float3 and individual float data, so I turned those things into float4 structs that were defined using the __align__ keyword. In nvvp I could actually see that the reads and writes of these data were being coalesced. I did this for the forces and energy terms as well as the coordinates and charge terms. On the host the analogous data for the things that had been float3s were of user-defined vec type, consisting of three floats, but it was desirable to keep them in that form on the CPU side so they were padded with a float to allow copying to the GPU type. I started playing around with streams here, but didn't really see a benefit until there was a multi- threaded CPU implementation. Then it was useful for CPU threads to work in their thread-local default stream, but not as useful as it will be when I can reduce the block register usage, as explained below. Using pinned memory on its own didn't do much, but it made it possible to do asynchronous memCpys and memSets which hid some of the time required for those operations - mainly the forces and energies, which are memset to 0 on every iteration. The values being copied unfortunately can't be hidden very well unless you are running multiple minimizations on the GPU at a time. Another thing that was beneficial was to pass the arrays of splines to the kernel by value, putting them in constant memory. Notably, in the figure showing cumulative speedup I was testing on a randomly chosen dataset, and the performance varies a lot depending on the composition of the dataset, as one of the later figures shows (because having more mobile atoms gives you more things to parallelize over). So the typical speedup on non-pathological (read: not tiny) sets of results is more in the 3-3.5X range for the final version from this optimization procedure. Error bars given by the standard error on the mean of several trials with this same data set, so this probably isn't that accurate - should be done with different datasets.
With the scoring function evaluation on the GPU but the rest of the minimization routine still running on the CPU, data transfers back and forth were still required for every minimization iteration. The GPU would compute the energies and forces, but then these would be transferred back to the CPU to convert the gradient to internal coordinates, perform BFGS, update the Cartesian coordinate representation of the mobile atom positions, and the resulting coordinates would be transferred back to the GPU for another round. You can see from the figure that memCpy and memSet calls were taking a lot of time in this version - the y axis shows the percentage of the time the device spent in that call.
Performing just the scoring function evaluation on the GPU gives approximately a 3x speedup over using just the CPU for the minimization, but profiling shows that cudaMalloc and cudaMemcpy operations currently form a significant part of the cost of running on the GPU. Additionally, by looking at performance as a function of molecular weight (a proxy measure of the number of atoms in the mobile molecule) we can see that the GPU is likely underutilized - in fact, profiling with nvvp shows that register requirements for a block are preventing multiple blocks from being scheduled per SM at a time, so the GPU is at best running at 50% of peak performance. I leave that problem to the side for now, but it's a significant one that still needs to be addressed.
The next thing to be brought over to the GPU was the operations involving the torsion tree. While working on this aspect I started to think about needing to significantly modify the CPU code to make it perform better. The original tree definition actually involves 10 different classes that inherit from each other in various ways and are templated to enable the representation to work for updates to the mobile molecule as I've described, as well as allowing for subregions of the other molecule to be flexible as well, which I haven't worked on yet. The mobile molecule representation is a derived class of the heterotree and atom_range classes, with the heterotree templated on the rigid_body class. The derivative and conformation update functions are recursive calls on the nodes, where nodes in the heterotree are of two different types - one for the root and one for all the rest, due mainly to the fact that nodes within the tree update torsions while the root updates rigid position and orientation. It all seems very elegant but isn't optimized for efficiency as far as I can tell, and I started working on flattening the representation for use on the GPU. Unfortunately things started getting messy here. This is where I had to generate a quaternion class with some operations for the GPU. I also didn't want to have to rewrite everything if I didn't have to, so I decided to try to use Unified Memory to simplify getting the tree onto the GPU. I made a gpuVisible class that redefined new and delete to call cudaMallocManaged, and had classes like the tree (and the overarching "model" class that includes a lot of the other data about the molecules) inherit from that class so that things internal to the class would be allocated on the device. Also I took advantage of the fact that std::vector allows you to specify a custom allocator to define a gvector class that inherits from std::vector with a gpu_alloc function that is a wrapper around cudaMallocManaged with error checking. Armed with these modifications, I created a single GPU tree class that is gpuVisible, with nodes that are objects of two classes found in the original tree definition but whose internal vectors and new/delete calls use cudaMallocManaged. The nodes are stored in a flat array in their BFS order and have a field indicating the indices of their child nodes in the array, used for the gradient generation and coordinate update calls, which propagate up the tree. However, despite the fact that putting the tree operations on the GPU should have significantly decreased the amount of data being moved back and forth on each iteration, this way of doing the tree updates actually significantly reduced performance. I can see from profiling that memory transfers now happen later, when the host or device actually needs the piece of memory. This completely exposes the latency of the transfers since they are no longer happening asynchronously. By losing control over directly calling memCpy or requesting asynchronous copies, the copying is being done in a much less intelligent way. Also I suspect that reusing parts of the original CPU implementation (for example, the definition of the two node classes) is just really inefficient. For example, the internal coordinates of each node are being accessed at the same time, but the classes haven't been organized such that the data are anywhere near each other in memory. Additionally, without the line search required to do the final coordinate update (which decides how much you move in the direction of the gradient) you still have to do transfers back and forth between the device and host on every iteration, and these are only somewhat smaller than before (since they are in internal coordinates rather than Cartesian coordinates). So the only thing to do is probably rewrite the whole tree definition. Actually, I'm wondering if the tree was only useful for the CPU version of the algorithm - the goal was to reduce the coordinate space, but maybe that doesn't matter much on the GPU and the cost of converting back and forth between the two representations might be greater than the payoff for having fewer coordinates in the minimization routine. The figure shows the speedup for several versions on a dataset with normally distributed molecular weights (so it shouldn't be biased for or against the GPU parallelism).
David Ryan Koes, Matthew P. Baumgartner, Carlos J. Camacho. Lessons Learned in Empirical Scoring with smina from the CSAR 2011 Benchmarking Exercise. Journal of Chemical Information and Modeling (2013)
Jocelyn Sunseri, David Ryan Koes. Pharmit: interactive exploration of chemical space. Nucleic Acids Res gkw287 (2016)
Mark Harris. Optimizing Parallel Reductions in CUDA. NVIDIA docs. https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf