Accelerating Cloth Simulation with CUDA

A project submitted in partial fulfillment

of the requirements for Parallel Computer Architecture and Programming


Carnegie Mellon University

By Ian Heath


Research into the simulation of cloth has been ongoing for over two decades. My focus was on how to bring fast cloth processing to the average consumer, with the assumption being that the cloth simulation would be used in interactive experiences in real time. Therefore, I devised a method that makes use of a very common source of parallelism that is within reach of most consumers - the GPU. I based my simulation on semi-implicit Verlet Integration methods as these are common for real-time interactive experiences. The demo ran on a machine with a core i5 3570k processor at stock speeds (3.4-3.8 Ghz), 4 GB of RAM, and an NVIDIA GTX 980.


        The simulation of deformable or non-static solids in general is an area of ongoing research. Soft-body physics are a key component of medical software and surgery simulation techniques, of realistic ragdoll body physics in modern videogames, and software used in the fashion industry to simulate the motion of fabrics. In general, there are many materials in the physical world which are not completely rigid and whose simulation is necessarily computationally complex. Here, I focus on the simulation of cloth. The first methods for digitally simulating the motion of cloth were proposed as early as 1995, and later became a focus of study that produced ever more detailed simulations. However, the goal of my project was not to produce an industrial-grade hyper-realistic simulation, but a pleasing and fast-running one. For this reason, I chose to focus on a cloth model and integration method that has been tried and tested in real world digital entertainment. My goal was to use a resource that would likely be available to any digital entertainment enthusiast as a means for accelerating the simulation, and of course, the Graphics Processing Unit was an obvious choice.

Verlet Integration


        In any physical simulation in which moving bodies must be tracked, we will require some kind of integration scheme to determine where those bodies should be with respect to the current timestep of the system. Verlet integration does this by tracking both the current and last position of bodies in the system, and their current acceleration.


Verlet Integration tracks the previous and current positions of bodies and infers velocity.

(Image Source:

In this way, it is a semi-implicit method - we don’t track velocity of the bodies directly, we infer it from the difference in their positions over the previous and current timestep, and at each step of this system, account for acceleration forces (such as gravity) by adding them to the velocity through a single integration (multiplying the acceleration by the time step to get the velocity contribution). Verlet Integration does not have the advantage of being able to take arbitrary time steps , but this was not important for my approach to cloth - my goal was simply to assess what kind of speedup the GPU could provide over the CPU.

The Cloth Model

        There are various methods for modelling the internal dynamics of fabric, but one of the most popular is the pointmass-spring method. In this method, the cloth is treated as a collection of pointmasses, or points in space with some associated mass (on which gravity can act, for example), and constraints between these pointmasses (i.e., the springs) which dictate how far they can travel from each other. In the model I used in particular, pointmasses were arranged in a uniform grid and constraints were defined pairwise on the pointmasses (any one constraint is between exactly two pointmasses). The arrangements of the constraints on the pointmasses dictates the behavior of the cloth in simulation.

A portion of the pointmass-spring cloth model.

(Image Source:

In the model I used, each pointmass had a constraint connecting it to all of its primary neighbors (the neighbor to the right, the neighbor below, the neighbor to the right and below, and the neighbor to the left and below, assuming each neighbor exists). I call these constraints the “primary” constraints since they go between directly adjacent pointmasses (in this case adjacent also encompassing diagonally adjacent), and these constraints are what cause the cloth to tend to stay in one plane and keep all the pointmasses connected to each other. There are also secondary constraints, with these mirroring the behavior of the primary constraints, except that these constraints connect a pointmass with its neighbors once-removed (they skip one pointmass).  These constraints act to allow the cloth some flexibility (that is, to “bend). The entire arrangement is summed up by the image above, showing a portion of the cloth model.

Constraint Satisfaction

        The interesting behavior of the cloth comes from the disturbance of some pointmass, either through collision with some kind of geometry (in my simulation, I used simple pointmass-sphere collisions), an external acceleration (such as gravity), or other force directly applied to the pointmass (like virtual “wind”). Some constraints will no longer be satisfied, and surrounding pointmasses must move in such a way as to satisfy the constraints once more (in this way, a disturbance in the position of one pointmass will “propagate” over the entire cloth as the constraints are re-satisfied).


Satisfying a broken constraint.

(Image Source:

In general, we stept the simulation forward in time by calculating new positions of pointmasses due to disturbances and then re-satisfying any broken constraints. We do this by moving the two pointmasses the constraint applies to closer to each other by finding the vector between them, and moving each in the direction of the other by half of the difference between the rest length of the constraint (what the distance between the pointmasses should be as described by the constraint) and their actual distance. However (and as some readers may already have noticed), there is an issue that can arise from this approach. If we iterate through the constraints and satisfy each one in turn, we may perturb some pointmass while trying to satisfy the current constraint, such that a constraint we previously satisfied will be unsatisfied once more. This problem is

Satisfying constraint on constraint can break another (unsatisfied constraints shown as disconnected lines).

(Image Source:

solved through a method called “relaxation”, or satisfying the constraints multiple times. Every time we iterate through the constraints and satisfy them again, we come closer and closer to satisfying them all simultaneously. This method makes it easy to vary the computational complexity of the algorithm vs the realism of the results. Additionally, the fewer times we satisfy constraints the “stretchier” the cloth will be (as points are allowed to deviate from their constraints due to less stringent constraint satisfaction), and the more times we satisfy them, the more rigid the cloth becomes. This iterative constraint satisfaction incurs the greatest computational cost intrinsic to this method of cloth simulation. Therefore, we will focus on how to parallelize it.


My focus was on how to parallelize the constraint satisfaction process. I had already decided on using CUDA to this end, and I knew that I would be rendering the results with OpenGL.  The machine used for rendering was my own desktop, since it is the kind of machine a consumer interested in videogame-grade cloth rendering might own. I built my GPU-parallel cloth simulator by building up from an existing sequential simulator which can be found here. At first, I assumed that I would simply be able to use CUDA to satisfy all of the constraints simultaneously, and my work was going to focus on adding features like cloth self-collision. However, I quickly found out that parallelizing the constraint relaxation algorithm would need to be my primary focus. Below is the result of naively trying to satisfy all  of the constraints at once.


                                                            Seems a little off.

What I was doing here was making a stack (that is, a one-dimensional list) of CUDA blocks of 32 threads each (to match the fact that each of the 980’s 16 SMX cores can each execute 32 threads simultaneously), with enough total blocks match every constraint. The threads would find their unique thread IDs and satisfy the constraint they were assigned by moving the corresponding pointmasses.

        This approach, of course, was destined to fail miserably. The problem is that any given pointmass participates in multiple constraints. While one CUDA thread is busy satisfying one constraint, another CUDA thread might be busy satisfying another, with one of the pointmasses participating in both constraints. If this is the case, then we have a race condition. One or the other of the threads might use a stale value for the position of the pointmass, or one could compute a result, move the pointmass, and have its work overwritten by the other when it completes its computation. Clearly, the sequential algorithm had to be modified so that no pointmasses would be operated on simultaneously by two separate CUDA threads trying to satisfy two separate constraints.

        My first thought was to use locks and some ordering on threads. This way, only one CUDA thread could be working with a given pointmass at any given time, and deadlock could be avoided by having lower indexed threads defer control to higher indexed threads, for example. However, locks are a foolish idea in CUDA, given that they can potentially starve out much of the opportunity for parallelism by preventing the rest of the threads in the block from running, even if they don’t need the lock. Moreover, I didn’t want to slow things down with the overhead of locks in general. Thus, I came up with a new approach.

Mutually Exclusive Constraint Satisfaction

        The answer to implementing a lock-free and race-free parallel constraint satisfaction algorithm lay in the idea of creating groups of constraints which had no common endpoints. In essence, the problem consisted of how to minimally bucket all of the constraints such that each bucket contained only constraints which could be satisfied in parallel (no pointmasses in common), and that over all of the buckets, every constraint was represented. My solution creates 16 mutually exclusive constraint buckets. For the primary constraints, these are the “even verticals”, or vertical constraints with the first pointmass in an even row, the “odd verticals” (first pointmass in an odd row), and so on, for the horizontals, diagonals, and anti-diagonals. The secondary constraints are bucketed in essentially the same way. Because every pointmass could potentially have one of each kind of constraint attached to it, we cannot satisfy two buckets at once.

A 10x10 (pointmass) visualization of the 8 buckets for the primary constraints. There are 8 more buckets for the secondary constraints which operate in much the same way.

However, MECS cuts down the runtime of the algorithm by at least a factor of 16. Since constraints can just consist of two pointers into an array of pointmasses, these can be generated and stored in GPU global memory before the simulation starts, after which they do not need to be transferred again. However, since I left collision and openGL commands to the CPU, the pointmass array did need to be copied to the GPU before it began satisfying constraints, and copied back to RAM when the GPU was done. It is also important to note that the GTX 980 can essentially satisfy up to (32 threads/warp  * 16 SMX cores) = 512 constraints at once, however, meaning that so long as our buckets contain 512 constraints or less, constraint satisfaction can be done in essentially constant time, limited primarily by the overhead of transferring the pointmass array to the GPU and back. We will see that this was in fact the case in the results section.


        It will be informative to first look at the raw gains produced by using the GPU to perform constraint satisfaction to get an overall idea of how the GPU performs. With this information in hand, we can then break down how the algorithm performs in comparison to the other portions of the simulation in more detail, and reach a conclusion about the limits of the potential speedup afforded by the GPU. Let’s first take a look at the frames per second (FPS) achieved across a variety of cloth resolutions. Here, resolution refers to the number of pointmasses used to simulate the cloth. Note that with all of the following results, no parallelism was employed on the CPU (only a single core was used).

Here, we see that at a resolution of 40x40, satisfying constraints sequentially on the CPU actually yields better performance than using the GPU to do so. We will see exactly why this is the case in a moment. Note that at all other cloth resolutions, the GPU beats out the CPU.

GPU (left) Vs. CPU (right) at a cloth resolution of 80x80. Framerates displayed in upper right-hand corner.

Now let’s take a look at the FPS chart from another perspective - relative performance increase - in order to discover a another trend.

Here we can see that not only does the GPU overtake the CPU, it continues to do better and better (relatively) as the cloth resolution increases. These two graphs still don’t tell the whole story, though. Relative performance increase doesn’t mean much if we’re talking about 0.8 FPS vs. 0.2 FPS. Let’s dig a little deeper to discover what’s really going on here. Below is a breakdown of times required for various components of the simulation across all cloth resolutions (note that I include two versions of this graph, one to get a good idea of the trend, and one including the CPU’s 160x160 trial, which tends to make it hard to tell what’s going on in the other conditions in the first graph).

Here we can see that the time taken to satisfy constraints on the GPU, in all conditions except for the 160x160 case, is practically non-existent. The CPU’s constraint satisfaction time scales roughly linearly with the increasing number of pointmasses (and thus constraints), while the GPU, even at 160x160, spends virtually no time on constraint satisfaction (less than or equal to one millisecond). We also observe that the time taken to copy the pointmass array to the GPU and back (GPU Data Copy) remains roughly constant in each simulation. This implies that we are not looking at a transfer rate issue (we come nowhere near saturating the bus with data), but instead take only the penalty of the overhead involved in sending any message to the GPU in general. We also see why the CPU beats out the GPU in the  40x40 test: the overhead of transferring data to the GPU negates the advantage of its lightning quick constraint satisfaction ability. Framerates at the 160x160 resolution were already in the territory of “not realistic for real-time interaction” at 13.8 FPS on the GPU and 3.62 FPS on the CPU, so I did not push my simulation past these resolutions.

        What can we extrapolate from these results? The 160x160 case is very informative. As the resolution of the cloth increases, the computation time becomes dominated by the time it takes to draw the cloth to the screen (that is, for the CPU to iterate through the OpenGL drawing commands). The overhead associated with transferring data to the GPU remains roughly constant, and the time taken to check collision with the ball does increase, but by far the largest performance penalty as cloth resolution increases is the rendering performance of OpenGL. Thus, I believe that there is a potential for huge performance gains that remains untapped. I am by no means an experienced OpenGL programmer, but a few thousand triangles should be easy for any modern day card to render (most videogames today have more triangles than this in a single character). Therefore, I believe that with improved OpenGL rendering code, we could be rendering extremely high cloth resolutions in real time using CUDA to satisfy the cloth constraints, and at high FPS (160x160 at 30 FPS would be easily achievable). Future work remains to determine if this is in fact the case, but simple reasoning about the number of triangles that a modern day card should be able to render in real time dictates that there is no reason it shouldn’t be. Thus, high-resolution real time cloth should truly be attainable in interactive experiences for the common consumer.