Igor's Website - Articles - Million particles in CUDA and OpenGL

Science, stories, art and music.

Science / Computer science articles.

A million particles in CUDA and OpenGL

Introduction

Here I will demonstrate how to make a particle system using CUDA and OpenGL (thereby demonstrating the CUDA – OpenGL interop). Further, I will demonstrate the same particle system on CPU, utilizing OpenMP for parallelization. This could be considered a step-by-step introduction on how to make a particle system using CUDA and OpenGL.

Requirements

To see this example work, you will need a CUDA capable GPU. This particle system was tested on an NVidia GeForce 570 GTX card. I managed to run more than four million particles smoothly. As for the CPU, I tested this on an AMD quad core, and managed to run sixty-five thousand particles.

In order to compile the code you will need the CUDA SDK. If you’re using Visual Studio 2012, setting up might be somewhat troublesome. I have provided links for setting up the environment at the end of the article.

Also, here I assume you know some basic elements of how OpenGL, CUDA and their interoperation work. I have provided links to other articles that inspired this and could perhaps help others in learning OpenGL and CUDA.

Particle system model

Basically, the idea of this is to create something fun, so I decided to model the particle system to follow a given point in the window (similar to my augmented reality program), where particles will retain some momentum. I wanted to change the color of the particles according to their speed and distance from the point to follow.

Screenshot

Here is a video demonstration of the particle system:

For the beginning, we will create a simple mesh for the particles, and give each particle its x and y coordinates in this mesh. For each particle we need to define its position, velocity and color. Position and color will be stored in a Vertex Buffer Object which CUDA and OpenGL will use in interoperation. Since a shared buffer is used this should be very fast. The array of velocity vectors will be stored on the device for fast access from CUDA kernels.

//Constants
const unsigned int mesh_width = 1024;
const unsigned int mesh_height = 1024;

float rnd1[mesh_width*mesh_height];
float rnd2[mesh_width*mesh_height];

//Mouse controls
int mouse_x, mouse_y;
int buttons = 0;
float translate_z = -3.0;

//VBO variables
GLuint vbo;
void *d_vbo_buffer = NULL;

float anim = 0.0;

//Device pointers
float4 *d_array;
float *d_rnd1, *d_rnd2;

The particles from this mesh will start moving once the program is run. We want to translate each particle in the mesh by a random, small amount, so that particles are not too orderly positioned. Since CUDA does not allow random functions in kernels, we will have to create the random translations on the host device.

User controls

We want the user to be able to control the flow of the particles. Let’s define some basic controls for these particles. We will define the left mouse button to be the one for changing the position of the target point (the point to which all the particles will move), right mouse button for stopping the system and zooming, and the middle mouse button for resetting the velocities to zero.

Other than that, we might want to assemble the particles to their original positions and velocities, so let’s assign the keyboard ‘a’ that purpose.

void keyboard(unsigned char key, int, int)
{
    switch(key) {
    case(27) :
      exit(0);
      break;
	  case('a'):
  	  buttons=(buttons!=10)?10:0;
  		break;
    }
}

void mouse(int button, int state, int x, int y)
{
    if (state == GLUT_DOWN) {
        buttons |= 1<<button;
    } else if (state == GLUT_UP) {
        buttons = 0;
    }

    mouse_x = x;
    mouse_y = y;
    glutPostRedisplay();
}

void motion(int x, int y)
{
    float dx, dy;
    dx = x - mouse_x;
    dy = y - mouse_y;

    if (buttons & 4)
        translate_z += dy * 0.01;

    mouse_x = x;
    mouse_y = y;
}

We need to write functions for keyboard and mouse and set them as callback functions for GLUT.

Initialization and cleanup

First, we will have to do basic initialization for the GLUT and GLEW.

glutInit(&argc, argv);

//Setup window
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(window_width, window_height);
glutCreateWindow("Million particles");

//Register callbacks
glutDisplayFunc(display);
glutKeyboardFunc(keyboard);
glutMouseFunc(mouse);
glutMotionFunc(motion);

//GLEW initialization
glewInit();
if (!glewIsSupported("GL_VERSION_2_0 ")) {
    fprintf(stderr, "ERROR: Support for necessary OpenGL extensions missing.");
    fflush(stderr);
	exit(0);
}

//Clear
glClearColor(0.0, 0.0, 0.0, 1.0);
glDisable(GL_DEPTH_TEST);

//Viewport
glViewport(0, 0, window_width, window_height);

//Projection
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(60.0, (GLfloat)window_width / (GLfloat) window_height, 0.1, 10.0);

We need to set the main window for drawing the particles, register callbacks mentioned above, clear the screen and set viewport and projection. Note that a display function is registered here. This is the function which is executed for each redrawing of the particles.

void createVBO(GLuint* vbo)
{
	//Create vertex buffer object
	glGenBuffers(1, vbo);
	glBindBuffer(GL_ARRAY_BUFFER, *vbo);

	//Initialize VBO
	unsigned int size = mesh_width * mesh_height * 4 * sizeof(float);
	glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW);

	glBindBuffer(GL_ARRAY_BUFFER, 0);

	//Register VBO with CUDA
	cudaGLRegisterBufferObject(*vbo);
}
 

Then, we need to create and initialize the Vertex Buffer Object (VBO) which will be used for CUDA and OpenGL interop. We bind the buffer, and get the pointer which will be used by CUDA kernels to interact with the buffer.

 void initialize(GLuint vbo)
{
    //Map OpenGL buffer object for writing from CUDA
    float4 *dptr;
    cudaGLMapBufferObject((void**)&dptr, vbo);

    //Run the initialization kernel
    dim3 block(8, 8, 1);
    dim3 grid(mesh_width / block.x, mesh_height / block.y, 1);
    initialize_kernel<<< grid, block>>>(dptr, mesh_width, mesh_height, anim, d_array, d_rnd1, d_rnd2);

    //Unmap buffer object
    cudaGLUnmapBufferObject(vbo);
}
 

Initialization of the buffer we will do using a simple CUDA kernel. For each kernel that interacts with the VBO we will have to map the buffer object to CUDA, execute the kernel and then unmap the buffer.

__global__ void initialize_kernel(float4* pos, unsigned int width, unsigned int height,
        float time, float4* vel, float* rnd1, float* rnd2)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	//Calculate the initial coordinates
    float u = x / (float) width + rnd1[y*width+x];
    float v = y / (float) height + rnd2[y*width+x];

    //Calculate a simple sine wave pattern
    float freq = 2.0f;
    float w = sinf(u*freq + time) * cosf(v*freq + time) * 0.2f;

	//Set the initial color
	Color temp;
	temp.components = make_uchar4(0,255,255,1);

	//Set initial position, color and velocity
	pos[y*width+x] = make_float4(u, w, v, temp.c);
	vel[y*width+x] = make_float4(0.0, 0.0, 0.0, 1.0f);
}
 

The initialization kernel is rather straightforward. We determine the x and y indices of the particle the current thread will be processing and then determine its starting coordinates and color. (On a side note: in another article I found a sine wave pattern for a mesh of particles and decided that this is a nice effect to use, especially when particles are stopped to zoom in or out; gives them a “breathing” effect.)

initGL(argc, argv);

//Create VBO
createVBO(&vbo);

//Initialize random arrays
for (int i=0;i<mesh_height*mesh_width;++i)
	rnd1[i]=(rand()%100-100)/2000.0f;
for (int i=0;ii<mesh_height*mesh_width;++i)
	rnd2[i]=(rand()%100-100)/2000.0f;

//CUDA allocation and copying
cudaMalloc(&d_array, mesh_width*mesh_height*sizeof(float4));
cudaMalloc(&d_rnd1, mesh_width*mesh_height*sizeof(float));
cudaMemcpy(d_rnd1, rnd1, mesh_height*mesh_width*sizeof(float), cudaMemcpyHostToDevice);
cudaMalloc(&d_rnd2, mesh_width*mesh_height*sizeof(float));
cudaMemcpy(d_rnd2, rnd2, mesh_height*mesh_width*sizeof(float), cudaMemcpyHostToDevice);

initialize(vbo);
 

In the main() function we will call the above basic initialization function, initialize our random displacement arrays, copy the arrays to the graphics device and initialize the VBO.

After this, we can run the glutMainLoop() which will draw the particles until the window is closed.

	cudaFree(d_array);
	cudaFree(d_rnd1);
	cudaFree(d_rnd2);
	return 0;

When the window is closed, before closing the application, we first free the CUDA allocated arrays.

Drawing the particles

As I said above, the display function is called for each drawing of the particle system. This is where the particles are positioned and drawn to the screen.

static void display(void)
{
    //Process particles using CUDA kernel
    particles(vbo);

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    //View matrix
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    glTranslatef(0.0, 0.0, translate_z);
    glRotatef(90.0, 1.0, 0.0, 0.0);

    //Render from VBO
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexPointer(3, GL_FLOAT, 16, 0);
    glColorPointer(4,GL_UNSIGNED_BYTE,16,(GLvoid*)12);

    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glDrawArrays(GL_POINTS, 0, mesh_width * mesh_height);
    glDisableClientState(GL_VERTEX_ARRAY);

    glutSwapBuffers();
    glutPostRedisplay();

    time += 0.01; //Increase the timer for the wave pattern
}

We need to set the view, rotation and translation of the camera and then render from the buffer object. The buffer object contains the array of vertices and their colors. Functions glVertexPointer() and glColorPointer() determine how these values are distributed in the buffer array. At the bottom of this article I provided links to the NVidia site where the functions are elegantly described. Using glDrawArrays() we draw the content of the VBO according to the parameters specified in the above two functions. Function particles() is very similar to the initialize() function, however it calls the particles_kernel() kernel which is the CUDA kernel in charge of positioning the particles and calculating their velocities and color.

Positioning the particles

I left this part for the end because this is where the fun happens. I will explain main pieces of the code here. From there you can make your own effects.

const float speed = 0.0005f;
const float threshold = 0.1f;
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

float u = x / (float) width;
float v = y / (float) height;

As in the initialization kernel, we need to determine which particle we are dealing with and then work with its coordinates, velocity and color.

if (buttons==10)
{
    vel[y*width+x].x=0;
    vel[y*width+x].z=0;
    dx = -pos[y*width+x].x + u;
    dz = -pos[y*width+x].z + v;
    length = sqrtf(dx*dx+dz*dz);
    pos[y*width+x].x+=dx/length*speed*10;
    pos[y*width+x].z+=dz/length*speed*10;
}
 

The condition (buttons==10) is a check for the keyboard ‘a’ button which will tell the particles to reassemble to the starting position. We add a velocity vector pointing towards the particle’s original position (u, v) to the particle’s position. In a few iterations the particle should return to its original position. Note that the actual velocity vector is set to zero and the temporary velocity vector is calculated for each execution of the given particle’s thread. This is to avoid building momentum.

float2 normalized = make_float2(dx/length*speed, dz/length*speed);
vel[y*width+x].x+=normalized.x;
vel[y*width+x].z+=normalized.y;
dx = vel[y*width+x].x;
dz = vel[y*width+x].z;

For the movement of the particles towards the target point we need to have the momentum buildup, so we will use the velocity vector. However, we want the speed to be constant and always point towards the target point, so we need to normalize the vector.

if (velocity>threshold)
{
    vel[y*width+x].x=dx/velocity*threshold;
    vel[y*width+x].z=dz/velocity*threshold;
}
if (pos[y*width+x].x<-5.0f && vel[y*width+x].x<0.0)
    vel[y*width+x].x=-vel[y*width+x].x;
if (pos[y*width+x].x>5.0f && vel[y*width+x].x>0.0)
    vel[y*width+x].x=-vel[y*width+x].x;

We don’t want the particles to go too crazy, so we will limit their speed and area of movement. (In this example I limited the area only for the x axis. Limiting it for the y axis is trivial and can be achieved by a few copy-pastes.)

Finally, if the middle button is pressed, we want the particles to converge rapidly to the target point (to reset the effect).

vel[y*width+x].x=0;
vel[y*width+x].z=0;
pos[y*width+x].x+=dx/length*speed*10;
pos[y*width+x].z+=dz/length*speed*10;
Color temp;
temp.components = make_uchar4(255/length,255/length, 255, 10);
pos[y*width+x].w = temp.c;

The right button effect mentioned earlier (for stopping the particles) is not coded as a check, but is achieved by the fact that when the right button is clicked none of the other conditions will be met, and hence the particles’ positions will not be updated (only the sine wave pattern will be updated, which will give the particles the “breathing” effect I mentioned earlier).

union Color
{
	float c;
	uchar4 components;
};

Color temp;
temp.components = make_uchar4(128/length,(int)(255/(velocity*51)),255,10);
pos[y*width+x].w = temp.c;

Since we are using colors as structs of unsigned bytes and the buffer consists of float4 (structs of four floats), an elegant cast between a float and uchar4 is a union of those two.

The full particles_kernel() is given below.

__global__ void particles_kernel(float4* pos, unsigned int width, unsigned int height,
        float time, float X, float Y, float4* vel, int buttons)
{
    const float speed = 0.0005f;
    const float threshold = 0.1f;
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    float u = x / (float) width;
    float v = y / (float) height;

	float xX = (X - width/2 + 128)/(float)width*4.5f;
	float yY = (Y - height/2 + 128)/(float)height*4.5f;
	float dx = -pos[y*width+x].x + xX;
	float dz = -pos[y*width+x].z + yY;
	float length = sqrtf(dx*dx+dz*dz);
	if (buttons==10)
	{
		vel[y*width+x].x=0;
		vel[y*width+x].z=0;
		dx = -pos[y*width+x].x + u;
		dz = -pos[y*width+x].z + v;
		length = sqrtf(dx*dx+dz*dz);
		pos[y*width+x].x+=dx/length*speed*10;
		pos[y*width+x].z+=dz/length*speed*10;
	}
	else if (!(buttons & 4) && !(buttons & 6))
	{
		float2 normalized = make_float2(dx/length*speed, dz/length*speed);
		vel[y*width+x].x+=normalized.x;
		vel[y*width+x].z+=normalized.y;
		dx = vel[y*width+x].x;
		dz = vel[y*width+x].z;
		float velocity = sqrtf(dx*dx+dz*dz);
		if (velocity>threshold)
		{
			vel[y*width+x].x=dx/velocity*threshold;
			vel[y*width+x].z=dz/velocity*threshold;
		}
		Color temp;
		temp.components = make_uchar4(128/length,(int)(255/(velocity*51)),255,10);
		if (pos[y*width+x].x<-5.0f && vel[y*width+x].x<0.0)
			vel[y*width+x].x=-vel[y*width+x].x;
		if (pos[y*width+x].x>5.0f && vel[y*width+x].x>0.0)
			vel[y*width+x].x=-vel[y*width+x].x;
		pos[y*width+x].x+=vel[y*width+x].x;
		pos[y*width+x].z+=vel[y*width+x].z;
		pos[y*width+x].w = temp.c;
	}
	else if (!(buttons & 4))
	{
		vel[y*width+x].x=0;
		vel[y*width+x].z=0;
		pos[y*width+x].x+=dx/length*speed*10;
		pos[y*width+x].z+=dz/length*speed*10;
		Color temp;
		temp.components = make_uchar4(255/length,255/length, 255, 10);
		pos[y*width+x].w = temp.c;
	}

    float freq = 2.0f;
    float w = sinf(u*freq + time) * cosf(v*freq + time) * 0.2f;

	pos[y*width+x].y=w;
}

OpenMP CPU implementation

The only modification in the CPU implementation of the particle system is in the initialize(), particles() and display() functions. Since we cannot launch CUDA kernels, we have to inline their functionality in the initialize() and particles() functions. We iterate using for loops through the array of the particles.

#pragma omp parallel for
for (int x=0;x<mesh_width;++x)
    for (int y=0;y<mesh_height;++y)
    {
        ...
    }

If the outer loop is parallelized, we get performance gains.

However, since we are not using CUDA, we will not draw from the buffer. Instead, we will draw from an array kept in the host memory and iterate through it using two for loops.

On my CPU, I managed to produce a particle system of 65536 particles using this implementation. It could handle a bit more, but I didn’t test it.

Related articles

I found these articles helpful in making this particle system and understanding the CUDA and OpenGL interop:

Source code and license

The source code and the application are available here: million particles.zip

10.3.2013

Google Code Prettify

I use Google Prettify to format the source code in my articles. If the code is displaying in one line, you can try opening the page in a different browser.

Request software design

If you wish to have a specific application designed, contact me at software@igorsevo.com. If you want to know more about what I do, check out my home page and Science page.

Support this site

Suggest an article

You can suggest an article or ask a question on the Questions page.

Social