Star-P is a client-server system. The client, running on your desktop, tells the high-performance server what to do, and the server spends its time happily crunching numbers. The server doesn’t like to be micromanaged: it works best when it can work on large chunks of data all on its own. It is "expensive" for client and server to communicate: after all, the client-server connection is often the slowest link in the hardware.
To illustrate the point, I’m showing the graph output of Star-P’s ppperf profiling tool. The X-axis is time, and the Y-axis shows the activity on the client, on the network, and on the server for a given code segment. Going from left to right, the middle graph (red) shows the communications activity when a 72 MB matrix is sent from client to server. Then the bottom graph (blue) shows server activity increasing as the server operates on the matrix. As you can see, it takes more time to transmit the data than to do the actual computation.

In this first installment of the Performance Tuning 101 series, we’ll discuss ways to reduce client-server communications traffic. Let’s first create some data using the *p tag. For those unfamiliar with Star-P syntax, the tag causes arrays to be created on the parallel server and distributed over its CPUs.
x = rand(1,n*p); y = zeros(1,n*p);
By creating the data directly on the server, we avoid sending it through the client-server bottleneck.
Now, let’s do some number-crunching on these arrays to show the basic technique to avoid client-server traffic: vectorization. Compare element-wise multiplication in a for-loop
for idx=1:n
    y(idx) = 2*x(idx);
end
to a vectorized multiplication
y = 2*x;
In the for-loop, a command is sent from client to server in each iteration. With n = 1000, that adds up to 1000 client-server calls, which slows the code down. The vectorized code is not only simpler, it also requires only one client-server call.
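The equivalence of the two forms is easy to check. The sketch below uses plain serial MATLAB with no *p tag, and n = 1000 is an assumed value chosen to match the call count mentioned above. The counter `calls` is a hypothetical stand-in for client-server round trips; it is not something Star-P reports.

```matlab
n = 1000;                 % assumed problem size (not fixed by the text)
x = rand(1,n);            % plain serial array; no *p, so no server involved

% Loop version: one element-wise operation per iteration.
y_loop = zeros(1,n);
calls = 0;                % hypothetical tally of per-iteration commands
for idx = 1:n
    y_loop(idx) = 2*x(idx);
    calls = calls + 1;
end

% Vectorized version: a single operation on the whole array.
y_vec = 2*x;

assert(isequal(y_loop, y_vec));   % same result either way
fprintf('loop issued %d operations, vectorized issued 1\n', calls);
```

On a client-server system, each of those per-iteration operations would have to cross the network, which is why the single vectorized statement is the better fit.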

Vectorization not only reduces client-server communications - it also lets you perform parallel operations on large arrays. Consider the following code that could be used to model 3-D random walks:
x = zeros(3,15000);          % end position of each of the 15000 walks
for idx1=1:15000
    y = randn(3,15000);      % 15000 random 3-D steps for walk idx1
    for idx2=1:15000
        x(1,idx1) = x(1,idx1) + y(1,idx2);
        x(2,idx1) = x(2,idx1) + y(2,idx2);
        x(3,idx1) = x(3,idx1) + y(3,idx2);
    end
end
This code took about 105 seconds to run in serial using regular MATLAB®. Just imagine how long it would take on a client-server platform! However, we can replace the for-loops by vectorized and parallelized operations:
x = sum(randn(3,15000,15000*p), 3);   % sum over the third (step) dimension
The speed-up is dramatic: this computation took only 4.5 seconds to run using 8 CPUs. In a sense we are trading memory for speed: we avoid the for-loop by constructing an extra 675-million-element, 5.4-gigabyte array. This is where Star-P’s distributed arrays become very useful. A regular desktop might not have enough memory, forcing you to use the slower for-loops, whereas on a parallel server with large memory you are free to restructure your code to use vectorized constructs.
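As a sanity check on this kind of restructuring, here is a scaled-down sketch in plain serial MATLAB (no *p). The sizes `m` and `k` are assumptions chosen so the check runs quickly, and the steps are generated once up front so that both versions see the same random numbers. The reduction is taken along the third dimension, which is what the inner loop accumulates over. The last lines redo the memory arithmetic for the full-size array, assuming 8-byte doubles.

```matlab
m = 50;  k = 40;                  % assumed small sizes; the article uses 15000 for both
steps = randn(3, m, k);           % steps(:,i,j): j-th 3-D step of walk i

% Loop version, same structure as the nested for-loops.
x_loop = zeros(3, m);
for idx1 = 1:m
    for idx2 = 1:k
        x_loop(:,idx1) = x_loop(:,idx1) + steps(:,idx1,idx2);
    end
end

% Vectorized version: add up all steps of each walk in one call.
x_vec = sum(steps, 3);            % result is 3-by-m

% The two agree up to floating-point rounding.
assert(max(abs(x_loop(:) - x_vec(:))) < 1e-10);

% Memory needed by the full-size intermediate array.
n_elements = 3 * 15000 * 15000;   % number of elements in the 3-D array
n_bytes    = n_elements * 8;      % 8 bytes per double
fprintf('%d elements, %.1f GB\n', n_elements, n_bytes / 1e9);
```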
So far, we have learned a few important lessons:
• Client-server communication is often the "weakest link" of a client-server system.
• When possible, avoid data traffic between client and server.
• Use vectorization.
In our next installments, we’ll take a look at what’s happening on the server: what impact your application types, data sizes, and parallel server architectures have on performance tuning.