Parallel Lounge: Parallel Computing Blog for Engineers, Scientists, Analysts

Current Posts |  RSS Feed

ISC on Intel's BlogTalkRadio

Posted by David Rich



The other week, I was interviewed on "Parallel Programming Talk Radio" by Aaron Tersteeg and Clay Breshears -- the show is 15 minutes and there are a few minutes of introduction before and a few after. The resulting ~10 minutes or so felt like an awfully short time to talk about Star-P

If you're new to Star-P, then this is perhaps a fast way to get an introduction, but if you already know what we're about, then there may not be enough new to make it worthwhile.  On the other hand, since all of you are obviously parallel processors, why not play the interview, do your email and simultaneously run simulations in parallel on Star-P?

You can listen to the interview right here (see below) or if you'd like to click through BlogTalkRadio (and see what else goes on there), click here.

 



Article has 0 comments. Click To Read/Write Comments

Distributed Parallel Search at IDF

Posted by David Rich



I attended IDF this past week and while there was quite a bit of noise about Nehalem and Intel's new 3GL parallel development tools, you can read about that stuff anywhere.

I suppose you can also read somewhere about Jeffrey Katzenberg of Dreamworks giving a talk during the software keynote (about this)-- wow, I can't wait until we have that kind of marketing budget.  There were polarized glasses under all our chairs and they slid out a movie theater sized screen to show 3D scenes from Kung Fu Panda.  Kind of a different feel from Werner's Intel Cluster Ready talk.

On entering the exhibition area, as usual, I went directly to the little booths at the back.  The nice thing about the little booths is that you probably haven't seen what they are showing before and usually, you get to talk to people that are the ones directly working with the technology.

The exhibit I found most interesting was one put on by an Intel "lablet" located somewhere in or near CMU.  The overall project is called Diamond and it is directed at interactive search of non-indexed data. Richard Gass and Mei Chen were demonstrating an application built on top of Diamond called Interactive Search Assisted Decision Support (ISADS) which allows very large numbers of images to be searched to find ones similar to an image of interest.

At the show, they had a special camera which they used to capture images of moles (yes, a charming photo of a mole on my forearm was taken...).  The system searches a large database of other images to find a similar match.

The search infrastructure lets the main application send "searchlets" to a distributed set of nodes that each search locally for matches. The fancy part is some interactivity on the balance between accuracy and number of matches. 

You don't get high performance without a good balance between CPU, memory and i/o.  Often, the CPU side gets too much attention.  It's always good to see people working on the full problem.  Hopefully readers of this blog already know that Star-P supports parallel file i/o!

More info on the Diamond project can also be found here



Article has 0 comments. Click To Read/Write Comments

Performance Tuning 101: Client-Server Communications and Vectorization

Posted by Roope Astala



Star-P is a client-server system. The client, running on your desktop, tells the high-performance server what to do, and the server spends its time happily crunching numbers.  The server doesn’t like to be micromanaged: it works the best when it can work on large chunks of data all on its own. It is "expensive" for client and server to communicate: after all, the client-server connection is often the slowest link in the hardware.

To illustrate the point, I’m showing the graph output of Star-P’s ppperf profiling tool.  The X-axis is time, and the Y-axis indicates the activity spent on the client, on the network, and on the server, for a given code segment. Going from left to right, it shows the communications activity - middle graph, in red - when a 72 MB matrix is sent from client to server. Then, the bottom (blue) graph shows the server activity increase in order to operate on the matrix. As you can see it takes more time to transmit the data than to do the actual computation.



In this first installment of Performance Tuning 101 series, we’ll discuss ways to reduce client-server communications traffic. Let’s first create some data using the *p tag: for those unfamiliar with Star-P syntax, it causes arrays to be created on the parallel server and distributed over CPUs.

x = rand(1,n*p);y=zeros(1,n*p);

By creating the data this way - directly on the server – we avoid data traffic through the client-server bottleneck.

Now, let’s do some number-crunching on these arrays to show the basic technique to avoid client-server traffic: vectorization. Compare element-wise multiplication in a for-loop

for idx=1:n
     y(idx) = 2*x(idx);
end


to a vectorized multiplication

y = 2*x;

In the for-loop, a command is sent from client to server in each iteration. This totals 1000 client-server calls, slowing down the code. Not only is the vectorized code simpler, but it requires only one client-server call.



Vectorization not only reduces client-server communications - it also lets you perform parallel operations on large arrays. Consider the following code that could be used to model 3-D random walks:

X=zeros(3,15000,15000);
for idx1=1:15000
     y = randn(3,15000);
     for idx2=1:15000
         x(1,idx1) = x(1,idx1) + y(1,idx2);
         x(2,idx1) = x(2,idx1) + y(2,idx2);
         x(3,idx1) = x(3,idx1) + y(3,idx2);
     end
end


This code took about 105 seconds to run in serial using regular MATLAB®. Just imagine how long it would take on a client-server platform! However, we can replace the for-loops by vectorized and parallelized operations:

X = sum(randn(3,15000,15000*p))

The speed-up is dramatic, this computation took only 4.5 seconds to run using 8 CPUs. In a sense we are trading memory for speed: we avoid the for-loop by constructing an extra 675 million element, 5.4 gigabyte array. This is where Star-P’s distributed arrays become very useful: a regular desktop might not have enough memory, so you’re forced to use the slower for-loops, whereas on a parallel server with large memory, you are free to restructure your code to use vectorized constructs.

So far, we have learned a couple of important lessons:

• Client-server communications is often the "weakest link" of a client-server system.
• When possible, avoid data traffic between client and server.
• Use vectorization.

In our next installments, we’ll take a look what’s happening on the server: what impact do your application types, data sizes and parallel server architectures have on performance tuning.



Article has 0 comments. Click To Read/Write Comments

Large Graphs in Star-P

Posted by Viral Shah



Many of you use Star-P, Matlab, Python, R, among other languages for numerical computing. Often, in many scientific applications, the behavior of individual elements of a system is explained with simple rules. However, when assembled into a system, they collectively exhibit complex behavior. For example, Newton wrote down the equation for the gravitational attraction between two particles, and it is deceptively simple. But, interesting things happen when a whole bunch of these interact with each other. Feel free to post your own examples in the comments.

This blog entry is not about n-body simulations, but about combinatorial computing augmenting numerical computing. At a basic level, graphs are a simple way to capture relationships between elements in a system. In a dynamic system, this graph may also evolve over time.

Most languages for numerical computing (including Star-P) support sparse matrices as first class citizens. It turns out, that sparse matrices and graphs are one and the same. So, its quite convenient to represent graphs as sparse matrices. This approach has many benefits; you don't have to write new data structures to store graphs, and algorithms to operate on graphs. Sparse matrix operations can be used to implement graph algorithms !

Lets say, we have a collection of edges (u, v, w) - There is an edge of weight w between nodes u and v. One way to store this graph is to just store the tuples (u, v, w) in dense vectors. However, not much can be done with a graph in such a data structure. To convert this into a sparse matrix, we use the sparse() command:

>> G = sparse (u, v, w)

An interesting thing to do immediately at this point is to visualize the graph with the spy command. Lets create a random graph for the purpose of illustration.

>> n = 1000;                         % Use n = 1000*p in Star-P
>> U = ceil(n*rand(10*n,1));   % from nodes
>> V = ceil(n*rand(10*n,1));   % to nodes
>> G = sparse(U, V, 1, n, n);  % n-node graph with ~10n edges

So far, we have created a graph with 1,000 nodes and 10,000 random edges between them. We use spy and spyy for visualization. spy is a Matlab built-in which displays every nonzero in a sparse matrix (every edge in a graph) as a dot. Clearly, when the graphs are very large, with tens of millions of nodes and edges, such visualization becomes meaningless. spyy is a 2D histogram of nonzero/edge densities.

>> spy (G)                     % in Matlab



>> spyy (G)                    % in Matlab or Star-P
Now, here's why all this is so cool. You can create distributed graphs in Star-P from edge tuples stored in distributed dense vectors. In other words, u, v, and w can be distributed dense vectors, and G will be a distributed sparse matrix, or a distributed graph. This idea can be developed further to do lots of interesting things. If this is something interesting, email me, or leave comments, and I will post future blog entries on the subject.



Article has 0 comments. Click To Read/Write Comments

Parallel computing at national labs and research centers

Posted by Ilya Mirman



We recently dropped by some of our customers' sites, and interviewed them regarding their organizations' work, computational challenges, tools.  Here are some video we put together from these discussions.

Whitehead MIT Bio-Imaging Institute
The Whitehead Institute-MIT BioImaging Center unites the power of leading-edge microscopy and advanced computational systems to study the structure, dynamics, and function of molecules in challenging problems faced by biology, medicine, and bioengineering.

We spoke with James Evans, Assistant Director at the Institute.


San Diego Supercomputer Center:
Founded in 1985, the San Diego Supercomputer Center (SDSC) enables international science and engineering discoveries through advances in computational science and high performance computing.

We spoke with Nancy Wilkins-Diehr, Associate Director of Scientific Computing at SDSC.




Here's John Gilbert, Viral Shah, and Brad McRae talking about their work at UC Santa Barbara and the National Center for Ecological Analysis & Synthesis:






And last but not least, Jack Collins of the National Cancer Institute:






Article has 0 comments. Click To Read/Write Comments

Types of Parallelism, and What’s Supported in Star-P Today

Posted by Steve Reinhardt



Even restricting ourselves to scientific, engineering, and mathematical users, as Star-P does today, there are many types of applications exhibiting many types of parallelism. The last 30+ years have seen many academic and industrial projects that created programming constructs for parallel processing. (Prominent examples being MPI, OpenMP, Unified Parallel C (UPC), Co-Array Fortran (CAF), and DCT.) Star-P cannot hope to incorporate all of these any time soon, but we clearly have to support common types of parallelism or Star-P won’t be very useful. This series of posts will describe the types of parallelism we support today (and some uses of those that may not be obvious), and important types of parallelism we don’t support now that we want to support soon.

Task parallelism (sometimes known as control parallelism) is probably the most general type of parallelism, in that all of the cooperating processes can be doing anything, with no expectation of coordination or homogeneity; i.e., one thread can control your toaster while another thread calculates your bank balance. We’ll come back to task parallelism in future postings.

Data parallelism is a special case of task parallelism, where the threads are not only executing the same operation at an instant, they’re also doing it cooperatively on portions of the same array(s). A simple example of data parallelism is scaling the values in an array by a scalar. For an array a and a scalar b, this could be expressed in Star-P’s MATLAB client as

c = a * b;

Many problems that depend on linear algebra for their solution wind up being readily expressed in this style, as do many FFT problems. Note that with this definition, the operation doesn’t have to be just element-wise; it could involve communication among the processors.

A common example is the sum operator, which adds together the elements in one dimension of an array, and whose result is one dimension smaller than the input. In the case of a sum in a distributed dimension of an array, that involves communication among the processors that own any elements of the array. All of the Star-P versions of data-parallel operators do the necessary communication implicitly, relieving the user of the burden of thinking of that level of detail.

These concepts allow lots of common MATLAB code constructs to run unchanged, once the initial matrices or arrays are defined as distributed. You might find it useful if your problem size has outgrown your desktop, say if you’re doing finite-element analysis and you wind up solving a matrix that’s grown very big. An example we frequently use is finding the eigenvector of a matrix, in this case a random matrix.

n = 10000*p;        % explicitly parallel with *p

A = rand(n, n);     % implicitly parallel from here
x = rand(n, 1);
y = zeros(size(x));
while norm(x-y) / norm(x) > 1e-11
    y = x;
    x = A*x;
    x = x / norm(x);
end;


In this example, the dimension n is multiplied by the symbolic variable p (hence “*p” or “Star-P”), denoting distribution across processors of the parallel server. The following lines, creating input arrays and doing linear algebra with them, could operate on either local or distributed data. A and x wind up being distributed because they’re based on n, which has the distributed attribute. y winds up distributed because the size function on the distributed array x returns the size with the distributed attribute (inherited from n). As A, x, and y are distributed, the rand, zeros, /, and * functions operate on the distributed data and create distributed objects (new versions of x and y) as output.

norm is a little different, as it produces a scalar result from any dimension input, so it can be used in conditional tests the way it is here. You can see how the original distributed attribute of n propagates to arrays that are created later. From a user point of view, just one identification of distribution or parallelism has enabled the entire code segment to run in parallel. Obviously most real programs don’t run on random data :), so initial data read in via I/O can also be done so that it’s distributed, with the same propagation effect as above.

Sparse matrices useful for less structured calculations

It may be easy for you to see how data parallel works for these common operations on dense matrices. Star-P also supports data parallel operations on sparse matrices, which can be very useful. For example, replacing the initial constructors of A and x above by sprand (the sparse random constructor) would result in all the arrays in the loop being distributed sparse. Sparse matrices are very useful for representing the stiffness matrices of finite element calculations and graphs for combinatorial calculations. We’ll revisit the usefulness of sparse matrices in more depth in a future posting.

    - Are sparse matrices important for your applications?
    - What do you use them for?
    - What operations do you need to work well on them?


Indexing results in communication on a parallel system
So far, the data parallel operations we’ve talked about have fixed patterns of using array elements, so they may seem somewhat formulaic. Indexing into arrays is a very general way to choose which elements will be used for an operation. And note that, in a parallel system, indexing can cause communication among the processors, which might be handy in certain situations. Consider an example from image processing where a q by r rectangular object resides in the image, and its coordinates are known. Another array with just the object can be created just by indexing into the larger distributed array, i.e.

object = image(i:i+q-1, j:j+r-1);

Similarly, indexing enables you to select elements that meet some criteria; for instance, we may want to select pressure values that correspond to temperatures above some threshold.

ndx = find(temperature, temperature>threshold);
selected_pressure = pressure(ndx);


Indexing has an interesting dual character; we can think of it just at the abstract level, where it does the intuitive thing we’d expect from the serial version of MATLAB, but we can also think a level deeper about how it moves data within the parallel computer. The point for now is that each distributed array is spread across all the processors in the parallel server, so the portion of image indicated by the indices might reside on just a few processors. Creating object redistributes the data to be spread across all the processors again, such that subsequent operations can have all processors participate since they’ll all have data, and similarly for ndx and selected_pressure. Indexing will be very useful for some future advanced topics.

    - Does your application readily map onto data-parallel constructs like these?
    - Are there special constraints or features you’d need to make them work for you?


Summary: In this posting we defined task- and data-parallelism and gave examples of data-parallelism from today’s Star-P constructs. Using these, many common MATLAB code sequences will work with no changes, once the input arrays are designated as distributed.



Article has 0 comments. Click To Read/Write Comments

“Computational Ecology”: Merging Wildlife and Electronic Circuits

Posted by Viral Shah



Researchers at the University of California, Santa Barbara (UCSB) are harnessing supercomputers and electronic circuit theory to help save wildlife from ever-shrinking habitats in an emerging scientific field called "computational ecology." The project is run by the University's National Center for Ecological Analysis and Synthesis (NCEAS)

Here's a little video summarizing this work:


.

NCEAS scientists are applying electronic circuit theory to model wildlife migration and gene flow across fragmented landscapes. The research could be instrumental in smart conservation planning, helping organizations decide which lands to preserve or restore - and where to best invest their tight conservation budgets - in order to preserve habitat and connectivity for wildlife populations.

Large Data Sets
Due to the massive volume of landscape data and the novel application of algorithms from circuit theory, NCEAS is working to speed up their code using state of the art sparse linear solvers, graph computations, vectorization and parallelization of their code. The result is a dramatic reduction in computing time from days to minutes on their 8-core server.

"It turns out that circuit theory shares a surprising number of properties with ecological theory describing animal movements and connectivity," says Brad McRae, the NCEAS project leader. "We can now represent landscapes as conductive surfaces - with features like forests and highways having different resistance to movement - and analyze connectivity across them using powerful circuit algorithms. Unlike standard conservation planning tools, these algorithms simultaneously incorporate all possible pathways when predicting how corridors, barriers, and other features affect movement and gene flow over large areas."


Corridors are areas that connect important habitats in human-altered landscapes. They provide natural avenues along which animals can travel, plants can propagate, genetic interchange can occur, species can move in response to environmental changes and natural disasters, and threatened populations can be replenished from other areas. A good example is "Y2Y," or the Yellowstone to Yukon corridor, where U.S. and Canadian conservation organizations are trying to identify which habitats to conserve to protect species from harmful decline or extinction.

Application to Multiple Species
In applying their software to these problems, NCEAS scientists have modeled mountain lion movements in Southern California to identify important connective habitats and corridors. In Central America they modeled how habitat connectivity affects gene flow among threatened populations of mahogany throughout the species' range. They are also analyzing connectivity among populations of wolverines, kit foxes and jaguars. For each species, researchers analyze geographic datasets representing habitat suitability over vast areas - in some cases spanning entire continents.

The challenge was choosing between how large or how finely-scaled the maps should be, explained McRae. "Even a relatively small region like the three-county area of Southern California can contain millions of raster cells, but our computing resources limited how finely we could grid those locations. While a mountain lion might perceive its habitat at a scale of about 100 meters, we originally had to increase the cell sizes to around a kilometer to keep our data requirements manageable," he said. "And even at these lower resolutions, running the models on a single-processor computer without optimized code took three days to complete."


Simulated connectivity among core habitat areas for mountain lions (courtesy Brett Dickson and Rick Hopkins, Live Oak Associates)

Working with Large Graphs
A key step of the NCEAS simulations is a computation on a large graph (or network) that represents the connectivity of the landscape. UCSB Computer Scientist Viral Shah worked with the NCEAS researchers to integrate their code with state of the art sparse linear solvers and the graph toolbox. Scientists can now model larger landscapes with much finer grids, while cutting computing time from days to minutes. The trend is for more applications to combine numerical and combinatorial methods to solve a problem, and tools like Star-P provide a convenient unified platform for numerical and combinatorial computation.




Related links:



Article has 0 comments. Click To Read/Write Comments

 
 

Subscribe by Email

Your email:
 
 

Latest Posts

 
 

Browse by Tag

 
 

Most Popular Posts