|
RSS Feed
The other week, I was interviewed on "Parallel Programming Talk Radio" by Aaron Tersteeg and Clay Breshears -- the show is 15 minutes and there are a few minutes of introduction before and a few after. The resulting ~10 minutes or so felt like an awfully short time to talk about Star-P! If you're new to Star-P, then this is perhaps a fast way to get an introduction, but if you already know what we're about, then there may not be enough new to make it worthwhile. On the other hand, since all of you are obviously parallel processors, why not play the interview, do your email and simultaneously run simulations in parallel on Star-P?
You can listen to the interview right here (see below) or if you'd like to click through BlogTalkRadio (and see what else goes on there), click here.
I attended IDF this past week and while there was quite a bit of noise about Nehalem and Intel's new 3GL parallel development tools, you can read about that stuff anywhere. I suppose you can also read somewhere about Jeffrey Katzenberg of Dreamworks giving a talk during the software keynote (about this)-- wow, I can't wait until we have that kind of marketing budget. There were polarized glasses under all our chairs and they slid out a movie theater sized screen to show 3D scenes from Kung Fu Panda. Kind of a different feel from Werner's Intel Cluster Ready talk. On entering the exhibition area, as usual, I went directly to the little booths at the back. The nice thing about the little booths is that you probably haven't seen what they are showing before and usually, you get to talk to people that are the ones directly working with the technology. The exhibit I found most interesting was one put on by an Intel "lablet" located somewhere in or near CMU. The overall project is called Diamond and it is directed at interactive search of non-indexed data. Richard Gass and Mei Chen were demonstrating an application built on top of Diamond called Interactive Search Assisted Decision Support (ISADS) which allows very large numbers of images to be searched to find ones similar to an image of interest. At the show, they had a special camera which they used to capture images of moles (yes, a charming photo of a mole on my forearm was taken...). The system searches a large database of other images to find a similar match. The search infrastructure lets the main application send "searchlets" to a distributed set of nodes that each search locally for matches. The fancy part is some interactivity on the balance between accuracy and number of matches. You don't get high performance without a good balance between CPU, memory and i/o. Often, the CPU side gets too much attention. It's always good to see people working on the full problem. Hopefully readers of this blog already know that Star-P supports parallel file i/o!
More info on the Diamond project can also be found here.
Article has 0 comments. Click To Read/Write Comments
Our partner in Israel, Emet Computing sent us this link. (Thanks Yoel!) High-Performance Computing Considered Harmful In this interview Greg Wilson talks about issues related to making HPC accessible to scientists, engineers and people who want to spend time solving problems with correct answers instead of pushing for peta-exa-zetaflops. I'll have to find an excuse to meet Greg at some point...
Article has 0 comments. Click To Read/Write Comments
Star-P's data parallel mode allows the user to run familiar M commands on very large datasets. One of the simplest, most widely use commands available to M users is sort. Star-P users have a parallel sort right at their fingertips. Lets take it for a spin. These results are on a shared memory Opteron system with 12 cores running Star-P. We create a vector with a billion random elements and sort it. This vector uses 8 GB of memory. >> a = rand(1e9*p, 1); >> tic; b = sort(a); toc Elapsed time is 40.712581 seconds.
Often, its not the sorted elements one wants, but the permutation that sorts the input. This takes a little longer, but is equally easily achieved with: >> tic; [b, perm] = sort(a); toc Elapsed time is 217.236630 seconds.
We performed some scalability tests on a couple of large systems. The following graph is results from sorting 100 billion elements on an Altix with 256 processors. This vector takes up almost 1TB of memory. Thats right. A TERABYTE. Star-P sorts it in a few minutes. 
We do not use shared memory systems exclusively for our scalability testing. The next graph shows results from a cluster with Gigabit Ethernet. Note, how the yellow line that shows communication time is higher than in the previous graph. On the Altix, communication is irrelevant. On a cluster, it is important to minimize communication. That is why we went to great lengths to minimize the amount of communication, which is almost optimal. If you are curious, you can see our paper on parallel sorting, or download the psort code (free for non-commercial, personal and research use) from which the Star-P sort is derived.
Article has 1 comments. Click To Read/Write Comments
One compelling reason to use Star-P is greatly improved productivity. Something that came as a surprise to us is the tremendous interest in Star-P internationally. Initially, I believed that the largest interest would be in developed economies with established engineering firms and universities such as US, UK, Germany, France, Japan, Australia etc. Thus, I was pleasantly surprised when I heard from our sales team about strong interest from emerging economies such as India, China, Brazil, Greece, Spain etc.
Countries like India and China are rapidly developing; playing catch-up definitely has its advantages. For instance, Asian countries have more modern telecommunication networks than the US. Starting late allows allocating resources to newer technology, and learning from the experiences of the trailblazers. Each country also faces its own unique set of challenges and competitive pressures. For example, Ajay Shah, one of India's top economists, says the following about cell phone companies in India: Roughly a decade ago, the standard engineering solutions that camefrom international telecom vendors induced prices for mobile telephonylike USD 0.1 per minute. In India, there was a unique bulge ofcustomers who were only available at lower prices. This market reality,coupled with competitive pressure, prompted Indian mobile phone vendorsto resort to an array of hardware and software innovations which haveinduced the lowest cost of mobile telephony in the world. Is something similar happening with high performance computing ? Are firms and universities in rest of the world leapfrogging older tools and technologies with a clean start ? Where will the next generation of innovations come from ?
A quick look at the statistics on the Top500 list shows that the developing nations are rapidly catching up. In 2001, China had 3 entries on the Top500, Brazil had 2, Russia had 1, and India had 0. Fast forward to 2007; we have China with 10, India with 9, Russia with 7, and Brazil with 1. India's placing at number 4 on Top500 in the latest list also generated some headlines. I wrote an opinion piece for the Financial Express in India with my views on the topic. The presence of the BRIC countries at the 2008 SIAM Parallel Processing meeting in Atlanta is more evidence on the state of affairs.
The cost of entry in high performance computing is quite high. After spending large budgets on Top500 computers, how do you program them ? The top US universities train some of the world's best parallel programmers. For an idea of what it takes to get started, see these classes at MIT, UC Santa Barbara, and UC Berkeley. Having been a teaching assistant for some of these, I believe that no more than 50 Computer Science and Engineering students take these classes every year at a given university. Thus, perhaps 500 students receive rigorous training in high performance scientific computing every year. So, how will the rest of the world program these computers ? Platforms such as Star-P bring high performance computing to the masses.
It is also refreshing to see that programs such as the DARPA/DOE HPCS are emphasizing productivity along with performance. This has resulted in three new languages: IBM's X10, Sun's Fortress, and Cray's Chapel. For a systematic approach to measuring productivity, see the work published by the HPCS productivity team. My views on the subject, of course, are in my thesis. Are programmers going to embrace these brand new programming languages that lack the kind of library support and rich experience that desktop environments such as MATLAB, Mathematica, R, and Python provide ? We'll have to wait a few years for the answer. In the meanwhile, platforms such as Star-P are rapidly bridging the gap between productivity and performance.
Article has 0 comments. Click To Read/Write Comments
Readers of this blog may recall an earlier posting on Circuitscape and landscape ecology. Shortly, Circuitscape uses circuit theory to predict patterns of connectivity, movement, and gene flow among plant and animal populations in heterogeneous landscapes. Circuitscape was originally written in Java. It has since been rewritten in MATLAB® and works without modification in Star-P.
UC Santa Barbara's College of Engineering featured our work in the Spring 2008 issue of Convergence magazine. Convergence is an award winning magazine with a circulation of roughly 20,000. This article invoked interest among faculty at UCSB creating opportunities for further collaboration. Watch this space for more updates on this exciting interdisciplinary project. 
Article has 0 comments. Click To Read/Write Comments
As data sets grow, and algorithms grow increasingly complex, there’s a need by engineers, scientists and analysts to increase performance. The first step is often to re-write their algorithm – originally coded in MATLAB® or another very high level language (VHLL) – into a lower level language, C, C++, or Fortran. A typical project may take several months, and result in a 5-10X performance gain on a typical workstation (Option 1 below).
Option 1: Port to C++
- ~6 person-months (~$120,000)
- 5-10X gain in performance on a single processor
- Calendar time to solution: 6 months
- Does not scale beyond a single processor without further work
In Option 1, it should be noted that this serial programming effort does not scale “for free” beyond a single processor. So while the 5-10X gain is a perfectly acceptable target for many projects, those needing more performance will need to take a different approach. To increase performance further, one can turn to clustered servers. Of course, this typically involves some degree of parallel programming, with the relatively low-level paradigm of message passing (MPI or OpenMP). Here’s some data from a recent survey we carried out, asking 25 organizations about their MPI-based development efforts; presented below are the histograms of team size, and project length. While parallel programming projects vary widely, it is common to see teams of several engineers working 1-2 years.  So, let’s consider a fairly typical example, when the required computing power outstrips a single desktop, and the decision is made to develop an MPI-based application running on HPC clusters. Option 2: Port to C++, with message passing (MPI)- Total incremental investment: $1,000,000
- ~48 person-months (~$900,000)
- 60X gain on a 128-core server (~$100,000)
- Calendar time to solution: 12-18 months
- Scales with more hardware, if higher performance is desired
Recently, a new programming paradigm has become available: Using existing VHLL code developed on a desktop, but extended to HPC server clusters with the Star-P software platform. This approach eliminates the C/MPI programming, and instead requires some incremental coding in the familiar MATLAB environment, leveraging much of the application’s existing code base. One can learn the handful of tags and commands within several days, and within a short number of weeks typical codes can be parallelized to run on the cluster. For a number of reasons, the processor utilization and compute efficiency may not be as high as on a “hand-tuned” custom MPI code, so a somewhat larger server may be necessary. (Fortunately, hardware is cheap, and getting cheaper.) Here’s how the numbers come out for the typical case: Option 3: Star-P extends MATLAB® to HPC Servers w/o MPI - Total incremental investment: $270,000
- 1 person-month (~$20,000)
- 60X gain on a 256-core server (~$200,000)
- Star-P license ($50,000)
- Calendar time to solution: 1 month
- Scales with more hardware, if higher performance is desired
So this new programming model offers us the flexibility to trade off labor cost and time savings versus hardware costs. Because many projects are constrained by calendar time and available technical resources, a solution such as Star-P offers a way to radically transform the “cost of performance” equation. Furthermore, this assessment only covers the short-term costs of the parallel port. In fact, most software costs are in maintenance of a code over time. In that case, the VHLL benefits of Star-P (faster and hence cheaper software development) will continue to pay off time after time, whereas the MPI-based approach will continue to cost substantially more. I am curious to hear your feedback on the argument laid out here, and how it may relate to your projects: - What do you do to increase performance for codes written in MATLAB®, Python, R, and other VHLLs?
- How long do your parallel ports take, with what size team?
- What are your thoughts on the notion of trading hardware efficiency for calendar time and labor costs?
Article has 1 comments. Click To Read/Write Comments
Star-P is a client-server system. The client, running on your desktop, tells the high-performance server what to do, and the server spends its time happily crunching numbers. The server doesn’t like to be micromanaged: it works the best when it can work on large chunks of data all on its own. It is "expensive" for client and server to communicate: after all, the client-server connection is often the slowest link in the hardware.
To illustrate the point, I’m showing the graph output of Star-P’s ppperf profiling tool. The X-axis is time, and the Y-axis indicates the activity spent on the client, on the network, and on the server, for a given code segment. Going from left to right, it shows the communications activity - middle graph, in red - when a 72 MB matrix is sent from client to server. Then, the bottom (blue) graph shows the server activity increase in order to operate on the matrix. As you can see it takes more time to transmit the data than to do the actual computation.

In this first installment of Performance Tuning 101 series, we’ll discuss ways to reduce client-server communications traffic. Let’s first create some data using the *p tag: for those unfamiliar with Star-P syntax, it causes arrays to be created on the parallel server and distributed over CPUs.
x = rand(1,n*p);y=zeros(1,n*p);
By creating the data this way - directly on the server – we avoid data traffic through the client-server bottleneck.
Now, let’s do some number-crunching on these arrays to show the basic technique to avoid client-server traffic: vectorization. Compare element-wise multiplication in a for-loop
for idx=1:n y(idx) = 2*x(idx); end
to a vectorized multiplication
y = 2*x;
In the for-loop, a command is sent from client to server in each iteration. This totals 1000 client-server calls, slowing down the code. Not only is the vectorized code simpler, but it requires only one client-server call.
 Vectorization not only reduces client-server communications - it also lets you perform parallel operations on large arrays. Consider the following code that could be used to model 3-D random walks:
X=zeros(3,15000,15000); for idx1=1:15000 y = randn(3,15000); for idx2=1:15000 x(1,idx1) = x(1,idx1) + y(1,idx2); x(2,idx1) = x(2,idx1) + y(2,idx2); x(3,idx1) = x(3,idx1) + y(3,idx2); end end
This code took about 105 seconds to run in serial using regular MATLAB®. Just imagine how long it would take on a client-server platform! However, we can replace the for-loops by vectorized and parallelized operations:
X = sum(randn(3,15000,15000*p))
The speed-up is dramatic, this computation took only 4.5 seconds to run using 8 CPUs. In a sense we are trading memory for speed: we avoid the for-loop by constructing an extra 675 million element, 5.4 gigabyte array. This is where Star-P’s distributed arrays become very useful: a regular desktop might not have enough memory, so you’re forced to use the slower for-loops, whereas on a parallel server with large memory, you are free to restructure your code to use vectorized constructs.
So far, we have learned a couple of important lessons:
• Client-server communications is often the "weakest link" of a client-server system. • When possible, avoid data traffic between client and server. • Use vectorization.
In our next installments, we’ll take a look what’s happening on the server: what impact do your application types, data sizes and parallel server architectures have on performance tuning.
Article has 0 comments. Click To Read/Write Comments
Last time I
mentioned that there were several dimensions in which the current ppeval mechanism could be generalized
and become more useful. We’ve thought
about stencil operations, that is,
operations that use a specific set of elements of an array repeatedly, and how
to extend task parallelism to support them.
[NOTE: These are just ideas, and
do not exist in the current Star-P and may never exist in a future Star-P
either. The new directives in the
example below are in blue to denote “blue sky” thinking.]
The following figure shows two example stencil operations on a 2D rectangular grid (in blue). The first, in pink, is known as a five-point stencil, and involves an element of the array and its north, south, east, and west neighbors.  The MATLAB® code below expresses a simple averaging operation, where the element itself is weighted as heavily as the sum of the four neighboring elements. newx(i,j)= 4*x(i,j) + x(i-1,j) + x(i,j-1) + x(i,j+1)+ x(i+1,j)
Those neighboring elements are sometimes known as a halo around the central element. Typically the averaging or weighting operation would be applied to each element in the array, with some special handling around the boundary to avoid indexing out of bounds. The second example, in green, is known as a nine-point stencil, and additionally involves the northwest, northeast, southeast, and southwest neighbors. The code uses the MATLAB blkproc function, which implements stencil operations itself via the supplied function fn.
newy = blkproc(x,[3 3],fn)
with fn defined in this case as:
function z = fn(x) z = sum(x(:))+7*x(2,2);
You can mentally extend these stencils to their 3D analogues
as well.
- Are stencils important to your codes?
- Does this description of stencils cover your usage, or are there other angles that I’ve missed?
The very fact of the stencil, or neighborhood, of elements is what requires some extra thought when it comes to parallelizing this construct. For parallelism to accelerate the operation, some of the elements of the array will need to reside on the memory associated with each processor core. When that core gets to the boundary elements, it will need to use neighboring elements that reside on another core. How does it get access to those elements, and how does it know that it is getting the most recent values of those elements? These are the hard points in parallelizing this construct. The following example illustrates both these hard points. It’s a perhaps-nonsensical extension of the MATLAB conv2 routine to include an extra argument that denotes the number of convolution steps to take. conv2 calculates the convolution of each element of its first argument a with the neighborhood represented by its second argument b. A simplified version of this code, that doesn’t scrupulously account for all end-cases, is shown following. function z = conv2(a,b,k) [an am] = size(a); [bn bm] = size(b); bn = fix(bn/2); bm = fix(bm/2); oldz = a; for ii=1:k for i=1:an for j=1:am %ignore out-of-bounds refs to oldz z(i,j) = oldz(i-bn:i+bn,j-bm:j+bm).*b; end end oldz = z; end
Annotating stencils for parallel executionWe want the i and j loops to run in parallel, but the ii loop cannot, as it depends on the results of the prior convolution. What is the minimum we could add to this code sequence to make it run correctly and fast, in parallel? For simplicity of expression, let’s assume we have directives that can express what we need, instead of having to rewrite the code itself. It’s clear we need something to denote that the i/j loops should run in parallel, but that would be true even if there were no stencil operation inside them. The added wrinkle of the stencil is that we need some way to ensure that the “ oldz = z” line results in each core getting the latest value of z from its neighbors before proceeding to the next iteration of the ii loop. One could imagine an update directive that would provide this functionality. The resulting code might look something like (as-yet-unimplemented directives in blue): function z = conv2(a,b,k) [an am] = size(a); [bn bm] = size(b); bn = fix(bn/2); bm = fix(bm/2); oldz = a; %%$starp tparallel bcast(b), halo(oldz,[bn bm]) for ii=1:k for i=1:an for j=1:am %ignore out-of-bounds refs to oldz z(i,j) = oldz(i-bn:i+bn,j-bm:j+bm).*b; end end oldz = z; %%$starp update(oldz) end %%$starp tparallel end
Note that in this specific case, the Star-P version of the conv2 routine would have this parallelism implemented internally, so you as a user wouldn’t have to think about it. But obviously you might have your own code where you need to implement something similar. - Do you see this as a reasonable interface for changing your code to specify parallelism?
- Does this overlook angles that need to be addressed?
Summary: We discussed one means of extending the current ppeval mechanism for a specific type of fixed-pattern computations.
Article has 0 comments. Click To Read/Write Comments
Last week's SC07 show in Reno, NV was a great time to connect with friends, customers, partners, and other folks interested in developing parallel applications for high performance computers.
ISC's Virtual Tour
In addition to our booth, we participated in several interesting sessions, such as the ClearSpeed User Forum, HPC Challenge, Cray's booth, Parallel MATLAB® session, and others. In case you missed something of interest, we've created a Virtual Tour of our SC07 activities. Here, you will findthe various presentations, product demonstrations, news announcements, customer videos, and awards.
P.S.: here's Ronnie setting up the booth, Aquil demonstrating an image processing app, and me with Austin Powers (told me later he prefers Python over other Very High Level Languages).

Article has 0 comments. Click To Read/Write Comments
Previous Page | Next Page
|