|
RSS Feed
Star-P's data parallel mode allows the user to run familiar M commands on very large datasets. One of the simplest, most widely use commands available to M users is sort. Star-P users have a parallel sort right at their fingertips. Lets take it for a spin. These results are on a shared memory Opteron system with 12 cores running Star-P. We create a vector with a billion random elements and sort it. This vector uses 8 GB of memory. >> a = rand(1e9*p, 1); >> tic; b = sort(a); toc Elapsed time is 40.712581 seconds.
Often, its not the sorted elements one wants, but the permutation that sorts the input. This takes a little longer, but is equally easily achieved with: >> tic; [b, perm] = sort(a); toc Elapsed time is 217.236630 seconds.
We performed some scalability tests on a couple of large systems. The following graph is results from sorting 100 billion elements on an Altix with 256 processors. This vector takes up almost 1TB of memory. Thats right. A TERABYTE. Star-P sorts it in a few minutes. 
We do not use shared memory systems exclusively for our scalability testing. The next graph shows results from a cluster with Gigabit Ethernet. Note, how the yellow line that shows communication time is higher than in the previous graph. On the Altix, communication is irrelevant. On a cluster, it is important to minimize communication. That is why we went to great lengths to minimize the amount of communication, which is almost optimal. If you are curious, you can see our paper on parallel sorting, or download the psort code (free for non-commercial, personal and research use) from which the Star-P sort is derived.
Article has 1 comments. Click To Read/Write Comments
One compelling reason to use Star-P is greatly improved productivity. Something that came as a surprise to us is the tremendous interest in Star-P internationally. Initially, I believed that the largest interest would be in developed economies with established engineering firms and universities such as US, UK, Germany, France, Japan, Australia etc. Thus, I was pleasantly surprised when I heard from our sales team about strong interest from emerging economies such as India, China, Brazil, Greece, Spain etc.
Countries like India and China are rapidly developing; playing catch-up definitely has its advantages. For instance, Asian countries have more modern telecommunication networks than the US. Starting late allows allocating resources to newer technology, and learning from the experiences of the trailblazers. Each country also faces its own unique set of challenges and competitive pressures. For example, Ajay Shah, one of India's top economists, says the following about cell phone companies in India: Roughly a decade ago, the standard engineering solutions that camefrom international telecom vendors induced prices for mobile telephonylike USD 0.1 per minute. In India, there was a unique bulge ofcustomers who were only available at lower prices. This market reality,coupled with competitive pressure, prompted Indian mobile phone vendorsto resort to an array of hardware and software innovations which haveinduced the lowest cost of mobile telephony in the world. Is something similar happening with high performance computing ? Are firms and universities in rest of the world leapfrogging older tools and technologies with a clean start ? Where will the next generation of innovations come from ?
A quick look at the statistics on the Top500 list shows that the developing nations are rapidly catching up. In 2001, China had 3 entries on the Top500, Brazil had 2, Russia had 1, and India had 0. Fast forward to 2007; we have China with 10, India with 9, Russia with 7, and Brazil with 1. India's placing at number 4 on Top500 in the latest list also generated some headlines. I wrote an opinion piece for the Financial Express in India with my views on the topic. The presence of the BRIC countries at the 2008 SIAM Parallel Processing meeting in Atlanta is more evidence on the state of affairs.
The cost of entry in high performance computing is quite high. After spending large budgets on Top500 computers, how do you program them ? The top US universities train some of the world's best parallel programmers. For an idea of what it takes to get started, see these classes at MIT, UC Santa Barbara, and UC Berkeley. Having been a teaching assistant for some of these, I believe that no more than 50 Computer Science and Engineering students take these classes every year at a given university. Thus, perhaps 500 students receive rigorous training in high performance scientific computing every year. So, how will the rest of the world program these computers ? Platforms such as Star-P bring high performance computing to the masses.
It is also refreshing to see that programs such as the DARPA/DOE HPCS are emphasizing productivity along with performance. This has resulted in three new languages: IBM's X10, Sun's Fortress, and Cray's Chapel. For a systematic approach to measuring productivity, see the work published by the HPCS productivity team. My views on the subject, of course, are in my thesis. Are programmers going to embrace these brand new programming languages that lack the kind of library support and rich experience that desktop environments such as MATLAB, Mathematica, R, and Python provide ? We'll have to wait a few years for the answer. In the meanwhile, platforms such as Star-P are rapidly bridging the gap between productivity and performance.
Article has 0 comments. Click To Read/Write Comments
Readers of this blog may recall an earlier posting on Circuitscape and landscape ecology. Shortly, Circuitscape uses circuit theory to predict patterns of connectivity, movement, and gene flow among plant and animal populations in heterogeneous landscapes. Circuitscape was originally written in Java. It has since been rewritten in MATLAB® and works without modification in Star-P.
UC Santa Barbara's College of Engineering featured our work in the Spring 2008 issue of Convergence magazine. Convergence is an award winning magazine with a circulation of roughly 20,000. This article invoked interest among faculty at UCSB creating opportunities for further collaboration. Watch this space for more updates on this exciting interdisciplinary project. 
Article has 0 comments. Click To Read/Write Comments
As data sets grow, and algorithms grow increasingly complex, there’s a need by engineers, scientists and analysts to increase performance. The first step is often to re-write their algorithm – originally coded in MATLAB® or another very high level language (VHLL) – into a lower level language, C, C++, or Fortran. A typical project may take several months, and result in a 5-10X performance gain on a typical workstation (Option 1 below).
Option 1: Port to C++
- ~6 person-months (~$120,000)
- 5-10X gain in performance on a single processor
- Calendar time to solution: 6 months
- Does not scale beyond a single processor without further work
In Option 1, it should be noted that this serial programming effort does not scale “for free” beyond a single processor. So while the 5-10X gain is a perfectly acceptable target for many projects, those needing more performance will need to take a different approach. To increase performance further, one can turn to clustered servers. Of course, this typically involves some degree of parallel programming, with the relatively low-level paradigm of message passing (MPI or OpenMP). Here’s some data from a recent survey we carried out, asking 25 organizations about their MPI-based development efforts; presented below are the histograms of team size, and project length. While parallel programming projects vary widely, it is common to see teams of several engineers working 1-2 years.  So, let’s consider a fairly typical example, when the required computing power outstrips a single desktop, and the decision is made to develop an MPI-based application running on HPC clusters. Option 2: Port to C++, with message passing (MPI)- Total incremental investment: $1,000,000
- ~48 person-months (~$900,000)
- 60X gain on a 128-core server (~$100,000)
- Calendar time to solution: 12-18 months
- Scales with more hardware, if higher performance is desired
Recently, a new programming paradigm has become available: Using existing VHLL code developed on a desktop, but extended to HPC server clusters with the Star-P software platform. This approach eliminates the C/MPI programming, and instead requires some incremental coding in the familiar MATLAB environment, leveraging much of the application’s existing code base. One can learn the handful of tags and commands within several days, and within a short number of weeks typical codes can be parallelized to run on the cluster. For a number of reasons, the processor utilization and compute efficiency may not be as high as on a “hand-tuned” custom MPI code, so a somewhat larger server may be necessary. (Fortunately, hardware is cheap, and getting cheaper.) Here’s how the numbers come out for the typical case: Option 3: Star-P extends MATLAB® to HPC Servers w/o MPI - Total incremental investment: $270,000
- 1 person-month (~$20,000)
- 60X gain on a 256-core server (~$200,000)
- Star-P license ($50,000)
- Calendar time to solution: 1 month
- Scales with more hardware, if higher performance is desired
So this new programming model offers us the flexibility to trade off labor cost and time savings versus hardware costs. Because many projects are constrained by calendar time and available technical resources, a solution such as Star-P offers a way to radically transform the “cost of performance” equation. Furthermore, this assessment only covers the short-term costs of the parallel port. In fact, most software costs are in maintenance of a code over time. In that case, the VHLL benefits of Star-P (faster and hence cheaper software development) will continue to pay off time after time, whereas the MPI-based approach will continue to cost substantially more. I am curious to hear your feedback on the argument laid out here, and how it may relate to your projects: - What do you do to increase performance for codes written in MATLAB®, Python, R, and other VHLLs?
- How long do your parallel ports take, with what size team?
- What are your thoughts on the notion of trading hardware efficiency for calendar time and labor costs?
Article has 1 comments. Click To Read/Write Comments
Star-P is a client-server system. The client, running on your desktop, tells the high-performance server what to do, and the server spends its time happily crunching numbers. The server doesn’t like to be micromanaged: it works the best when it can work on large chunks of data all on its own. It is "expensive" for client and server to communicate: after all, the client-server connection is often the slowest link in the hardware.
To illustrate the point, I’m showing the graph output of Star-P’s ppperf profiling tool. The X-axis is time, and the Y-axis indicates the activity spent on the client, on the network, and on the server, for a given code segment. Going from left to right, it shows the communications activity - middle graph, in red - when a 72 MB matrix is sent from client to server. Then, the bottom (blue) graph shows the server activity increase in order to operate on the matrix. As you can see it takes more time to transmit the data than to do the actual computation.

In this first installment of Performance Tuning 101 series, we’ll discuss ways to reduce client-server communications traffic. Let’s first create some data using the *p tag: for those unfamiliar with Star-P syntax, it causes arrays to be created on the parallel server and distributed over CPUs.
x = rand(1,n*p);y=zeros(1,n*p);
By creating the data this way - directly on the server – we avoid data traffic through the client-server bottleneck.
Now, let’s do some number-crunching on these arrays to show the basic technique to avoid client-server traffic: vectorization. Compare element-wise multiplication in a for-loop
for idx=1:n y(idx) = 2*x(idx); end
to a vectorized multiplication
y = 2*x;
In the for-loop, a command is sent from client to server in each iteration. This totals 1000 client-server calls, slowing down the code. Not only is the vectorized code simpler, but it requires only one client-server call.
 Vectorization not only reduces client-server communications - it also lets you perform parallel operations on large arrays. Consider the following code that could be used to model 3-D random walks:
X=zeros(3,15000,15000); for idx1=1:15000 y = randn(3,15000); for idx2=1:15000 x(1,idx1) = x(1,idx1) + y(1,idx2); x(2,idx1) = x(2,idx1) + y(2,idx2); x(3,idx1) = x(3,idx1) + y(3,idx2); end end
This code took about 105 seconds to run in serial using regular MATLAB®. Just imagine how long it would take on a client-server platform! However, we can replace the for-loops by vectorized and parallelized operations:
X = sum(randn(3,15000,15000*p))
The speed-up is dramatic, this computation took only 4.5 seconds to run using 8 CPUs. In a sense we are trading memory for speed: we avoid the for-loop by constructing an extra 675 million element, 5.4 gigabyte array. This is where Star-P’s distributed arrays become very useful: a regular desktop might not have enough memory, so you’re forced to use the slower for-loops, whereas on a parallel server with large memory, you are free to restructure your code to use vectorized constructs.
So far, we have learned a couple of important lessons:
• Client-server communications is often the "weakest link" of a client-server system. • When possible, avoid data traffic between client and server. • Use vectorization.
In our next installments, we’ll take a look what’s happening on the server: what impact do your application types, data sizes and parallel server architectures have on performance tuning.
Article has 0 comments. Click To Read/Write Comments
Last time I
mentioned that there were several dimensions in which the current ppeval mechanism could be generalized
and become more useful. We’ve thought
about stencil operations, that is,
operations that use a specific set of elements of an array repeatedly, and how
to extend task parallelism to support them.
[NOTE: These are just ideas, and
do not exist in the current Star-P and may never exist in a future Star-P
either. The new directives in the
example below are in blue to denote “blue sky” thinking.]
The following figure shows two example stencil operations on a 2D rectangular grid (in blue). The first, in pink, is known as a five-point stencil, and involves an element of the array and its north, south, east, and west neighbors.  The MATLAB® code below expresses a simple averaging operation, where the element itself is weighted as heavily as the sum of the four neighboring elements. newx(i,j)= 4*x(i,j) + x(i-1,j) + x(i,j-1) + x(i,j+1)+ x(i+1,j)
Those neighboring elements are sometimes known as a halo around the central element. Typically the averaging or weighting operation would be applied to each element in the array, with some special handling around the boundary to avoid indexing out of bounds. The second example, in green, is known as a nine-point stencil, and additionally involves the northwest, northeast, southeast, and southwest neighbors. The code uses the MATLAB blkproc function, which implements stencil operations itself via the supplied function fn.
newy = blkproc(x,[3 3],fn)
with fn defined in this case as:
function z = fn(x) z = sum(x(:))+7*x(2,2);
You can mentally extend these stencils to their 3D analogues
as well.
- Are stencils important to your codes?
- Does this description of stencils cover your usage, or are there other angles that I’ve missed?
The very fact of the stencil, or neighborhood, of elements is what requires some extra thought when it comes to parallelizing this construct. For parallelism to accelerate the operation, some of the elements of the array will need to reside on the memory associated with each processor core. When that core gets to the boundary elements, it will need to use neighboring elements that reside on another core. How does it get access to those elements, and how does it know that it is getting the most recent values of those elements? These are the hard points in parallelizing this construct. The following example illustrates both these hard points. It’s a perhaps-nonsensical extension of the MATLAB conv2 routine to include an extra argument that denotes the number of convolution steps to take. conv2 calculates the convolution of each element of its first argument a with the neighborhood represented by its second argument b. A simplified version of this code, that doesn’t scrupulously account for all end-cases, is shown following. function z = conv2(a,b,k) [an am] = size(a); [bn bm] = size(b); bn = fix(bn/2); bm = fix(bm/2); oldz = a; for ii=1:k for i=1:an for j=1:am %ignore out-of-bounds refs to oldz z(i,j) = oldz(i-bn:i+bn,j-bm:j+bm).*b; end end oldz = z; end
Annotating stencils for parallel executionWe want the i and j loops to run in parallel, but the ii loop cannot, as it depends on the results of the prior convolution. What is the minimum we could add to this code sequence to make it run correctly and fast, in parallel? For simplicity of expression, let’s assume we have directives that can express what we need, instead of having to rewrite the code itself. It’s clear we need something to denote that the i/j loops should run in parallel, but that would be true even if there were no stencil operation inside them. The added wrinkle of the stencil is that we need some way to ensure that the “ oldz = z” line results in each core getting the latest value of z from its neighbors before proceeding to the next iteration of the ii loop. One could imagine an update directive that would provide this functionality. The resulting code might look something like (as-yet-unimplemented directives in blue): function z = conv2(a,b,k) [an am] = size(a); [bn bm] = size(b); bn = fix(bn/2); bm = fix(bm/2); oldz = a; %%$starp tparallel bcast(b), halo(oldz,[bn bm]) for ii=1:k for i=1:an for j=1:am %ignore out-of-bounds refs to oldz z(i,j) = oldz(i-bn:i+bn,j-bm:j+bm).*b; end end oldz = z; %%$starp update(oldz) end %%$starp tparallel end
Note that in this specific case, the Star-P version of the conv2 routine would have this parallelism implemented internally, so you as a user wouldn’t have to think about it. But obviously you might have your own code where you need to implement something similar. - Do you see this as a reasonable interface for changing your code to specify parallelism?
- Does this overlook angles that need to be addressed?
Summary: We discussed one means of extending the current ppeval mechanism for a specific type of fixed-pattern computations.
Article has 0 comments. Click To Read/Write Comments
Last week's SC07 show in Reno, NV was a great time to connect with friends, customers, partners, and other folks interested in developing parallel applications for high performance computers.
ISC's Virtual Tour
In addition to our booth, we participated in several interesting sessions, such as the ClearSpeed User Forum, HPC Challenge, Cray's booth, Parallel MATLAB® session, and others. In case you missed something of interest, we've created a Virtual Tour of our SC07 activities. Here, you will findthe various presentations, product demonstrations, news announcements, customer videos, and awards.
P.S.: here's Ronnie setting up the booth, Aquil demonstrating an image processing app, and me with Austin Powers (told me later he prefers Python over other Very High Level Languages).

Article has 0 comments. Click To Read/Write Comments
This week at the SC07 conference, we won the HPC Challenge. Judged by a committee led by Jack Dongarra (University of Tennessee, and Oak Ridge National Lab) and Jeremy Kepner (MIT Lincoln Lab), the goal of the competition is to focus the HPC community’s attention on developing a broad set of HPC hardware and HPC software capabilities that are necessary to productively use HPC systems.
Our entry this year was done in the Python language, extended to a parallel environment with Star-P. We won in the category of "Most Productivity" - based 50% on performance, and 50% on code elegance, clarity, and size.
Python is a high-level, dynamically typed, multi-paradigm (procedural, objectoriented and functional) interpreted language created by Guido Van Russom. Natively, Python does not have data types and containers such as matrices and lacks linear algebra. and signal processing functions. Instead, these are added to the language through Python extension modules. Currently, the de facto Python module for numerical computing is NumPy authored by Travis Oliphant and others.
The Star-P package in Python is an extension module that can be imported just like any of the modules in the Python standard library. The syntax and semantics in Star-P Python closely model those in NumPy.
The basic premise of the model is to maintain compatibility in syntax with serial codes written using the NumPy module. In most cases, the user must not be burdened with having to think “in parallel”, keep track of distributions or worry about which portions of the code runs in serial and which in parallel. This allows users with a large existing serial application to port it to run in parallel with the least amount of effort.
The tests were run at a high performance cluster at the San Diego Supercomputing Center. The cluster consists of 32 nodes, where each node contains one quad-core Intel Xeon 5140 processor and 8 GB of memory, for a total of 128 cores and 256 GB of memory. The interconnect is IP over Infiniband.
Here are the scaling results for 3 of benchmarks. The full submission and presentation can be found here.

Article has 1 comments. Click To Read/Write Comments
Greetings from the SC07 Conference in Reno, NV! The show kicks off later today, and is shaping up to be pretty exciting.
One thing we are psyched about is our collaboration with Cray, where we are working together to enable Star-P to run on Cray's XT4 supercomputers. (Cray announced this earlier this morning.) The cool thing here is that it enables a whole new way to run applications on a Cray system, and makes Cray an extension of the scientist's desktop. Imagine coding an algorithm in a desktop tool such as Python, MATLAB®, R, and others, and then running the code - with potentially enormous data sets - on a Cray system, without coding a line of C++, Fortran, and MPI.
We got some work ahead of us to deliver this to early adopters in Q1 of 2008, but already have some pieces of Star-P's port to Cray working. One thing that makes it a pretty reasonable effort is that it's built on standard x86-64 chipset, running Linux.
Initial Porting Results As a proof of concept, we tweaked a couple things in Star-P and compiled it for the Cray, and ran a couple toy problems on (perhaps the world's smallest) Cray supercomputer :) - a 32-core XT4. The neat thing here is that it worked, and scaled nicely. For this test, we took 64 matrices of 3 different sizes (1000 x 1000, 1500 x 1500, and 2000 x 2000), and found the 2 most correlated vectors in each matrix, all done from Python (one of the very high-level languages Star-P supports).
We ran each problem on 1, 2, 4, 8, 16, and 32 processors of the XT4 system. Looks like pretty good scaling for a 1st run. The timing curve for the smallest problem (1000 x 1000 matrix size) starts to lean over slightly at 32 processors, likely due to the communication overhead associated with distributing the relatively small data set across 32 cores.

Article has 1 comments. Click To Read/Write Comments
We have just closed a financing round (Series B, $11M), and folks ask me what we’re planning to do with the funds. Broadly speaking, this falls into 2 categories: customer-driven product enhancement, and the distribution channel.
Our Engineering team has a lot of cool things in the works, but here’s a quick overview of some of these:
- Our largest investment will be to maintain the highest level of abstraction while dramatically increasing the run-time performance and global memory capability - well beyond what is typically encountered with interpreted very high level scripting languages typically used by domain experts (engineers, scientists, analysts).
- Broader language support: there’s a handful of popular desktop apps – the very high level languages (MATLAB, Python, R, Mathematica, etc.) – and the funding will help us accelerate development of our multi-language platform, in terms of both language breadth, and functional coverage within each language.
- Advances in automatic parallelization: having been in the marketplace for a couple of years now, we have confirmed the fact that scientists/engs/analysts want to move their algorithms to deployment on the fastest, biggest systems without the delays of large modficiations to their codes. As a result, we will be investing more in state-of-the-art automatic parallelization, to make it increasingly easier to take an algorithm developed on a desktop PC, and optimize it for running on a large cluster.
- Scaling to larger high performance computers: whereas today’s workgroup servers may contain 8-32 processors, tomorrow’s will contain hundreds and even thousands – enabled by microprocessor advancements, and demanded by increasingly growing data sets (e.g., from increasingly higher resolution MRIs, larger gene databases, hedge fund tick databases, etc.). With this funding, we will push the scaling limits up, in terms of both scaling efficiency, and in absolute terms.
On the channel side of things, we have a number of authorized resellers around the world, and will be working to expand this, in several key categories (more on this in the near future):
Article has 0 comments. Click To Read/Write Comments
Previous Page | Next Page
|