As data sets grow and algorithms become more complex, engineers, scientists and analysts need more performance. The first step is often to rewrite the algorithm, originally coded in MATLAB® or another very high level language (VHLL), in a lower level language such as C, C++, or Fortran. A typical project may take several months and yield a 5-10X performance gain on a typical workstation (Option 1 below).
Option 1: Port to C++
- ~6 person-months (~$120,000)
- 5-10X gain in performance on a single processor
- Calendar time to solution: 6 months
- Does not scale beyond a single processor without further work
Note that in Option 1 the serial programming effort does not scale “for free” beyond a single processor. So while the 5-10X gain is a perfectly acceptable target for many projects, those needing more performance will need to take a different approach. To increase performance further, one can turn to clustered servers. Of course, this typically involves some degree of parallel programming, with the relatively low-level paradigm of message passing (MPI). Here’s some data from a recent survey we carried out, asking 25 organizations about their MPI-based development efforts; presented below are the histograms of team size and project length. While parallel programming projects vary widely, it is common to see teams of several engineers working 1-2 years.
So, let’s consider a fairly typical example, in which the required computing power outstrips a single desktop and the decision is made to develop an MPI-based application running on HPC clusters.
Option 2: Port to C++, with message passing (MPI)
- Total incremental investment: $1,000,000
- ~48 person-months (~$900,000)
- 60X gain on a 128-core server (~$100,000)
- Calendar time to solution: 12-18 months
- Scales with more hardware, if higher performance is desired
Recently, a new programming paradigm has become available: using existing VHLL code developed on a desktop, but extended to HPC server clusters with the Star-P software platform. This approach eliminates the C/MPI programming, and instead requires some incremental coding in the familiar MATLAB environment, leveraging much of the application’s existing code base. One can learn the handful of tags and commands within several days, and within a few weeks typical codes can be parallelized to run on the cluster. For a number of reasons, the processor utilization and compute efficiency may not be as high as with a “hand-tuned” custom MPI code, so a somewhat larger server may be necessary. (Fortunately, hardware is cheap, and getting cheaper.) Here’s how the numbers come out for the typical case:
Option 3: Star-P extends MATLAB® to HPC servers without MPI
- Total incremental investment: $270,000
- 1 person-month (~$20,000)
- 60X gain on a 256-core server (~$200,000)
- Star-P license ($50,000)
- Calendar time to solution: 1 month
- Scales with more hardware, if higher performance is desired
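To put the three options side by side, here is a minimal Python sketch that totals the figures quoted above and divides by the quoted speedup. The dollar amounts and speedups come from the lists above; the midpoints (7.5X for the 5-10X range, 15 months for 12-18 months), the zero hardware cost for Option 1, and every name in the code are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    labor_usd: int
    hardware_usd: int
    license_usd: int
    speedup: float          # performance gain over the original desktop code
    calendar_months: float

    @property
    def total_usd(self) -> int:
        # Total incremental investment: labor + hardware + software license.
        return self.labor_usd + self.hardware_usd + self.license_usd

    @property
    def usd_per_speedup(self) -> float:
        # Dollars spent per 1X of performance gain.
        return self.total_usd / self.speedup


OPTIONS = [
    # Assumption: existing workstation, so no new hardware; 7.5X is the midpoint of 5-10X.
    Option("Port to C++", 120_000, 0, 0, 7.5, 6),
    # Assumption: 15 months is the midpoint of the quoted 12-18 months.
    Option("C++ with MPI", 900_000, 100_000, 0, 60, 15),
    Option("Star-P", 20_000, 200_000, 50_000, 60, 1),
]

for opt in OPTIONS:
    print(f"{opt.name:14s} total=${opt.total_usd:>9,}  "
          f"$/X={opt.usd_per_speedup:>9,.0f}  months={opt.calendar_months}")
```

The dollars-per-speedup figure is only one way to rank the options; calendar time is reported separately because many projects are constrained by it rather than by budget.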
So this new programming model offers us the flexibility to trade off labor cost and time savings against hardware costs. Because many projects are constrained by calendar time and available technical resources, a solution such as Star-P offers a way to radically transform the “cost of performance” equation. Furthermore, this assessment only covers the short-term costs of the parallel port. In fact, most software costs are in maintenance of a code over time. In that case, the VHLL benefits of Star-P (faster and hence cheaper software development) will continue to pay off time after time, whereas the MPI-based approach will continue to cost substantially more.
I am curious to hear your feedback on the argument laid out here, and how it may relate to your projects:
- What do you do to increase performance for codes written in MATLAB®, Python, R, and other VHLLs?
- How long do your parallel ports take, with what size team?
- What are your thoughts on the notion of trading hardware efficiency for calendar time and labor costs?
Multicore chips have received tons of attention recently from industry pundits. What is multicore and why should you as a scientist or engineer who uses computing care about it?
Multicore refers to the physical placement of more than one core on a single chip. Note that terminology can be confusing here, as hardware people tend to refer to the chip as a “processor” while software people tend to refer to the core as the “processor”. I will use the term core to refer to what we all called a “processor” ten years ago (i.e., the thing that can execute a program, has registers and memory access, etc.) and the term socket to refer to the whole chip. Multicore has happened because the hardware architects have run out of ways to use the exponential growth in transistors per unit space (made famous as Moore’s Law) to make a single core faster, and so have just put in multiple instances of a fast core on a single die. [The Landscape of Parallel Computing Research: A View From Berkeley, Multicore Programming Primer]
Given that the growth in transistors doesn’t appear to be slowing down and that the number of transistors in a core appears to have reached its asymptote, those extra transistors will be consumed by more cores, with the number of cores growing exponentially over the next several years. If we tag 2006 as the widespread advent of general-purpose multicore (dual-core) chips and extrapolate with Moore’s Law (a factor of 4 every 3 years), that says 2009 will bring 8-core sockets and 2012 32-core sockets. And those cores may not be getting much faster from generation to generation, unlike the last several years, so delivered performance improvements will depend heavily on making use of those extra cores. In a computing world where parallelism was a niche technology until recently, this is a Big Deal.
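The extrapolation above is easy to reproduce. The following Python sketch simply applies the stated rule (2 cores in 2006, a factor of 4 every 3 years); the function and parameter names are made up for this illustration, and the result is a projection under that rule, not a vendor roadmap.

```python
def projected_cores(year, base_year=2006, base_cores=2, factor=4, period_years=3):
    """Cores per socket if the count grows by `factor` every `period_years`."""
    return base_cores * factor ** ((year - base_year) / period_years)


for year in (2006, 2009, 2012):
    print(year, round(projected_cores(year)))
```

Changing `factor` or `period_years` shows how sensitive the 2012 figure is to the assumed growth rate.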
Multicore chips come in several flavors. Intel and AMD build general-purpose multicore sockets [Intel quad core, AMD quad core] whose cores execute the x86-64 instruction set and today are cache coherent across the socket. A number of other vendors build multicore chips with other instruction sets, sometimes for special purposes, and they have already pushed core counts higher.
- SiCortex, for instance, builds a low-power 6-core chip today based on the MIPS instruction set.
- Sun’s UltraSPARC T2 processor has 8 cores, each capable of running 8 threads quasi-simultaneously for a total of 64 threads on a socket.
- The IBM/Sony/Toshiba Cell processor has one general-purpose core accelerated by 8 GPU-like cores.
- Tilera just announced its 64-core TILE64™ Tile Processor.
- The NVIDIA Tesla™ graphics processing unit (GPU) has 128 cores on a socket.
These special-purpose sockets often do not provide cache coherence across the whole chip, which makes them simpler to design and less power-hungry, and this probably gives a hint for where future general-purpose sockets will go as well. (Other hardware innovations like transactional memory [http://www.theregister.co.uk/2007/08/21/sun_transactional_memory_rock/] may improve the primitives used for synchronization, but it’s far from clear that those will be simple enough for use by a typical scientist or engineer who doesn’t know much about parallelism.)
All of these cores represent lots of computing that can potentially be done on a chip, with “potentially” being the operative word. To make a single application run faster on a multicore chip, it has to be structured to take advantage of multiple cores. Outside of the rarified world of high performance computing (HPC) and enterprise applications like commercial databases, not many applications are ready to run in parallel.
Further, lots of the attention for multicore chips has focused on the peak performance of the cores aggregated together. For instance, system vendors using AMD’s Barcelona quad-core socket focus on the peak speed for floating-point ops, 16 GFLOPS with a 2GHz clock. Like all good marketing organizations, the hardware vendors are trying to focus our attention on the improved attributes of their new products while obscuring the not-so-improved attributes. Here lies one of the major pitfalls of the multicore sockets – they typically don’t have enough memory bandwidth to support the high FLOP rates of the cores. The hardware vendors aren’t venal in this change, they’re just doing what’s practical. Adding computation (more cores) is relatively straightforward and inexpensive. Adding bandwidth (more and faster off-chip pins) is difficult and expensive.
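One common way to reason about the bandwidth pitfall is a simple "roofline"-style bound: a kernel can go no faster than the peak FLOP rate, and no faster than memory bandwidth multiplied by the number of floating-point operations it performs per byte moved. The Python sketch below uses the 16 GFLOPS peak quoted above; the 10 GB/s bandwidth and the two example kernels are purely hypothetical values chosen for illustration, not measurements of any real socket.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte):
    """Upper bound on delivered GFLOPS: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


PEAK_GFLOPS = 16.0        # peak figure quoted in the text
BANDWIDTH_GB_PER_S = 10.0 # hypothetical off-chip bandwidth, for illustration only

# Hypothetical kernels: (name, floating-point operations per byte of memory traffic)
KERNELS = [
    ("streaming vector add (1 flop per 24 bytes)", 1 / 24),
    ("cache-blocked kernel (8 flops per byte)", 8.0),
]

for name, intensity in KERNELS:
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_PER_S, intensity)
    print(f"{name}: at most {bound:.2f} of {PEAK_GFLOPS:.0f} GFLOPS")
```

Under these assumed numbers, a streaming kernel is limited by bandwidth to a small fraction of peak, while a kernel that reuses data in cache can approach it; adding cores raises the peak but not the bandwidth term.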
Until somebody comes up with the next great idea in computers, we will see continued confirmation of Anant Agrawal’s mantra “Computation is cheap, [off-chip] communication is expensive.” Is this description consistent with your view of the future hardware landscape? Have I missed anything important?
The impact of these chips on you as an engineer or scientist mainly depends on how you do your computing today.
- If you do most of your work on the desktop, you have probably used or written serial applications and may still be doing so. Changing from that world to an 8-core/socket world by 2009 is a huge shift, whose effects are hard to overstate. One way to look at that is that if you run a serial application on an 8-core socket, you’ll only be running at about 12.5% (one-eighth) of the potential speed of your chip. Not many disciplines or industries are so uncompetitive that one can afford to give away a factor of 8 and survive.
- If you do most of your work on HPC systems, you’ve probably been using or writing parallel applications, so multicore may not be such a drastic change from what you’re used to. But the physical realities of the new chips will have a major qualitative effect on programming HPC systems built from them. First, the number of available cores will stress the scaling of the algorithms in most existing parallel applications. (I.e., your nicely performing 32-core program in 2005 will need to be running just as well on 1,024 cores by 2012 to keep pace with Moore’s Law; not many people know how to do that.) Second, the performance characteristics of the multicore chips will require parallel programs to schedule remote communication much the way that I/O-intensive workloads schedule I/O today; careful identification of data that can be pre-fetched, with the hope that sufficient pre-fetching will result in data being available soon enough to avoid excessive waiting for it.
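The pre-fetching idea in the second point can be sketched in a few lines. The Python example below is only a structural illustration, not MPI code: `fetch_chunk` is a made-up stand-in for remote communication (it just sleeps and returns numbers), and a single background thread requests the next chunk while the current one is being processed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(index, size=4, delay_s=0.01):
    """Hypothetical stand-in for a remote fetch: wait, then return a chunk of data."""
    time.sleep(delay_s)
    start = index * size
    return list(range(start, start + size))


def compute(chunk):
    """Stand-in for the local computation on one chunk."""
    return sum(x * x for x in chunk)


def run_blocking(num_chunks):
    """Fetch, then compute, one chunk at a time: every fetch is waited for in full."""
    total = 0
    for i in range(num_chunks):
        total += compute(fetch_chunk(i))
    return total


def run_prefetching(num_chunks):
    """Request chunk i+1 in the background before computing on chunk i."""
    if num_chunks == 0:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_chunk, 0)
        for i in range(num_chunks):
            chunk = pending.result()           # wait only if the data is not here yet
            if i + 1 < num_chunks:
                pending = pool.submit(fetch_chunk, i + 1)  # start the next fetch early
            total += compute(chunk)            # overlaps with the fetch in flight
    return total


print(run_blocking(5), run_prefetching(5))
```

Both versions produce the same answer; the second differs only in when the communication is scheduled, which is the extra bookkeeping the text expects parallel programs to take on. With the trivial `compute` used here there is little to overlap, so the sketch shows the structure rather than a speedup.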
So why would we as an industry put up with all this grief from multicore chips? Simple – we want programs to run way faster, and they offer the best near-term path to get there.
So I’ve covered the basics of multicore chips and their tremendous potential and notable pitfalls. For most of us, we have little alternative but to figure out how to use these chips to their utmost. Let’s look at how one might program these chips effectively.
SEA-CHANGE FOR DESKTOP PROGRAMMERS
If you run applications you get from other organizations on your desktop, your fate is in their hands. You may want to understand what those suppliers are doing about multicore, so that you can look for alternatives if they’re not responding to this technology shift. If you write your own applications for the desktop, however, you’ll be facing the challenge of multicore yourself, and so probably have some new requirements for the programming tools you use. Assuming you’re not already conversant with parallel programming, you’ll probably want a gentle initiation. Unfortunately, “gentle” and “parallel programming” are rarely used in the same sentence without a “not”. :) [Note that with Star-P, we strive to do much better than prior approaches.]
For the rest of this discussion, we’ll assume you’re programming in a very high-level language (VHLL), such as Python, MATLAB®, Mathematica, R, etc. First, you’ll want an easy way to identify the parts of your program that consume all the time. Then you’ll want straightforward mechanisms that allow you to change only a few places in your program to get it to run in parallel, using all the cores in your desktop computer. And you’ll need to debug any errors you make along the way and look at the resulting performance to know whether it’s sufficient.
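For the first of those steps, finding where the time goes, most VHLLs ship with a profiler. As one hedged example, the Python sketch below uses the standard-library `cProfile` and `pstats` modules; the functions `slow_kernel`, `quick_setup` and `workload` are invented placeholders for your own program.

```python
import cProfile
import io
import pstats


def slow_kernel(n):
    """Placeholder hot spot: a plain loop that dominates the run time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def quick_setup(n):
    """Placeholder for cheap setup work."""
    return list(range(n))


def workload():
    quick_setup(1_000)
    return slow_kernel(200_000)


def profile_report(func, limit=5):
    """Run `func` under the profiler and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()


report = profile_report(workload)
print(report)
```

The functions near the top of the report are the natural candidates for the "few places" you would change to run in parallel.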
Popping up a level, you’ll want to know that the work you’re doing for parallelism will be sustainable to the next generations of chips that have even more cores, and that you can further refine your program to expose more parallelism rather than starting from scratch. And, given the work you’re putting into going parallel, you may want the flexibility to run larger problems that don’t fit on your desktop, without having to redo all your work.
- If you’re a VHLL programmer, do you understand parallelism well enough to see how you could structure your application to expose the parallelism, or would you need help?
- What type of information would you expect a tool to give you?
SCALE CHANGE FOR HPC PROGRAMMERS
If you’re already doing parallel programming for HPCs, let’s assume that, like most people, you’re using the Message Passing Interface (MPI) with C++/C or Fortran. For you, multicore primarily means changes in scale and performance balance. By scale, I mean that the number of cores you need to be able to use effectively, to keep pace with Moore’s Law, will grow exponentially. Compounding this effect is the relative reduction in bandwidth that has happened with early multicore chips and which we expect to continue. Some of my colleagues at ISC and I believe this will have dramatic impacts on how people will do detailed parallel programming. Whereas a 2005-era parallel program may have been able to survive with a simple sequence of steps, all done cooperatively by the executing cores, a 2009- or 2012-era parallel program will have to be much more aware of the scheduling of computation and communication between the cores to tolerate the inter-chip latencies and yield good performance. In many cases this will result in a single program step from the 2005-era program being deconstructed to smaller parts, which will then be scheduled with prefetching, overlapping of computation and communication, and synchronization. Implementing these types of needed optimizations will be difficult using a low-level interface like MPI, and we believe that you’ll want new abstractions to make this a tractable problem.
- If you’re an MPI programmer, are you intending to continue with it or are you considering shifting to new interfaces to cope with multicore?
While Star-P already runs on current multicore x86-64 and Itanium chips, we’re currently in the midst of designing changes to Star-P that will address multicore chips more robustly. These are the issues we see facing you as a programmer for multicore systems; we’re using them as requirements for our design work.
- Does this match what you think you’ll want from your parallel programming tools?
- If you had a very-potent-but-not-quite-magic wand, what would you want those tools to do?
[Next time: Expressing parallelism in Star-P.]
At Interactive Supercomputing, our mission is to improve productivity in
high performance technical computing. Of course, “productivity” means
very different things to different people, depending on the nature of
the project, the model or simulation, available hardware and
programming resources, available time, capital, etc. What I’d like to
do here is outline some common scenarios and definitions we have seen
from our customer base, and solicit your feedback on how you define
productivity.
Although each situation is unique, as we see more and
more customers and prospect scenarios the goals they seek seem to
naturally fall into one of roughly five categories:
1. “Breakthrough
acceleration at minimal effort.” These applications run too slowly on a
desktop, and faster computation – say, 10X faster – would enable a
breakthrough. In one application we’ve seen, it took 45 minutes to
analyze an MRI scan of a brain. Cutting this down to just 5 minutes not
only can alleviate some patient anxiety in waiting for a diagnosis, but
is also quick enough to make another scan if needed while the patient
is still in the MRI exam. And in a very different application –
financial portfolio optimization – we saw that accelerating the
rebalancing of a portfolio enabled many more portfolios to be optimized
in time for next day’s trading.
2. “Working with bigger data sets”:
Similar to the “breakthrough acceleration” scenario above, the goal
here is the ability to run a larger model – one that may not fit on a
desktop. Even with 64-bit memory addressing, the required memory
footprint may be too much for a top-of-the-line workstation (that today
might top out at 8-16 GB of RAM), and the distributed memory of a
cluster may be a practical alternative.
3. “Time-to-solution” on an
important project. This can be on a critical path for a project, and
compressing the calendar time is key. Because application development
often makes up the vast majority of such projects (sometimes >75% of
the calendar time), compressing the programming workflow has enormous
leverage on the project goals.
4. Time and effort required for
“algorithmic exploration.” This is an interesting one. We see cases
when a customer is unsure of the final algorithm, but wants to be able
to prototype something quickly, run the model at full scale, and then
play with the algorithms and data sets. Roughly speaking, when
computation times take too long, there is no time to interact with the
problem – but at high performance speeds, interactivity becomes
possible not just on the data itself, but even on the approaches to
modeling the data.
5. “X Flops at Y hardware efficiency.” Some codes
are built to be run nearly continuously for years (even decades) – for
many different iterations, new input parameters, etc. We’ve heard of
codes at national labs that would literally take over 10,000 years to
run on a serial computer. They must therefore be run in parallel, on
a large machine, with high efficiency.
So, depending on the nature of
the project, available resources and constraints, scientists and
engineers are turning to high performance computers with different
goals in mind, and thus different expectations and definitions of
productivity.
Defining Productivity
Conceptually, one way to define productivity is as follows:
Productivity = (application performance) / (application programming effort)
Qualitatively,
for a given programming approach, the more effort is expended, the more
performance can be achieved. The exact nature of this, of course,
depends on the approach:
1. MPI: MPI programming typically involves
long development periods, and discrete jumps in application performance
as new revisions are completed. It can take months and even years
before the first rev of a complex MPI application is completed. And it
may take weeks or months to tune the performance further. That said,
MPI codes typically achieve good performance, arguably at a high
development cost (in terms of both time and effort).
2. Desktop
Tools: On the other hand, very high level languages (such as Python,
or MATLAB® from The MathWorks) are ideally suited for rapid
application development and interactive algorithm refinement, and many
performance improvement techniques (memory pre-allocation,
vectorization, etc.) can be applied relatively quickly, often within hours
or days. But on the desktop these tools increasingly run out of steam,
driving the scientists and analysts towards parallel servers and
clusters.
3. Star-P: Which brings us to the new parallel programming
model enabled by software such as ours. The idea is to use highly
expressive tools – such as MATLAB®, Python and R – and with a handful
of language extensions execute the simulations on parallel servers and
clusters. This rapidly delivers a good fraction of a parallel
computer’s capability, in terms of both speedup, and the ability to
work with larger data sets. But it comes at a price: the first
iteration – although ready months ahead of a corresponding MPI
implementation – may not perform as efficiently. In some cases that may
not be critical, if users can quickly get the advantages of the
parallel processors and big memory of an HPC system. And if it is important,
serial and parallel performance can be optimized with additional
effort, and can approach MPI efficiencies. In fact, if it’s
sufficiently important, efficient MPI codes can be plugged in via the
SDK.
So, in each case, there is some trade-off between effort and
achievable performance, although the three approaches may have
qualitatively different curves:

Questions for YOU...
So with all that said, here are some questions we would love to hear from you on:
• How do you define and measure performance? Speed of calculation?
Ability to run a model you previously could not? Something else?
• How do you define productivity in high performance technical computing?
• What might be a reasonable trade-off between performance and human
effort? For example, which would you prefer, and for what kinds of
projects:
1. The ability to run 5X faster than the desktop after 1 day of coding
2. The ability to run 25X faster than the desktop after 1 week of coding
3. The ability to run 125X faster than the desktop after 3 weeks of coding

• In a similar manner, if you come at this from a perspective of peak
hardware efficiency, what might be a reasonable trade-off between
performance and human effort? Again - which would you prefer, and for
what kinds of projects:
1. The ability to run at 90% hardware efficiency after 3 months of coding
2. The ability to run at 50% hardware efficiency after 3 weeks of coding
3. The ability to run at 30% hardware efficiency after 3 days of coding

• What are the key determinants in placing a project or target goal along the curve?
- Parallel programming experience of scientist/engineer/team?
- Size of programming team?
- Importance of accelerating time-to-solution?
- Phase of the project (early algorithm exploration versus later production runs)?
- Importance of interactive algorithm development?
- Memory required for the computation?
- Available hardware resources?
- Hardware efficiency / utilization?
- Something else?
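As a starting point for the speed-versus-effort question above, the productivity ratio defined earlier (application performance divided by programming effort) can be applied to the three speedup choices. This Python sketch assumes that a week means 5 working days of coding and measures effort in days; both choices, and all names in the code, are assumptions made only for illustration.

```python
WORKING_DAYS_PER_WEEK = 5  # assumption: a week of coding is 5 working days


def productivity(speedup, coding_days):
    """Productivity = (application performance) / (application programming effort)."""
    return speedup / coding_days


# (label, speedup over the desktop, coding effort in days)
CHOICES = [
    ("5X after 1 day", 5, 1),
    ("25X after 1 week", 25, 1 * WORKING_DAYS_PER_WEEK),
    ("125X after 3 weeks", 125, 3 * WORKING_DAYS_PER_WEEK),
]

for label, speedup, days in CHOICES:
    print(f"{label}: {productivity(speedup, days):.2f}X per coding day")
```

A single ratio like this ignores whether a given speedup is enough to enable the breakthrough you need, which is why the questions above ask about the kind of project as well.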