One compelling reason to use Star-P is greatly improved productivity. Something that came as a surprise to us is the tremendous interest in Star-P internationally. Initially, I believed that the largest interest would be in developed economies with established engineering firms and universities, such as the US, the UK, Germany, France, Japan, and Australia. Thus, I was pleasantly surprised when I heard from our sales team about strong interest from emerging economies such as India, China, Brazil, Greece, and Spain.
Countries like India and China are rapidly developing; playing catch-up definitely has its advantages. For instance, many Asian countries have more modern telecommunication networks than the US. Starting late allows allocating resources to newer technology and learning from the experiences of the trailblazers. Each country also faces its own unique set of challenges and competitive pressures. For example, Ajay Shah, one of India's top economists, says the following about cell phone companies in India: "Roughly a decade ago, the standard engineering solutions that came from international telecom vendors induced prices for mobile telephony like USD 0.1 per minute. In India, there was a unique bulge of customers who were only available at lower prices. This market reality, coupled with competitive pressure, prompted Indian mobile phone vendors to resort to an array of hardware and software innovations which have induced the lowest cost of mobile telephony in the world." Is something similar happening with high performance computing? Are firms and universities in the rest of the world leapfrogging older tools and technologies with a clean start? Where will the next generation of innovations come from?
A quick look at the statistics on the Top500 list shows that the developing nations are rapidly catching up. In 2001, China had 3 entries on the Top500, Brazil had 2, Russia had 1, and India had 0. Fast forward to 2007: China has 10, India has 9, Russia has 7, and Brazil has 1. India's placing at number 4 on the latest Top500 list also generated some headlines. I wrote an opinion piece for the Financial Express in India with my views on the topic. The presence of the BRIC countries at the 2008 SIAM Parallel Processing meeting in Atlanta is further evidence of this trend.
The cost of entry into high performance computing is quite high. After spending large budgets on Top500 computers, how do you program them? The top US universities train some of the world's best parallel programmers. For an idea of what it takes to get started, see these classes at MIT, UC Santa Barbara, and UC Berkeley. Having been a teaching assistant for some of these, I believe that no more than 50 Computer Science and Engineering students take these classes every year at a given university. Thus, perhaps 500 students receive rigorous training in high performance scientific computing every year. So, how will the rest of the world program these computers? Platforms such as Star-P bring high performance computing to the masses.
It is also refreshing to see that programs such as the DARPA/DOE HPCS are emphasizing productivity along with performance. This has resulted in three new languages: IBM's X10, Sun's Fortress, and Cray's Chapel. For a systematic approach to measuring productivity, see the work published by the HPCS productivity team. My views on the subject, of course, are in my thesis. Are programmers going to embrace these brand new programming languages, which lack the kind of library support and rich experience that desktop environments such as MATLAB, Mathematica, R, and Python provide? We'll have to wait a few years for the answer. In the meantime, platforms such as Star-P are rapidly bridging the gap between productivity and performance.
Readers of this blog may recall an earlier posting on Circuitscape and landscape ecology. In short, Circuitscape uses circuit theory to predict patterns of connectivity, movement, and gene flow among plant and animal populations in heterogeneous landscapes. Circuitscape was originally written in Java. It has since been rewritten in MATLAB® and works without modification in Star-P.
UC Santa Barbara's College of Engineering featured our work in the Spring 2008 issue of Convergence magazine. Convergence is an award-winning magazine with a circulation of roughly 20,000. The article sparked interest among faculty at UCSB, creating opportunities for further collaboration. Watch this space for more updates on this exciting interdisciplinary project.
As data sets grow and algorithms become increasingly complex, engineers, scientists, and analysts need more performance. The first step is often to rewrite the algorithm, originally coded in MATLAB® or another very high level language (VHLL), in a lower-level language such as C, C++, or Fortran. A typical project may take several months and yield a 5-10X performance gain on a typical workstation (Option 1 below).
Option 1: Port to C++
- ~6 person-months (~$120,000)
- 5-10X gain in performance on a single processor
- Calendar time to solution: 6 months
- Does not scale beyond a single processor without further work
In Option 1, note that this serial programming effort does not scale “for free” beyond a single processor. So while the 5-10X gain is a perfectly acceptable target for many projects, those needing more performance will need to take a different approach.
To increase performance further, one can turn to clustered servers. Of course, this typically involves some degree of parallel programming with the relatively low-level paradigm of message passing (MPI), sometimes combined with shared-memory threading (OpenMP) within a node.
Here’s some data from a recent survey we carried out, asking 25 organizations about their MPI-based development efforts; presented below are the histograms of team size and project length. While parallel programming projects vary widely, it is common to see teams of several engineers working for 1-2 years.
So, let’s consider a fairly typical example, in which the required computing power outstrips a single desktop and the decision is made to develop an MPI-based application running on HPC clusters.
Option 2: Port to C++, with message passing (MPI)
- Total incremental investment: $1,000,000
- ~48 person-months (~$900,000)
- 60X gain on a 128-core server (~$100,000)
- Calendar time to solution: 12-18 months
- Scales with more hardware, if higher performance is desired
Recently, a new programming paradigm has become available: using existing VHLL code developed on a desktop, but extended to HPC server clusters with the Star-P software platform. This approach eliminates the C/MPI programming, and instead requires some incremental coding in the familiar MATLAB environment, leveraging much of the application’s existing code base. One can learn the handful of tags and commands within several days, and within a few weeks typical codes can be parallelized to run on the cluster.
For a number of reasons, the processor utilization and compute efficiency may not be as high as on a “hand-tuned” custom MPI code, so a somewhat larger server may be necessary. (Fortunately, hardware is cheap, and getting cheaper.) Here’s how the numbers come out for the typical case:
Option 3: Star-P extends MATLAB® to HPC Servers w/o MPI
- Total incremental investment: $270,000
- 1 person-month (~$20,000)
- 60X gain on a 256-core server (~$200,000)
- Star-P license ($50,000)
- Calendar time to solution: 1 month
- Scales with more hardware, if higher performance is desired
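The arithmetic behind the three options can be tallied in a few lines. The sketch below is only illustrative: the `Option` class and its field names are made up for this example, Option 1 is assumed to need no new hardware, and its 5-10X range is represented by the midpoint 7.5X. All dollar figures are the approximate ones listed above.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One porting strategy, using the approximate figures from this post."""
    name: str
    labor_usd: int
    hardware_usd: int
    license_usd: int
    speedup: float  # gain relative to the original desktop code

    @property
    def total_usd(self) -> int:
        # Total incremental investment: labor + hardware + software license.
        return self.labor_usd + self.hardware_usd + self.license_usd

    @property
    def usd_per_x(self) -> float:
        # Dollars spent per unit of speedup (lower is better).
        return self.total_usd / self.speedup


options = [
    # Assumption: Option 1 reuses the existing workstation; 7.5X is the
    # midpoint of the quoted 5-10X range.
    Option("1: port to C++", 120_000, 0, 0, 7.5),
    Option("2: C++ with MPI", 900_000, 100_000, 0, 60.0),
    Option("3: Star-P", 20_000, 200_000, 50_000, 60.0),
]

for opt in options:
    print(f"Option {opt.name}: total ${opt.total_usd:,}, "
          f"${opt.usd_per_x:,.0f} per 1X of speedup")
```

Under these assumptions the totals come to $120,000, $1,000,000, and $270,000, or roughly $16,000, $16,667, and $4,500 per unit of speedup. Calendar time is left out of the sketch; it favors Option 3 further.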
So this new programming model offers us the flexibility to trade off labor cost and time savings versus hardware costs. Because many projects are constrained by calendar time and available technical resources, a solution such as Star-P offers a way to radically transform the “cost of performance” equation.
Furthermore, this assessment only covers the short-term costs of the parallel port. In fact, most software costs are in maintenance of a code over time. In that case, the VHLL benefits of Star-P (faster and hence cheaper software development) will continue to pay off time after time, whereas the MPI-based approach will continue to cost substantially more.
I am curious to hear your feedback on the argument laid out here, and how it may relate to your projects:
- What do you do to increase performance for codes written in MATLAB®, Python, R, and other VHLLs?
- How long do your parallel ports take, with what size team?
- What are your thoughts on the notion of trading hardware efficiency for calendar time and labor costs?
Tags: parallel programming, sparse matrix, combinatorial computing, matlab, computational ecology, UCSB, NCEAS, productivity, circuit theory, graphs, large
Researchers at the University of California, Santa Barbara (UCSB) are harnessing supercomputers and electronic circuit theory to help save wildlife from ever-shrinking habitats in an emerging scientific field called "computational ecology." The project is run by the University's National Center for Ecological Analysis and Synthesis (NCEAS).
NCEAS scientists are applying electronic circuit theory to model wildlife migration and gene flow across fragmented landscapes. The research could be instrumental in smart conservation planning, helping organizations decide which lands to preserve or restore - and where to best invest their tight conservation budgets - in order to preserve habitat and connectivity for wildlife populations.
Large Data Sets
Due to the massive volume of landscape data and the novel application of algorithms from circuit theory, NCEAS is working to speed up their code using state of the art sparse linear solvers, graph computations, vectorization, and parallelization. The result is a dramatic reduction in computing time from days to minutes on their 8-core server.
"It turns out that circuit theory shares a surprising number of properties with ecological theory describing animal movements and connectivity," says Brad McRae, the NCEAS project leader. "We can now represent landscapes as conductive surfaces - with features like forests and highways having different resistance to movement - and analyze connectivity across them using powerful circuit algorithms. Unlike standard conservation planning tools, these algorithms simultaneously incorporate all possible pathways when predicting how corridors, barriers, and other features affect movement and gene flow over large areas."
 
Corridors are areas that connect important habitats in human-altered landscapes. They provide natural avenues along which animals can travel, plants can propagate, genetic interchange can occur, species can move in response to environmental changes and natural disasters, and threatened populations can be replenished from other areas. A good example is "Y2Y," or the Yellowstone to Yukon corridor, where U.S. and Canadian conservation organizations are trying to identify which habitats to conserve to protect species from harmful decline or extinction.
Application to Multiple Species
In applying their software to these problems, NCEAS scientists have modeled mountain lion movements in Southern California to identify important connective habitats and corridors. In Central America they modeled how habitat connectivity affects gene flow among threatened populations of mahogany throughout the species' range. They are also analyzing connectivity among populations of wolverines, kit foxes and jaguars. For each species, researchers analyze geographic datasets representing habitat suitability over vast areas - in some cases spanning entire continents.
The challenge was choosing between how large or how finely-scaled the maps should be, explained McRae. "Even a relatively small region like the three-county area of Southern California can contain millions of raster cells, but our computing resources limited how finely we could grid those locations. While a mountain lion might perceive its habitat at a scale of about 100 meters, we originally had to increase the cell sizes to around a kilometer to keep our data requirements manageable," he said. "And even at these lower resolutions, running the models on a single-processor computer without optimized code took three days to complete."
 Simulated connectivity among core habitat areas for mountain lions (courtesy Brett Dickson and Rick Hopkins, Live Oak Associates)
Working with Large Graphs
A key step of the NCEAS simulations is a computation on a large graph (or network) that represents the connectivity of the landscape. UCSB Computer Scientist Viral Shah worked with the NCEAS researchers to integrate their code with state of the art sparse linear solvers and the graph toolbox. Scientists can now model larger landscapes with much finer grids, while cutting computing time from days to minutes. The trend is for more applications to combine numerical and combinatorial methods to solve a problem, and tools like Star-P provide a convenient unified platform for numerical and combinatorial computation.
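To give a flavor of the kind of graph computation involved, here is a toy sketch of effective resistance between two nodes of a small resistor network. It is not the Circuitscape code: the function name and the four-node example are invented for illustration, the graph is assumed to be connected, and a dense Gaussian elimination stands in for the sparse linear solvers a real landscape with millions of cells would need.

```python
def effective_resistance(n, edges, source, ground):
    """Effective resistance between `source` and `ground`.

    `edges` is a list of (u, v, conductance) tuples on nodes 0..n-1.
    Assumes the graph is connected and source != ground.
    """
    # Build the dense graph Laplacian from the edge conductances.
    lap = [[0.0] * n for _ in range(n)]
    for u, v, g in edges:
        lap[u][u] += g
        lap[v][v] += g
        lap[u][v] -= g
        lap[v][u] -= g

    # Ground one node (drop its row and column) and inject 1 amp at source.
    keep = [i for i in range(n) if i != ground]
    m = len(keep)
    aug = [[lap[i][j] for j in keep] + [1.0 if i == source else 0.0]
           for i in keep]

    # Gaussian elimination with partial pivoting on the augmented matrix.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]

    # Back substitution gives the node voltages.
    volts = [0.0] * m
    for r in range(m - 1, -1, -1):
        rest = sum(aug[r][c] * volts[c] for c in range(r + 1, m))
        volts[r] = (aug[r][m] - rest) / aug[r][r]

    # With 1 amp injected, the source voltage equals the resistance.
    return volts[keep.index(source)]


# Two parallel two-hop paths from node 0 to node 3, every edge 1 ohm.
square = [(0, 1, 1.0), (1, 3, 1.0), (0, 2, 1.0), (2, 3, 1.0)]
r_eff = effective_resistance(4, square, source=0, ground=3)
print(f"effective resistance: {r_eff:.3f} ohm")
```

Each path in the example has 2 ohms of resistance, and two such paths in parallel give 1 ohm. This is how the circuit model accounts for all pathways at once, with additional pathways lowering the resistance between two habitats.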


At
Interactive Supercomputing, our mission is to improve productivity in
high performance technical computing. Of course, “productivity” means
very different things to different people, depending on the nature of
the project, the model or simulation, available hardware and
programming resources, available time, capital, etc. What I’d like to
do here is outline some common scenarios and definitions we have seen
from our customer base, and solicit your feedback on how you define
productivity.
Although each situation is unique, as we see more and
more customer and prospect scenarios, the goals they seek seem to
fall naturally into one of roughly five categories:
1. “Breakthrough
acceleration at minimal effort.” These applications run too slowly on a
desktop, and faster computation – say, 10X faster – would enable a
breakthrough. In one application we’ve seen, it took 45 minutes to
analyze an MRI scan of a brain. Cutting this down to just 5 minutes not
only can alleviate some patient anxiety in waiting for a diagnosis, but
is also quick enough to make another scan if needed while the patient
is still in the MRI exam. And in a very different application –
financial portfolio optimization – we saw that accelerating the
rebalancing of a portfolio enabled many more portfolios to be optimized
in time for next day’s trading.
2. “Working with bigger data sets”:
Similar to the “breakthrough acceleration” scenario above, the goal
here is the ability to run a larger model – one that may not fit on a
desktop. Even with 64-bit memory addressing, the required memory
footprint may be too much for a top-of-the-line workstation (that today
might top out at 8-16 GB of RAM), and the distributed memory of a
cluster may be a practical alternative.
3. “Time-to-solution” on an
important project. This can be on a critical path for a project, and
compressing the calendar time is key. Because application development
often makes up the vast majority of such projects (sometimes >75% of
the calendar time), compressing the programming workflow has enormous
leverage on the project goals.
4. Time and effort required for
“algorithmic exploration.” This is an interesting one. We see cases
when a customer is unsure of the final algorithm, but wants to be able
to prototype something quickly, run the model at full scale, and then
play with the algorithms and data sets. Roughly speaking, when
computation times take too long, there is no time to interact with the
problem – but at high performance speeds, interactivity becomes
possible not just on the data itself, but even on the approaches to
modeling the data.
5. “X Flops at Y hardware efficiency.” Some codes
are built to be run nearly continuously for years (even decades) – for
many different iterations, new input parameters, etc. We’ve heard of
codes at national labs that would literally take over 10,000 years to
run on a serial computer. They must therefore be run in parallel, on
large machines, with high efficiency.
So, depending on the nature of
the project, available resources and constraints, scientists and
engineers are turning to high performance computers with different
goals in mind, and thus different expectations and definitions of
productivity.
Defining Productivity
Conceptually, one way to define productivity is as follows:
Productivity = (application performance) / (application programming effort)
Qualitatively,
for a given programming approach, the more effort is expended, the more
performance can be achieved. The exact nature of this, of course,
depends on the approach:
1. MPI: MPI programming typically involves
long development periods, and discrete jumps in application performance
as new revisions are completed. It can take months and even years
before the first rev of a complex MPI application is completed. And it
may take weeks or months to tune the performance further. That said,
MPI codes typically achieve good performance, arguably at a high
development cost (in terms of both time and effort).
2. Desktop
Tools: On the other hand, very high level languages (such as Python,
or MATLAB® from The MathWorks), are ideally suited for rapid
application development and interactive algorithm refinement, and a
good deal of performance improvement techniques (memory pre-allocation,
vectorization, etc.) can be done relatively quickly, often within hours
or days. But on the desktop these tools increasingly run out of steam,
driving the scientists and analysts towards parallel servers and
clusters.
3. Star-P: Which brings us to the new parallel programming
model enabled by software such as ours. The idea is to use highly
expressive tools – such as MATLAB®, Python and R – and with a handful
of language extensions execute the simulations on parallel servers and
clusters. This rapidly delivers a good fraction of a parallel
computer’s capability, in terms of both speedup, and the ability to
work with larger data sets. But it comes at a price: the first
iteration – although ready months ahead of a corresponding MPI
implementation – may not perform as efficiently. In some cases that may
not be critical, if users can quickly get the advantages of the
parallel processors and big memory of an HPC. And if it is important,
serial and parallel performance can be optimized – with additional
effort, and can approach MPI efficiencies. In fact, if it’s
sufficiently important, efficient MPI codes can be plugged in via the
SDK.
So, in each case, there is some trade-off between effort and
achievable performance, although the three approaches may have
qualitatively different curves:

Questions for YOU...
So with all that said, here are some questions we would love to hear from you on:
• How do you define and measure performance? Speed of calculation?
Ability to run a model you previously could not? Something else?
• How do you define productivity in high performance technical computing?
• What might be a reasonable trade-off between performance and human
effort? For example, which would you prefer, and for what kinds of
projects:
1. The ability to run 5X faster than the desktop after 1 day of coding
2. The ability to run 25X faster than the desktop after 1 week of coding
3. The ability to run 125X faster than the desktop after 3 weeks of coding

• In a similar manner, if you come at this from a perspective of peak
hardware efficiency, what might be a reasonable trade-off between
performance and human effort? Again - which would you prefer, and for
what kinds of projects:
1. The ability to run at 90% hardware efficiency after 3 months of coding
2. The ability to run at 50% hardware efficiency after 3 weeks of coding
3. The ability to run at 30% hardware efficiency after 3 days of coding

• What are the key determinants in placing a project or target goal along the curve?
- Parallel programming experience of scientist/engineer/team?
- Size of programming team?
- Importance of accelerating time-to-solution?
- Phase of the project (early algorithm exploration versus later production runs)?
- Importance of interactive algorithm development?
- Memory required for the computation?
- Available hardware resources?
- Hardware efficiency / utilization?
- Something else?
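As a starting point for the speedup question above, the three choices can be run through the conceptual ratio defined earlier (performance divided by programming effort). This is a rough sketch, not a recommendation: it assumes a 5-day working week, measures effort in working days, and the function and variable names are made up for the example.

```python
WORKING_DAYS_PER_WEEK = 5  # assumption: effort is counted in working days


def productivity(speedup: float, effort_days: float) -> float:
    """Speedup gained per working day of coding effort."""
    return speedup / effort_days


# (label, speedup over the desktop, coding effort in working days)
choices = [
    ("5X after 1 day", 5.0, 1),
    ("25X after 1 week", 25.0, 1 * WORKING_DAYS_PER_WEEK),
    ("125X after 3 weeks", 125.0, 3 * WORKING_DAYS_PER_WEEK),
]

ratios = [productivity(speedup, days) for _, speedup, days in choices]

for (label, _, _), ratio in zip(choices, ratios):
    print(f"{label}: {ratio:.1f}X per working day")
```

With these assumptions the first two choices tie at 5.0X per working day and the third comes out at about 8.3X per working day. The ratio by itself ignores calendar deadlines and whether the extra speedup is actually needed, which is exactly what the remaining questions ask about.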