Friday, August 21, 2015

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

http://www.gotw.ca/publications/concurrency-ddj.htm


The Free Lunch Is Over
A Fundamental Turn Toward Concurrency in Software

By Herb Sutter
The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency.
This article appeared in Dr. Dobb's Journal, 30(3), March 2005. A much briefer version under the title "The Concurrency Revolution" appeared in C/C++ Users Journal, 23(2), February 2005.
Update note: The CPU trends graph last updated August 2009 to include current data and show the trend continues as predicted. The rest of this article including all text is still original as first posted here in December 2004.
 
Your free lunch will soon be over. What can you do about it? What are you doing about it?
The major processor manufacturers and architectures, from Intel and AMD to Sparc and PowerPC, have run out of room with most of their traditional approaches to boosting CPU performance. Instead of driving clock speeds and straight-line instruction throughput ever higher, they are instead turning en masse to hyperthreading and multicore architectures. Both of these features are already available on chips today; in particular, multicore is available on current PowerPC and Sparc IV processors, and is coming in 2005 from Intel and AMD. Indeed, the big theme of the 2004 In-Stat/MDR Fall Processor Forum was multicore devices, as many companies showed new or updated multicore processors. Looking back, it’s not much of a stretch to call 2004 the year of multicore.
And that puts us at a fundamental turning point in software development, at least for the next few years and for applications targeting general-purpose desktop computers and low-end servers (which happens to account for the vast bulk of the dollar value of software sold today). In this article, I’ll describe the changing face of hardware, why it suddenly does matter to software, and how specifically the concurrency revolution matters to you and is going to change the way you will likely be writing software in the future.
Arguably, the free lunch has already been over for a year or two, only we’re just now noticing.

The Free Performance Lunch

There’s an interesting phenomenon that’s known as “Andy giveth, and Bill taketh away.” No matter how fast processors get, software consistently finds new ways to eat up the extra speed. Make a CPU ten times as fast, and software will usually find ten times as much to do (or, in some cases, will feel at liberty to do it ten times less efficiently). Most classes of applications have enjoyed free and regular performance gains for several decades, even without releasing new versions or doing anything special, because the CPU manufacturers (primarily) and memory and disk manufacturers (secondarily) have reliably enabled ever-newer and ever-faster mainstream systems. Clock speed isn’t the only measure of performance, or even necessarily a good one, but it’s an instructive one: We’re used to seeing 500MHz CPUs give way to 1GHz CPUs give way to 2GHz CPUs, and so on. Today we’re in the 3GHz range on mainstream computers.
The key question is: When will it end? After all, Moore’s Law predicts exponential growth, and clearly exponential growth can’t continue forever before we reach hard physical limits; light isn’t getting any faster. The growth must eventually slow down and even end. (Caveat: Yes, Moore’s Law applies principally to transistor densities, but the same kind of exponential growth has occurred in related areas such as clock speeds. There’s even faster growth in other spaces, most notably the data storage explosion, but that important trend belongs in a different article.)
If you’re a software developer, chances are that you have already been riding the “free lunch” wave of desktop computer performance. Is your application’s performance borderline for some local operations? “Not to worry,” the conventional (if suspect) wisdom goes; “tomorrow’s processors will have even more throughput, and anyway today’s applications are increasingly throttled by factors other than CPU throughput and memory speed (e.g., they’re often I/O-bound, network-bound, database-bound).” Right?
Right enough, in the past. But dead wrong for the foreseeable future.
The good news is that processors are going to continue to become more powerful. The bad news is that, at least in the short term, the growth will come mostly in directions that do not take most current applications along for their customary free ride.
Over the past 30 years, CPU designers have achieved performance gains in three main areas, the first two of which focus on straight-line execution flow:

- clock speed
- execution optimization
- cache
Increasing clock speed is about getting more cycles. Running the CPU faster more or less directly means doing the same work faster.
Optimizing execution flow is about doing more work per cycle. Today’s CPUs sport some more powerful instructions, and they perform optimizations that range from the pedestrian to the exotic, including pipelining, branch prediction, executing multiple instructions in the same clock cycle(s), and even reordering the instruction stream for out-of-order execution. These techniques are all designed to make the instructions flow better and/or execute faster, and to squeeze the most work out of each clock cycle by reducing latency and maximizing the work accomplished per clock cycle.
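To make one of these techniques tangible, here is a minimal sketch (not from the original article, and using later C++ than existed when it was written) that exposes branch prediction from ordinary code: the same loop over the same data tends to run noticeably faster once the data is sorted, because the branch becomes predictable. Whether the gap shows up on a given machine depends on how aggressively the compiler replaces the branch with a conditional move.

    // Branch prediction made visible: identical work, different predictability.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    long long sumAbove128(const std::vector<int>& v) {
        long long sum = 0;
        for (int x : v)
            if (x >= 128)          // the branch the hardware predictor must guess
                sum += x;
        return sum;
    }

    int main() {
        std::vector<int> data(1 << 24);
        std::mt19937 rng(42);
        for (int& x : data) x = rng() % 256;   // random bytes: the branch is a coin flip

        auto time_it = [&](const char* label) {
            auto start = std::chrono::steady_clock::now();
            long long s = sumAbove128(data);
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - start).count();
            std::printf("%-28s sum=%lld, %lld ms\n", label, s, (long long)ms);
        };

        time_it("shuffled (hard to predict):");
        std::sort(data.begin(), data.end());   // same values, now grouped: branch is predictable
        time_it("sorted (easy to predict):");
    }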
Chip designers are under so much pressure to deliver ever-faster CPUs that they’ll risk changing the meaning of your program, and possibly break it, in order to make it run faster
Brief aside on instruction reordering and memory models: Note that some of what I just called “optimizations” are actually far more than optimizations, in that they can change the meaning of programs and cause visible effects that can break reasonable programmer expectations. This is significant. CPU designers are generally sane and well-adjusted folks who normally wouldn’t hurt a fly, and wouldn’t think of hurting your code… normally. But in recent years they have been willing to pursue aggressive optimizations just to wring yet more speed out of each cycle, even knowing full well that these aggressive rearrangements could endanger the semantics of your code. Is this Mr. Hyde making an appearance? Not at all. That willingness is simply a clear indicator of the extreme pressure the chip designers face to deliver ever-faster CPUs; they’re under so much pressure that they’ll risk changing the meaning of your program, and possibly break it, in order to make it run faster. Two noteworthy examples in this respect are write reordering and read reordering: Allowing a processor to reorder write operations has consequences that are so surprising, and break so many programmer expectations, that the feature generally has to be turned off because it’s too difficult for programmers to reason correctly about the meaning of their programs in the presence of arbitrary write reordering. Reordering read operations can also yield surprising visible effects, but that is more commonly left enabled anyway because it isn’t quite as hard on programmers, and the demands for performance cause designers of operating systems and operating environments to compromise and choose models that place a greater burden on programmers because that is viewed as a lesser evil than giving up the optimization opportunities.
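To see concretely why write and read reordering matter to programmers, consider the classic publish-with-a-flag pattern. The sketch below is not from the original article and uses C++11 atomics, which were standardized years after it was written; the point is that with plain, non-atomic variables the producer's two writes could become visible in the opposite order, letting the consumer see the flag raised while the data is still stale, whereas the release/acquire pairing forbids exactly that reordering.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;                       // plain payload
    std::atomic<bool> ready{false};     // flag that publishes the payload

    void producer() {
        data = 42;                                      // (1) write the payload
        ready.store(true, std::memory_order_release);   // (2) no earlier write may sink below this
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // (3) no later read may hoist above this
            ;                                           // spin until the flag is seen
        assert(data == 42);                             // holds only because of the release/acquire pair
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }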
Finally, increasing the size of on-chip cache is about staying away from RAM. Main memory continues to be so much slower than the CPU that it makes sense to put the data closer to the processor—and you can’t get much closer than being right on the die. On-die cache sizes have soared, and today most major chip vendors will sell you CPUs that have 2MB and more of on-board L2 cache. (Of these three major historical approaches to boosting CPU performance, increasing cache is the only one that will continue in the near term. I’ll talk a little more about the importance of cache later on.)
Okay. So what does this mean?
A fundamentally important thing to recognize about this list is that all of these areas are concurrency-agnostic. Speedups in any of these areas will directly lead to speedups in sequential (nonparallel, single-threaded, single-process) applications, as well as applications that do make use of concurrency. That’s important, because the vast majority of today’s applications are single-threaded, for good reasons that I’ll get into further below.
Of course, compilers have had to keep up; sometimes you need to recompile your application, and target a specific minimum level of CPU, in order to benefit from new instructions (e.g., MMX, SSE) and some new CPU features and characteristics. But, by and large, even old applications have always run significantly faster—even without being recompiled to take advantage of all the new instructions and features offered by the latest CPUs.
That world was a nice place to be. Unfortunately, it has already disappeared.

Obstacles, and Why You Don’t Have 10GHz Today

CPU performance growth as we have known it hit a wall two years ago. Most people have only recently started to notice.
You can get similar graphs for other chips, but I’m going to use Intel data here. Figure 1 graphs the history of Intel chip introductions by clock speed and number of transistors. The number of transistors continues to climb, at least for now. Clock speed, however, is a different story.
 
Figure 1: Intel CPU Introductions (graph updated August 2009; article text original from December 2004)
 
Around the beginning of 2003, you’ll note a disturbing sharp turn in the previous trend toward ever-faster CPU clock speeds. I’ve added lines to show the limit trends in maximum clock speed; instead of continuing on the previous path, as indicated by the thin dotted line, there is a sharp flattening. It has become harder and harder to exploit higher clock speeds due to not just one but several physical issues, notably heat (too much of it and too hard to dissipate), power consumption (too high), and current leakage problems.
Quick: What’s the clock speed on the CPU(s) in your current workstation? Are you running at 10GHz? On Intel chips, we reached 2GHz a long time ago (August 2001), and according to CPU trends before 2003, now in early 2005 we should have the first 10GHz Pentium-family chips. A quick look around shows that, well, actually, we don’t. What’s more, such chips are not even on the horizon—we have no good idea at all about when we might see them appear.
Well, then, what about 4GHz? We’re at 3.4GHz already—surely 4GHz can’t be far away? Alas, even 4GHz seems to be remote indeed. In mid-2004, as you probably know, Intel first delayed its planned introduction of a 4GHz chip until 2005, and then in fall 2004 it officially abandoned its 4GHz plans entirely. As of this writing, Intel is planning to ramp up a little further to 3.73GHz in early 2005 (already included in Figure 1 as the upper-right-most dot), but the clock race really is over, at least for now; Intel’s and most processor vendors’ future lies elsewhere as chip companies aggressively pursue the same new multicore directions.
We’ll probably see 4GHz CPUs in our mainstream desktop machines someday, but it won’t be in 2005. Sure, Intel has samples of their chips running at even higher speeds in the lab—but only by heroic efforts, such as attaching hideously impractical quantities of cooling equipment. You won’t have that kind of cooling hardware in your office any day soon, let alone on your lap while computing on the plane.

TANSTAAFL: Moore’s Law and the Next Generation(s)

“There ain’t no such thing as a free lunch.” —R. A. Heinlein, The Moon Is a Harsh Mistress
Does this mean Moore’s Law is over? Interestingly, the answer in general seems to be no. Of course, like all exponential progressions, Moore’s Law must end someday, but it does not seem to be in danger for a few more years yet. Despite the wall that chip engineers have hit in juicing up raw clock cycles, transistor counts continue to explode and it seems CPUs will continue to follow Moore’s Law-like throughput gains for some years to come.

Myths and Realities: 2 x 3GHz < 6 GHz

So a dual-core CPU that combines two 3GHz cores practically offers 6GHz of processing power. Right?
Wrong. Even having two threads running on two physical processors doesn’t mean getting two times the performance. Similarly, most multi-threaded applications won’t run twice as fast on a dual-core box. They should run faster than on a single-core CPU; the performance gain just isn’t linear, that’s all.
Why not? First, there is coordination overhead between the cores to ensure cache coherency (a consistent view of cache, and of main memory) and to perform other handshaking. Today, a two- or four-processor machine isn’t really two or four times as fast as a single CPU even for multi-threaded applications. The problem remains essentially the same even when the CPUs in question sit on the same die.
Second, unless the two cores are running different processes, or different threads of a single process that are well-written to run independently and almost never wait for each other, they won’t be well utilized. (Despite this, I will speculate that today’s single-threaded applications as actually used in the field could actually see a performance boost for most users by going to a dual-core chip, not because the extra core is actually doing anything useful, but because it is running the adware and spyware that infest many users’ systems and are otherwise slowing down the single CPU that user has today. I leave it up to you to decide whether adding a CPU to run your spyware is the best solution to that problem.)
If you’re running a single-threaded application, then the application can only make use of one core. There should be some speedup as the operating system and the application can run on separate cores, but typically the OS isn’t going to be maxing out the CPU anyway so one of the cores will be mostly idle. (Again, the spyware can share the OS’s core most of the time.)
The key difference, which is the heart of this article, is that the performance gains are going to be accomplished in fundamentally different ways for at least the next couple of processor generations. And most current applications will no longer benefit from the free ride without significant redesign.
For the near-term future, meaning for the next few years, the performance gains in new chips will be fueled by three main approaches, only one of which is the same as in the past. The near-term future performance growth drivers are:
- hyperthreading
- multicore
- cache
Hyperthreading is about running two or more threads in parallel inside a single CPU. Hyperthreaded CPUs are already available today, and they do allow some instructions to run in parallel. A limiting factor, however, is that although a hyper-threaded CPU has some extra hardware including extra registers, it still has just one cache, one integer math unit, one FPU, and in general just one each of most basic CPU features. Hyperthreading is sometimes cited as offering a 5% to 15% performance boost for reasonably well-written multi-threaded applications, or even as much as 40% under ideal conditions for carefully written multi-threaded applications. That’s good, but it’s hardly double, and it doesn’t help single-threaded applications.
Multicore is about running two or more actual CPUs on one chip. Some chips, including Sparc and PowerPC, have multicore versions available already. The initial Intel and AMD designs, both due in 2005, vary in their level of integration but are functionally similar. AMD’s seems to have some initial performance design advantages, such as better integration of support functions on the same die, whereas Intel’s initial entry basically just glues together two Xeons on a single die. The performance gains should initially be about the same as having a true dual-CPU system (only the system will be cheaper because the motherboard doesn’t have to have two sockets and associated “glue” chippery), which means something less than double the speed even in the ideal case, and just like today it will boost reasonably well-written multi-threaded applications. Not single-threaded ones.
Finally, on-die cache sizes can be expected to continue to grow, at least in the near term. Of these three areas, only this one will broadly benefit most existing applications. The continuing growth in on-die cache sizes is an incredibly important and highly applicable benefit for many applications, simply because space is speed. Accessing main memory is expensive, and you really don’t want to touch RAM if you can help it. On today’s systems, a cache miss that goes out to main memory often costs 10 to 50 times as much as getting the information from the cache; this, incidentally, continues to surprise people because we all think of memory as fast, and it is fast compared to disks and networks, but not compared to on-board cache, which runs at faster speeds. If an application’s working set fits into cache, we’re golden, and if it doesn’t, we’re not. That is why increased cache sizes will save some existing applications and breathe life into them for a few more years without requiring significant redesign: As existing applications manipulate more and more data, and as they are incrementally updated to include more code for new features, performance-sensitive operations need to continue to fit into cache. As the Depression-era old-timers will be quick to remind you, “Cache is king.”
(Aside: Here’s an anecdote to demonstrate “space is speed” that recently hit my compiler team. The compiler uses the same source base for the 32-bit and 64-bit compilers; the code is just compiled as either a 32-bit process or a 64-bit one. The 64-bit compiler gained a great deal of baseline performance by running on a 64-bit CPU, principally because the 64-bit CPU had many more registers to work with and had other code performance features. All well and good. But what about data? Going to 64 bits didn’t change the size of most of the data in memory, except that of course pointers in particular were now twice the size they were before. As it happens, our compiler uses pointers much more heavily in its internal data structures than most other kinds of applications ever would. Because pointers were now 8 bytes instead of 4 bytes, a pure data size increase, we saw a significant increase in the 64-bit compiler’s working set. That bigger working set caused a performance penalty that almost exactly offset the code execution performance increase we’d gained from going to the faster processor with more registers. As of this writing, the 64-bit compiler runs at the same speed as the 32-bit compiler, even though the source base is the same for both and the 64-bit processor offers better raw processing throughput. Space is speed.)
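Here is a minimal sketch (not from the original article) that makes "space is speed" measurable: both calls perform exactly the same additions, but striding through memory wastes most of each cache line and defeats the prefetcher, so the second run is typically several times slower on commodity hardware.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;           // 64 MB of ints: far larger than any L2 cache
        std::vector<int> v(n, 1);

        auto time_sum = [&](std::size_t stride) {
            long long sum = 0;
            auto start = std::chrono::steady_clock::now();
            for (std::size_t s = 0; s < stride; ++s)         // still touch every element...
                for (std::size_t i = s; i < n; i += stride)  // ...but in stride-sized jumps
                    sum += v[i];
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - start).count();
            std::printf("stride %2zu: sum=%lld, %lld ms\n", stride, sum, (long long)ms);
        };

        time_sum(1);    // sequential: cache lines and hardware prefetch are fully used
        time_sum(16);   // 64-byte jumps: roughly one cache miss per element
    }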
But cache is it. Hyperthreading and multicore CPUs will have nearly no impact on most current applications.
So what does this change in the hardware mean for the way we write software? By now you’ve probably noticed the basic answer, so let’s consider it and its consequences.

What This Means For Software: The Next Revolution

In the 1990s, we learned to grok objects. The revolution in mainstream software development from structured programming to object-oriented programming was the greatest such change in the past 20 years, and arguably in the past 30 years. There have been other changes, including the most recent (and genuinely interesting) naissance of web services, but nothing that most of us have seen during our careers has been as fundamental and as far-reaching a change in the way we write software as the object revolution.
Until now.
Starting today, the performance lunch isn’t free any more. Sure, there will continue to be generally applicable performance gains that everyone can pick up, thanks mainly to cache size improvements. But if you want your application to benefit from the continued exponential throughput advances in new processors, it will need to be a well-written concurrent (usually multithreaded) application. And that’s easier said than done, because not all problems are inherently parallelizable and because concurrent programming is hard.
I can hear the howls of protest: “Concurrency? That’s not news! People are already writing concurrent applications.” That’s true. Of a small fraction of developers.
Remember that people have been doing object-oriented programming since at least the days of Simula in the late 1960s. But OO didn’t become a revolution, and dominant in the mainstream, until the 1990s. Why then? The reason the revolution happened was primarily that our industry was driven by requirements to write larger and larger systems that solved larger and larger problems and exploited the greater and greater CPU and storage resources that were becoming available. OOP’s strengths in abstraction and dependency management made it a necessity for achieving large-scale software development that is economical, reliable, and repeatable.
Concurrency is the next major revolution in how we write software
Similarly, we’ve been doing concurrent programming since those same dark ages, writing coroutines and monitors and similar jazzy stuff. And for the past decade or so we’ve witnessed incrementally more and more programmers writing concurrent (multi-threaded, multi-process) systems. But an actual revolution marked by a major turning point toward concurrency has been slow to materialize. Today the vast majority of applications are single-threaded, and for good reasons that I’ll summarize in the next section.
By the way, on the matter of hype: People have always been quick to announce “the next software development revolution,” usually about their own brand-new technology. Don’t believe it. New technologies are often genuinely interesting and sometimes beneficial, but the biggest revolutions in the way we write software generally come from technologies that have already been around for some years and have already experienced gradual growth before they transition to explosive growth. This is necessary: You can only base a software development revolution on a technology that’s mature enough to build on (including having solid vendor and tool support), and it generally takes any new software technology at least seven years before it’s solid enough to be broadly usable without performance cliffs and other gotchas. As a result, true software development revolutions like OO happen around technologies that have already been undergoing refinement for years, often decades. Even in Hollywood, most genuine “overnight successes” have really been performing for many years before their big break.
Concurrency is the next major revolution in how we write software. Different experts still have different opinions on whether it will be bigger than OO, but that kind of conversation is best left to pundits. For technologists, the interesting thing is that concurrency is of the same order as OO both in the (expected) scale of the revolution and in the complexity and learning curve of the technology.

Benefits and Costs of Concurrency

There are two major reasons for which concurrency, especially multithreading, is already used in mainstream software. The first is to logically separate naturally independent control flows; for example, in a database replication server I designed it was natural to put each replication session on its own thread, because each session worked completely independently of any others that might be active (as long as they weren’t working on the same database row). The second and less common reason to write concurrent code in the past has been for performance, either to scalably take advantage of multiple physical CPUs or to easily take advantage of latency in other parts of the application; in my database replication server, this factor applied as well and the separate threads were able to scale well on multiple CPUs as our server handled more and more concurrent replication sessions with many other servers.
There are, however, real costs to concurrency. Some of the obvious costs are actually relatively unimportant. For example, yes, locks can be expensive to acquire, but when used judiciously and properly you gain much more from the concurrent execution than you lose on the synchronization, if you can find a sensible way to parallelize the operation and minimize or eliminate shared state.
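A minimal sketch of that last point, using C++11 threads (which postdate this article): each worker gets its own slice of the input and its own accumulator, so the threads never touch a lock or a shared counter until the trivially cheap combine step at the end.

    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(10000000, 1);
        const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<long long> partial(nthreads, 0);   // one accumulator per thread, never shared
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < nthreads; ++t) {
            workers.emplace_back([&, t] {
                std::size_t begin = data.size() * t / nthreads;
                std::size_t end   = data.size() * (t + 1) / nthreads;
                partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
            });
        }
        for (auto& w : workers) w.join();

        long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
        std::printf("total = %lld\n", total);   // combine step: the only place results meet
    }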
Perhaps the second-greatest cost of concurrency is that not all applications are amenable to parallelization. I’ll say more about this later on.
Probably the greatest cost of concurrency is that concurrency really is hard: The programming model, meaning the model in the programmer’s head that he needs to reason reliably about his program, is much harder than it is for sequential control flow.
Everybody who learns concurrency thinks they understand it, ends up finding mysterious races they thought weren’t possible, and discovers that they didn’t actually understand it yet after all. As the developer learns to reason about concurrency, they find that usually those races can be caught by reasonable in-house testing, and they reach a new plateau of knowledge and comfort. What usually doesn’t get caught in testing, however, except in shops that understand why and how to do real stress testing, is those latent concurrency bugs that surface only on true multiprocessor systems, where the threads aren’t just being switched around on a single processor but where they really do execute truly simultaneously and thus expose new classes of errors. This is the next jolt for people who thought that surely now they know how to write concurrent code: I’ve come across many teams whose application worked fine even under heavy and extended stress testing, and ran perfectly at many customer sites, until the day that a customer actually had a real multiprocessor machine and then deeply mysterious races and corruptions started to manifest intermittently. In the context of today’s CPU landscape, then, redesigning your application to run multithreaded on a multicore machine is a little like learning to swim by jumping into the deep end—going straight to the least forgiving, truly parallel environment that is most likely to expose the things you got wrong. Even when you have a team that can reliably write safe concurrent code, there are other pitfalls; for example, concurrent code that is completely safe but isn’t any faster than it was on a single-core machine, typically because the threads aren’t independent enough and share a dependency on a single resource which re-serializes the program’s execution. This stuff gets pretty subtle.
The vast majority of programmers today don’t grok concurrency, just as the vast majority of programmers 15 years ago didn’t yet grok objects
Just as it is a leap for a structured programmer to learn OO (what’s an object? what’s a virtual function? how should I use inheritance? and beyond the “whats” and “hows,” why are the correct design practices actually correct?), it’s a leap of about the same magnitude for a sequential programmer to learn concurrency (what’s a race? what’s a deadlock? how can it come up, and how do I avoid it? what constructs actually serialize the program that I thought was parallel? how is the message queue my friend? and beyond the “whats” and “hows,” why are the correct design practices actually correct?).
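For anyone who has not yet met their first race, here is the canonical one, sketched with modern C++ threads (not from the original article): ++counter looks like one step but is a read-modify-write, so two threads can read the same old value and one increment is silently lost. The fix is a mutex or an atomic counter; the instructive part is that this broken version often appears to work on a lightly loaded single processor and only falls apart under real parallelism.

    #include <cstdio>
    #include <thread>

    int counter = 0;                 // shared and unsynchronized: this is the bug

    void hammer() {
        for (int i = 0; i < 1000000; ++i)
            ++counter;               // two threads can interleave here and lose updates
    }

    int main() {
        std::thread a(hammer), b(hammer);
        a.join();
        b.join();
        std::printf("counter = %d (expected 2000000)\n", counter);
    }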
The vast majority of programmers today don’t grok concurrency, just as the vast majority of programmers 15 years ago didn’t yet grok objects. But the concurrent programming model is learnable, particularly if we stick to message- and lock-based programming, and once grokked it isn’t that much harder than OO and hopefully can become just as natural. Just be ready and allow for the investment in training and time, for you and for your team.
(I deliberately limit the above to message- and lock-based concurrent programming models. There is also lock-free programming, supported most directly at the language level in Java 5 and in at least one popular C++ compiler. But concurrent lock-free programming is known to be very much harder for programmers to understand and reason about than even concurrent lock-based programming. Most of the time, only systems and library writers should have to understand lock-free programming, although virtually everybody should be able to take advantage of the lock-free systems and libraries those people produce. Frankly, even lock-based programming is hazardous.)
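To give a flavor of what lock-free code looks like, here is a minimal sketch (not from the original article; it uses C++11 atomics, which arrived well after it was written): several threads maintain a shared maximum with a compare-and-swap retry loop instead of a mutex. Even this tiny example hints at why the style is best left to library writers; the retry loop and its failure cases take real care to reason about.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<int> global_max{0};

    void observe(int value) {
        int current = global_max.load();
        // Retry until we either install our value or see one that is already larger.
        while (value > current &&
               !global_max.compare_exchange_weak(current, value)) {
            // compare_exchange_weak reloads `current` on failure, so just loop.
        }
    }

    int main() {
        std::vector<std::thread> ts;
        for (int t = 0; t < 4; ++t)
            ts.emplace_back([t] { for (int i = 0; i < 100000; ++i) observe(i * 4 + t); });
        for (auto& th : ts) th.join();
        std::printf("max = %d (expected 399999)\n", global_max.load());
    }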

What It Means For Us

Okay, back to what it means for us.
1. The clear primary consequence we’ve already covered is that applications will increasingly need to be concurrent if they want to fully exploit CPU throughput gains that have now started becoming available and will continue to materialize over the next several years. For example, Intel is talking about someday producing 100-core chips; a single-threaded application can exploit at most 1/100 of such a chip’s potential throughput. “Oh, performance doesn’t matter so much, computers just keep getting faster” has always been a naïve statement to be viewed with suspicion, and for the near future it will almost always be simply wrong.
Applications will increasingly need to be concurrent if they want to fully exploit continuing exponential CPU throughput gains
Efficiency and performance optimization will get more, not less, important
Now, not all applications (or, more precisely, important operations of an application) are amenable to parallelization. True, some problems, such as compilation, are almost ideally parallelizable. But others aren’t; the usual counterexample here is that just because it takes one woman nine months to produce a baby doesn’t imply that nine women could produce one baby in one month. You’ve probably come across that analogy before. But did you notice the problem with leaving the analogy at that? Here’s the trick question to ask the next person who uses it on you: Can you conclude from this that the Human Baby Problem is inherently not amenable to parallelization? Usually people relating this analogy err in quickly concluding that it demonstrates an inherently nonparallel problem, but that’s actually not necessarily correct at all. It is indeed an inherently nonparallel problem if the goal is to produce one child. It is actually an ideally parallelizable problem if the goal is to produce many children! Knowing the real goals can make all the difference. This basic goal-oriented principle is something to keep in mind when considering whether and how to parallelize your software.
2. Perhaps a less obvious consequence is that applications are likely to become increasingly CPU-bound. Of course, not every application operation will be CPU-bound, and even those that will be affected won’t become CPU-bound overnight if they aren’t already, but we seem to have reached the end of the “applications are increasingly I/O-bound or network-bound or database-bound” trend, because performance in those areas is still improving rapidly (gigabit WiFi, anyone?) while traditional CPU performance-enhancing techniques have maxed out. Consider: We’re stopping in the 3GHz range for now. Therefore single-threaded programs are likely not to get much faster any more for now except for benefits from further cache size growth (which is the main good news). Other gains are likely to be incremental and much smaller than we’ve been used to seeing in the past, for example as chip designers find new ways to keep pipelines full and avoid stalls, which are areas where the low-hanging fruit has already been harvested. The demand for new application features is unlikely to abate, and even more so the demand to handle vastly growing quantities of application data is unlikely to stop accelerating. As we continue to demand that programs do more, they will increasingly often find that they run out of CPU to do it unless they can code for concurrency.
There are two ways to deal with this sea change toward concurrency. One is to redesign your applications for concurrency, as above. The other is to be frugal, by writing code that is more efficient and less wasteful. This leads to the third interesting consequence:
3. Efficiency and performance optimization will get more, not less, important. Those languages that already lend themselves to heavy optimization will find new life; those that don’t will need to find ways to compete and become more efficient and optimizable. Expect long-term increased demand for performance-oriented languages and systems.
4. Finally, programming languages and systems will increasingly be forced to deal well with concurrency. The Java language has included support for concurrency since its beginning, although mistakes were made that later had to be corrected over several releases in order to do concurrent programming more correctly and efficiently. The C++ language has long been used to write heavy-duty multithreaded systems well, but it has no standardized support for concurrency at all (the ISO C++ standard doesn’t even mention threads, and does so intentionally), and so typically the concurrency is of necessity accomplished by using nonportable platform-specific concurrency features and libraries. (It’s also often incomplete; for example, static variables must be initialized only once, which typically requires that the compiler wrap them with a lock, but many C++ implementations do not generate the lock.) Finally, there are a few concurrency standards, including pthreads and OpenMP, and some of these support implicit as well as explicit parallelization. Having the compiler look at your single-threaded program and automatically figure out how to parallelize it implicitly is fine and dandy, but those automatic transformation tools are limited and don’t yield nearly the gains of explicit concurrency control that you code yourself. The mainstream state of the art revolves around lock-based programming, which is subtle and hazardous. We desperately need a higher-level programming model for concurrency than languages offer today; I'll have more to say about that soon.
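The static-initialization point deserves a concrete sketch (not from the original article). The function-local static below needs a hidden "has the constructor already run?" check; at the time this was written nothing required that check to be thread-safe, which is exactly the kind of incompleteness being described. (The C++11 standard later required implementations to make this construct thread-safe.)

    #include <cstdio>

    struct Config {
        Config() { std::puts("expensive one-time initialization"); }
        int value = 42;
    };

    Config& config() {
        // Two threads entering here for the first time could, on an implementation
        // that does not wrap the initialization in a lock, both run the constructor
        // or one could observe a half-constructed object.
        static Config instance;
        return instance;
    }

    int main() {
        std::printf("%d\n", config().value);
    }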

Conclusion

If you haven’t done so already, now is the time to take a hard look at the design of your application, determine what operations are CPU-sensitive now or are likely to become so soon, and identify how those places could benefit from concurrency. Now is also the time for you and your team to grok concurrent programming’s requirements, pitfalls, styles, and idioms.
A few rare classes of applications are naturally parallelizable, but most aren’t. Even when you know exactly where you’re CPU-bound, you may well find it difficult to figure out how to parallelize those operations; all the more reason to start thinking about it now. Implicitly parallelizing compilers can help a little, but don’t expect much; they can’t do nearly as good a job of parallelizing your sequential program as you could do by turning it into an explicitly parallel and threaded version.
Thanks to continued cache growth and probably a few more incremental straight-line control flow optimizations, the free lunch will continue a little while longer; but starting today the buffet will only be serving that one entrée and that one dessert. The filet mignon of throughput gains is still on the menu, but now it costs extra—extra development effort, extra code complexity, and extra testing effort. The good news is that for many classes of applications the extra effort will be worthwhile, because concurrency will let them fully exploit the continuing exponential gains in processor throughput.

Tuesday, August 11, 2015

Database versioning best practices

http://enterprisecraftsmanship.com/2015/08/10/database-versioning-best-practices/

Database versioning best practices

By Vladimir Khorikov
Keeping track of your application's database is not an easy task. Database schemas tend to get out of sync across environments, and one of the databases may be missing some crucial piece of data. Such occasions can be irritating, especially when they are caught in production.
The situation gets worse when you develop redistributable software. In this case, each of your clients has their own database instance whose structure may differ from others’. In such a project, keeping track of your clients’ databases can become a nightmare.
Let’s look at the database versioning best practices that help us deal with this problem.

Database versioning: the problem

When you are a single programmer working on a project that is not yet shipped to production, there is no such problem as database versioning. You just change your DB schema the way you want and it always works.
Problems arise when your software starts operating in production or a new team member joins you to work on the database-related parts of your project. As soon as you have more than one database instance, they start getting out of sync. Even with a single instance, it takes a significant amount of time to synchronize the changes when more than one developer works with it.
Does this sound familiar? I bet you have been in such situations, probably more than once. I certainly have.

Database versioning best practices

Fortunately, we are not alone. There is plenty of material written on the topic, as well as software aimed at solving this problem. I recommend this book if you want to dive deeper into the subject; it is the definitive guide to evolving your database along with the code that uses it.
Alright, so what are these database versioning best practices?
Best practice #1: we need to treat the application database and the reference data in it as regular code. That means we should store both its schema and the reference data in a source control system.
Note that this rule covers not only the schema of the database but also the reference data in it. Reference data is the data that is mandatory to run the application. For example, if you have a dictionary of all possible customer types whose existence your application relies on, you should store it in the source control system as well.
Best practice #2: we have to store every change in the database schema and in the reference data explicitly. This means that for every modification we make we should create a separate SQL script with the changes. If the modification affects both the schema and the reference data, they should be reflected in a single script.
Adhering to this rule is a vital part of building a successful database versioning system. Many projects have their database schema stored in source control, but often it is just a snapshot of the latest database version, whatever that happens to be. All the changes in it are tracked by the source control system itself; they are not stored explicitly. Moreover, the changes in the reference data are often not tracked at all.
Tools such as the Visual Studio database project encourage that approach and urge programmers to use auto-generated upgrade scripts for schema updates. While this may work well in small projects, in larger projects tracking database changes using auto-generated scripts becomes a burden. We will talk about the Visual Studio database project and other available tools in the next post.
Best practice #3: every SQL script file must be immutable after it is deployed to production or staging environment.
The whole point of storing the changes in separate files is to be able to track each of them. When we modify existing SQL scripts, we lose all the benefits the database versioning best practices provide. Keep the script files unchanged after their deployment. If you need to roll back changes that have already shipped, create a separate script for that.
Best practice #4: all changes to the database's schema and reference data have to be applied through the scripts. None of them can be applied manually.
If we modify the database passing over our scripts, the whole idea of database versioning becomes worthless, so we need to make sure the changes are made only via the SQL scripts we create.
Best practice #5: every developer in the team should have their own database instance.
Often, teams start with a single database in the development environment. That works well at the beginning, but when the database grows large enough, simultaneous modifications to it become harder and harder until, at some point, the arrangement stops working at all.
Changes programmers make are often incompatible so it’s a good idea for each programmer to have a separate DB instance to avoid such collisions. If developers do modify related pieces of the DB schema simultaneously, such conflicts can be resolved using a source control system, just like the conflicts in C#/Java/etc code.
Moreover, if you have several branches of your code base, you might also want to create a separate DB instance for each of them, depending on how different the databases in these branches are.
Best practice #6: the database version should be stored in the database itself. I usually create a separate table named Settings and keep the version there. Don't use complex notations like "x.y.z" for the version number; just use a single integer.
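To tie the six practices together, here is a minimal sketch of the upgrade loop they imply, written in C++ purely for concreteness; the database helpers are hypothetical stand-ins for whatever access layer you actually use. It reads the single-integer version from the Settings table, applies every numbered script above it in order, and records each step.

    #include <cstdio>
    #include <filesystem>
    #include <map>
    #include <string>

    namespace fs = std::filesystem;

    // Hypothetical stand-ins for your real database access layer.
    int  queryCurrentVersion()         { return 3; /* e.g. SELECT Version FROM Settings */ }
    void runSqlFile(const fs::path& p) { std::printf("applying %s\n", p.string().c_str()); }
    void setVersion(int v)             { std::printf("UPDATE Settings SET Version = %d\n", v); }

    // Apply every numbered script above the database's current version, in order.
    void upgradeDatabase(const fs::path& scriptsDir) {
        if (!fs::exists(scriptsDir)) return;
        int current = queryCurrentVersion();

        std::map<int, fs::path> pending;   // std::map keeps scripts sorted by version number
        for (const auto& entry : fs::directory_iterator(scriptsDir)) {
            // Scripts are named by version, e.g. "0004_add_customer_type.sql".
            int version = std::stoi(entry.path().filename().string());
            if (version > current)
                pending[version] = entry.path();
        }

        for (const auto& [version, script] : pending) {
            runSqlFile(script);    // each immutable script is applied exactly once...
            setVersion(version);   // ...and the Settings table records how far we got
        }
    }

    int main() { upgradeDatabase("migrations"); }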

What are the benefits of such an approach?

So what benefits do these database versioning best practices give us?
The first and most important advantage is that when we use this approach, we no longer have problems with database schema mismatches. Automatic upgrades to the latest version solve them completely, provided, of course, that we fully adhere to the rules described above.
This is especially useful when you don’t have a single production database, but every client has their own DB instance. Managing DB versions in such circumstances might become hell if you don’t employ proper versioning techniques.
Another gain these best practices provide is high cohesion of the database changes. It means that every notable modification in the schema and the reference data is reflected in a single place and not spread across the application.
The SQL upgrade scripts also provide high cohesion in the sense that they contain every DB change required for a feature, so it's easy to understand what modifications were made in the database in order to unlock a particular piece of functionality. Keeping related schema and data changes together in a single file also helps a lot.
The approach described in this post is applicable even if you didn’t follow it from the very beginning. To put it into practice, you just need to create an initial script with the database schema you have right now in production and start changing it incrementally from that moment. The current version should become version #1 from which you can move further using the techniques we discussed above.

Summary

In this article, we looked at database versioning best practices. In the next post, we'll see what software is at our disposal. I'll also show a lightweight tool I use for applying SQL upgrade scripts.


Sunday, August 2, 2015

When TDD Doesn't Matter

https://www.facebook.com/notes/kent-beck/when-tdd-doesnt-matter/797644973601702

A group of students apologized for "not using TDD" today. It was like they were apologizing to their dentist for not flossing--they should do it but just weren't motivated to take the trouble. In that moment I realized that TDD is a small part of a large and complicated space. As long as the students mindfully chose where they wanted to be in the space, I didn't really care where they ended up.

Here's the explanation I invented. I use three dimensions to characterize getting feedback for programming decisions. I think these three are the most important ones to consider, plus visualizing in three dimensions is (just barely) possible for me. Here they are:



One dimension to choose when seeking feedback is how much scope you are interested in. If you try to take in absolutely every potential effect of a programming decision, you would have to analyze its global economic and social impact. Scope trades off between the tactical utility of feedback and leverage. Knowing you have a syntax error means you definitely need to make changes; knowing that there might be adverse effects to society as a whole because of a decision doesn't help minute-to-minute, even if it's very important.



The second dimension characterizing programming feedback is how clear you want to be about the consequences of your decisions. Sometimes you just want to wait and see what happens, sometimes you want to predict up front exactly what you are going to see and if you see anything different at all you want to know about it unambiguously. Clarity trades off between the effort needed to specify expectations and the information produced when you have actual data to compare to the expectations.

[Diagram: Scope x Clarity]



For TDD, I chose to get feedback at the next level above the compiler. The compiler tells you whether it's worthwhile trying to run a program. The tests (potentially) provide much more feedback about whether a program is "good". A secondary effect of writing tests is that you can double-check your programming work. If you get to the same answer two different ways, you can have a fair amount of confidence that the code behaves as expected. Feedback with larger scope takes more work to gather, so test-level feedback is a reasonable tradeoff.

On the clarity axis, for TDD I chose the full monty--binary feedback, red or green, good or trash. Not all feedback can be cast in binary terms. Reduced engagement in one part of a program can be compensated for by increased engagement in another part. Analyzing subtle tradeoffs takes time, though, and I was looking for feedback that wouldn't require much effort to analyze, to avoid interrupting programming flow. The combination of these two choices--complete clarity and test-level scope--defines what is conventionally called unit and integration tests:





The final dimension along which to choose feedback is frequency. Fortunately this can be measured in a tidy linear way (even if the effects of delay are non-linear) from years to seconds (and milliseconds if you listen to Gary Bernhardt, which I do).

TDD = Frequency:seconds * Scope:tests * Clarity:binary



TDD is a little box in this big space of feedback, the combination of expressing binary expectations, expressing them as tests, and receiving feedback every few seconds about whether those expectations are being met. This seems to me to be a sweet spot in the feedback space. You don't get perfect information, but you get pretty good information at a reasonable cost and quickly enough that you can a) fix problems quickly and b) learn not to make the same mistake again.

Consequences

Expressing feedback as a space suggests experiments. What if you relaxed the frequency dimension? Would the additional scope you could cover provide enough value to make up for the reduced timeliness? What new mistakes could you catch if you reduced clarity and increased scope? How frequently should such feedback be considered? If you increase the modularity of the system, how much could you increase frequency? How long would it take for the investment in design to be paid for (if ever)?

The feedback space illustrates one of the dangers of feedback, putting the response loop inside the measurement loop. This happens when businesses manage quarter-to-quarter even though the effects of decisions aren't known for years. Technically, if you got feedback about server load every four minutes, you would be nuts to change server configuration every minute. You're just going to cause oscillation. We can label parts of the feedback space "here abide dragons".

Now I can explain when TDD doesn't matter. When the students understand the shape of feedback space, when they understand the tradeoffs involved in moving along each dimension, and when they understand the interaction of the dimensions, then I don't care if they "do" TDD or not. I'd rather focus on teaching them the principles than policing whether they are aping one particular ritual. That's when TDD doesn't matter.


History's Worst Software Bugs

http://archive.wired.com/software/coolapps/news/2005/11/69355?currentPage=all

History's Worst Software Bugs

Simson Garfinkel, 11.08.05
Last month automaker Toyota announced a recall of 160,000 of its Prius hybrid vehicles following reports of vehicle warning lights illuminating for no reason, and cars' gasoline engines stalling unexpectedly. But unlike the large-scale auto recalls of years past, the root of the Prius issue wasn't a hardware problem -- it was a programming error in the smart car's embedded code. The Prius had a software bug.
With that recall, the Prius joined the ranks of the buggy computer -- a club that began in 1945 when engineers found a moth in Panel F, Relay #70 of the Harvard Mark II system. The computer was running a test of its multiplier and adder when the engineers noticed something was wrong. The moth was trapped, removed and taped into the computer's logbook with the words: "first actual case of a bug being found."
Sixty years later, computer bugs are still with us, and show no sign of going extinct. As the line between software and hardware blurs, coding errors are increasingly playing tricks on our daily lives. Bugs don't just inhabit our operating systems and applications -- today they lurk within our cell phones and our pacemakers, our power plants and medical equipment. And now, in our cars.
But which are the worst?
It's all too easy to come up with a list of bugs that have wreaked havoc. It's harder to rate their severity. Which is worse -- a security vulnerability that's exploited by a computer worm to shut down the internet for a few days or a typo that triggers a day-long crash of the nation's phone system? The answer depends on whether you want to make a phone call or check your e-mail.
Many people believe the worst bugs are those that cause fatalities. To be sure, there haven't been many, but cases like the Therac-25 are widely seen as warnings against the widespread deployment of software in safety critical applications. Experts who study such systems, though, warn that even though the software might kill a few people, focusing on these fatalities risks inhibiting the migration of technology into areas where smarter processing is sorely needed. In the end, they say, the lack of software might kill more people than the inevitable bugs.
What seems certain is that bugs are here to stay. Here, in chronological order, is the Wired News list of the 10 worst software bugs of all time … so far.
July 28, 1962 -- Mariner I space probe. A bug in the flight software for the Mariner 1 causes the rocket to divert from its intended path on launch. Mission control destroys the rocket over the Atlantic Ocean. The investigation into the accident discovers that a formula written on paper in pencil was improperly transcribed into computer code, causing the computer to miscalculate the rocket's trajectory.
1982 -- Soviet gas pipeline. Operatives working for the Central Intelligence Agency allegedly plant a bug in a Canadian computer system purchased to control the trans-Siberian gas pipeline. The Soviets had obtained the system as part of a wide-ranging effort to covertly purchase or steal sensitive U.S. technology. The CIA reportedly found out about the program and decided to make it backfire with equipment that would pass Soviet inspection and then fail once in operation. The resulting event is reportedly the largest non-nuclear explosion in the planet's history.
1985-1987 -- Therac-25 medical accelerator. A radiation therapy device malfunctions and delivers lethal radiation doses at several medical facilities. Based upon a previous design, the Therac-25 was an "improved" therapy system that could deliver two different kinds of radiation: either a low-power electron beam (beta particles) or X-rays. The Therac-25's X-rays were generated by smashing high-power electrons into a metal target positioned between the electron gun and the patient. A second "improvement" was the replacement of the older Therac-20's electromechanical safety interlocks with software control, a decision made because software was perceived to be more reliable.
What engineers didn't know was that both the 20 and the 25 were built upon an operating system that had been kludged together by a programmer with no formal training. Because of a subtle bug called a "race condition," a quick-fingered typist could accidentally configure the Therac-25 so the electron beam would fire in high-power mode but with the metal X-ray target out of position. At least five patients die; others are seriously injured.
1988 -- Buffer overflow in Berkeley Unix finger daemon. The first internet worm (the so-called Morris Worm) infects between 2,000 and 6,000 computers in less than a day by taking advantage of a buffer overflow. The specific code is a function in the standard input/output library routine called gets() designed to get a line of text over the network. Unfortunately, gets() has no provision to limit its input, and an overly large input allows the worm to take over any machine to which it can connect.
Programmers respond by attempting to stamp out the gets() function in working code, but they refuse to remove it from the C programming language's standard input/output library, where it remains to this day.
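For readers who have not met gets(), here is a minimal sketch (not from the original article) of the flaw and the bounded alternative:

    #include <cstdio>
    #include <cstring>

    int main() {
        char name[16];

        // gets(name);   // the classic bug: a 16-byte buffer but unlimited input --
        //               // anything past 15 characters overwrites adjacent memory
        //               // (the saved return address, in the worm's case).

        // The bounded replacement never writes more than sizeof(name) - 1 characters.
        if (std::fgets(name, sizeof(name), stdin)) {
            name[std::strcspn(name, "\n")] = '\0';   // strip the trailing newline, if any
            std::printf("hello, %s\n", name);
        }
    }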
1988-1996 -- Kerberos Random Number Generator. The authors of the Kerberos security system neglect to properly "seed" the program's random number generator with a truly random seed. As a result, for eight years it is possible to trivially break into any computer that relies on Kerberos for authentication. It is unknown if this bug was ever actually exploited.
January 15, 1990 -- AT&T Network Outage. A bug in a new release of the software that controls AT&T's #4ESS long distance switches causes these mammoth computers to crash when they receive a specific message from one of their neighboring machines -- a message that the neighbors send out when they recover from a crash.
One day a switch in New York crashes and reboots, causing its neighboring switches to crash, then their neighbors' neighbors, and so on. Soon, 114 switches are crashing and rebooting every six seconds, leaving an estimated 60 thousand people without long distance service for nine hours. The fix: engineers load the previous software release.
1993 -- Intel Pentium floating point divide. A silicon error causes Intel's highly promoted Pentium chip to make mistakes when dividing floating-point numbers that occur within a specific range. For example, dividing 4195835.0/3145727.0 yields 1.33374 instead of 1.33382, an error of 0.006 percent. Although the bug affects few users, it becomes a public relations nightmare. With an estimated 3 million to 5 million defective chips in circulation, at first Intel only offers to replace Pentium chips for consumers who can prove that they need high accuracy; eventually the company relents and agrees to replace the chips for anyone who complains. The bug ultimately costs Intel $475 million.
1995/1996 -- The Ping of Death. A lack of sanity checks and error handling in the IP fragmentation reassembly code makes it possible to crash a wide variety of operating systems by sending a malformed "ping" packet from anywhere on the internet. Most obviously affected are computers running Windows, which lock up and display the so-called "blue screen of death" when they receive these packets. But the attack also affects many Macintosh and Unix systems as well.
June 4, 1996 -- Ariane 5 Flight 501. Working code for the Ariane 4 rocket is reused in the Ariane 5, but the Ariane 5's faster engines trigger a bug in an arithmetic routine inside the rocket's flight computer. The error is in the code that converts a 64-bit floating-point number to a 16-bit signed integer. The faster engines cause the 64-bit numbers to be larger in the Ariane 5 than in the Ariane 4, triggering an overflow condition that results in the flight computer crashing.
First, Flight 501's backup computer crashes, followed 0.05 seconds later by a crash of the primary computer. As a result of these crashes, the rocket's primary processor overpowers the rocket's engines and causes the rocket to disintegrate 40 seconds after launch.
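The failure mode is easy to sketch (this example is not from the original article, and the value is made up): narrowing a 64-bit floating-point number into a 16-bit signed integer is only safe if someone first checks that it fits.

    #include <cstdint>
    #include <cstdio>
    #include <limits>

    int16_t toInt16Unchecked(double x) {
        return static_cast<int16_t>(x);   // the Ariane-style bug: invalid once |x| exceeds 16 bits
    }

    bool toInt16Checked(double x, int16_t& out) {
        if (x < std::numeric_limits<int16_t>::min() ||
            x > std::numeric_limits<int16_t>::max())
            return false;                 // out of range: report it instead of flying with garbage
        out = static_cast<int16_t>(x);
        return true;
    }

    int main() {
        std::printf("in range: %d\n", toInt16Unchecked(1234.5));   // fine: the value fits
        double sensorValue = 50000.0;      // hypothetical reading too large for int16_t
        int16_t v;
        if (!toInt16Checked(sensorValue, v))
            std::printf("out of range: refuse to convert\n");      // the check Flight 501's code lacked
    }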
November 2000 -- National Cancer Institute, Panama City. In a series of accidents, therapy planning software created by Multidata Systems International, a U.S. firm, miscalculates the proper dosage of radiation for patients undergoing radiation therapy.
Multidata's software allows a radiation therapist to draw on a computer screen the placement of metal shields called "blocks" designed to protect healthy tissue from the radiation. But the software will only allow technicians to use four shielding blocks, and the Panamanian doctors wish to use five.
The doctors discover that they can trick the software by drawing all five blocks as a single large block with a hole in the middle. What the doctors don't realize is that the Multidata software gives different answers in this configuration depending on how the hole is drawn: draw it in one direction and the correct dose is calculated, draw in another direction and the software recommends twice the necessary exposure.
At least eight patients die, while another 20 receive overdoses likely to cause significant health problems. The physicians, who were legally required to double-check the computer's calculations by hand, are indicted for murder.