Basics, Simplicity, Speed, Success
Twenty years ago I wrote this:
From: Rico Mariani
Sent: Wednesday, October 15, 2003 12:51 PM
To: Bill Gates; Jim Allchin; Steve Ballmer; Eric Rudder
Cc: [Other Names Elided]
Subject: Basics, Speed, Simplicity, Success
Overview
I’m tired of people using managed code as the excuse for their performance problems.
A core MS value, tight frugal coding, is gone or at least vastly diminished in Microsoft’s managed code development practices. New Longhorn platform code is instead bursting with “second-system-syndrome” fat and complexity that is generally only incidentally related to the managed nature of the code. “Hard core” programming is all but gone. Despite continuing efforts at education (see e.g. http://blog/ricom), managed code is still the oft-played excuse.
We’ve built our greatest successes not when we had grandiose plans for some platform evolution, but rather when we specifically targeted the needs of our customers. Simplifying the experience for our customers, so that they get fast, reliable, predictable, and trustworthy behavior, needs to be our foremost goal. The goal of creating a new rich API is not an end in itself but is rather subservient to creating the most excellent experience we can for our customers.
I believe that our WinFx mission messaging to our own developers is not consistent with getting the best quality product out. I believe that we need to get back to encouraging the tightest and simplest code that will address the needs of our customers. Less than this is certainly not acceptable, more invites disaster as well.
There is no one magic bullet to fix a problem like lost performance culture. You do it in many small ways: Good review goals, performance pushes, refocusing resources on performance in key areas, performance centric design reviews, performance plans (with budgets), performance cops (good cops and bad cops). All of these steps are needed.
Introduction
I have many negative messages in this email and so I thought before I get into things that I should start with something positive. I am very fortunate in my assignment. I am one of the few people who is able, paid in fact, to see many different aspects of Longhorn and WinFX as they evolve. I have to say that it is very exciting to see many of these developments and gratifying to be able to participate in their ripening.
However, before we complete our task, shipping the next major operating system, we face several daunting obstacles, and I do not believe we are yet on a path past all of them.
I must apologize for sending a message full of what I know is trite advice, but I do not feel that we are yet acting appropriately on all of these fronts, and it is my hope that highlighting these points will help focus the need. I do not wish for Longhorn to suffer Cairo’s fate.
Complexity
Perhaps the greatest single issue we face is the overall complexity of the system. This system is so large that certainly no single person, and perhaps not even any one team of architects, can understand it all. Despite this, our trend is and remains to add additional complexity as the solution to most if not all of the problems we encounter.
As the complexity grows, the ability of line engineers to grasp the workings of the entire system vanishes, and duplications and redundancies become rampant. This first manifests itself as additional complexity and waste, such as the presence of many separate and subtly different XML parsers — I am aware of at least 4 different ones. But that is only a minor problem. As the complexity advances further, even the most advanced engineers begin to fear making changes in important functional areas because they cannot predict the consequences. I have attended many meetings where, for instance, Click-Once install and run was described, and I have to say I came away daunted every time, trying to consider the interactions with other behaviors of the system (such as security and versioning), and I believe I am not alone in this regard.
Substrate
A great deal of the new code we are writing is managed code. Many people feel that this is basically a disaster but I disagree. There have been successful systems based on managed models to the metal (e.g. Taos/Cedar) and there is no reason to believe from past experience that this path is fundamentally doomed. However, notwithstanding the theoretical possibility of building a system out of managed code, we have a very real problem. A good deal of our managed code platform has a RAD legacy. It should be no surprise that a framework substantially targeting that class of developer has features useful to such a developer; however, RAD development is substantially different from platform development. Many classes were not written with the goal of minimizing cost of execution but rather cost of development. So you see things like error checks at many different levels in very base-level classes (like enumerators), eroding a few percent here and a few percent there.
Compounding this problem, some of the practices which are most egregious from the perspective of platform code have a tendency to be echoed up into higher-level classes, where they become even more acute or must be masked. A classic example of this is the extensive use made of exceptions in typical managed code. It becomes very tempting to use exception management to simplify contracts, to the point where exceptions become less than exceptional. The net effect is things like the recent statistic I heard (which I have not confirmed but which I find credible) that over 100 exceptions are thrown and caught in the course of a normal “dir c:\” command in the Monad managed shell (the accuracy of this anecdote is not important; I’m not trying to make a statement about Monad, but rather about the ease with which one can abuse exceptions). [Monad was ultimately wildly successful and became PowerShell]
It’s important to note that the managed platform in no way mandates that exceptions be used at all, certainly no more than C/C++ does. In fact, the irony of the situation is that it is precisely because of the ease and richness of exceptions in the managed system, and the reduction in tax associated with them (there are no destructors to worry about calling, so state management is cheaper), that exceptions are more popular than ever. But the fact that seems to have been lost is that although the overall cost of having exception handling present in the code is lower than it was in unmanaged code, the cost of actually throwing exceptions is higher! Many find this fact astonishing, but the situation couldn’t possibly be otherwise: a managed throw has to do at least everything a native throw does, so a tie is the best one could hope to achieve.
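To make the pattern concrete, here is a minimal C# sketch (mine, not from the original email; the method names are hypothetical) of a contract that turns exceptions into control flow, next to a non-throwing alternative. In the first version every malformed input pays the full price of a throw and a catch; in the second, the common failure path throws nothing at all.

```csharp
using System;

class ExceptionsAsControlFlow
{
    // Anti-pattern: the exception is part of the expected path, so
    // every malformed input pays the full cost of a throw and a catch.
    static int ParseOrZero(string s)
    {
        try { return int.Parse(s); }
        catch (FormatException) { return 0; }
    }

    // Cheaper contract: failure is reported in the return value and
    // nothing is thrown on the common malformed case.
    static int TryParseOrZero(string s)
    {
        int value;
        return int.TryParse(s, out value) ? value : 0;
    }

    static void Main()
    {
        string[] inputs = { "42", "not-a-number", "7" };
        foreach (string s in inputs)
            Console.WriteLine("{0} -> {1} / {2}", s, ParseOrZero(s), TryParseOrZero(s));
    }
}
```

Contracts in the Try style keep exceptions exceptional without giving up the simplicity that made them attractive in the first place.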
I believe that there are key innovations and corrections required in the managed platform but that fundamentally those are achievable. However I also believe that the general level of knowledge of managed systems and effective ways to code against them must increase or low level improvements will be washed away.
Techniques
A system that is difficult to use incidentally causes economy in its usage; this can sometimes have pleasant side effects. Contrariwise, a system that is easy to use is also easy to abuse.
One of the larger tragedies of managed code is that it tends to make everything uniformly easier to do. Even things that really aren’t such a good idea are still comparatively easy. Consider for instance the fact that there really is only one way to allocate memory and that it is a very easy way. Because of this it is exceedingly simple to allocate arbitrary amounts of temporary storage, thereby making it simple to create arbitrary growable data-structures for any need. In fact there are built-in classes (e.g. ArrayList) that make it even easier.
Because this is all so easy, it’s very tempting for a developer to actually do it. So you find that people have written (very easy) code to read entire file chunks into an in-memory representation that they can then reason over, extract what they need, and then throw it all away. Now it’s possible to do this in unmanaged code, but most developers wouldn’t because they’re faced with daunting problems like “How do I allocate this memory anyway?”, “If my algorithm fails, how will I clean up after myself?”, and “How do I know when it’s safe to free all this stuff?” As a result, it’s much more likely that a developer would find a way to do all the work he/she needed to do without doing any allocations at all, which actually is probably a good thing. That same allocation-free code would almost certainly have worked under the managed runtime but there was no motivation to code it up because it wasn’t the easiest way.
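As an illustration, here is a C# sketch of the two approaches (my example, with hypothetical names, not from the email). Both find the first line with a given prefix; the “easy” version materializes the entire file as strings, while the frugal version allocates at most one line at a time and stops as soon as it has an answer.

```csharp
using System;
using System.IO;

class FirstMatch
{
    // The "easy" version: pull the whole file into memory as an
    // array of strings, scan it, then let the GC throw it all away.
    static string FindSlurp(string path, string prefix)
    {
        string[] lines = File.ReadAllLines(path);
        foreach (string line in lines)
            if (line.StartsWith(prefix))
                return line;
        return null;
    }

    // The frugal version: stream one line at a time and stop early.
    // Peak allocation is bounded by the longest line, not the file size.
    static string FindStreaming(string path, string prefix)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                if (line.StartsWith(prefix))
                    return line;
        }
        return null;
    }

    static void Main(string[] args)
    {
        string hit = FindStreaming(args[0], args[1]);
        Console.WriteLine(hit != null ? hit : "(not found)");
    }
}
```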
Another irony of managed code development is that sometimes the virtual machine is actually quite good at handling typical mistakes that have been made and so the problem gets masked until much later. For instance, an algorithm that causes the creation of many hundreds of thousands of temporary strings might go unnoticed in unit testing because the garbage collector is exceedingly good at reclaiming all those dead objects. However, when used with, say, a second thread, the situation becomes much more complex, and it isn’t until then that the bad pattern is noticed. If those allocations had been done instead by the likes of malloc() and free(), the performance likely would have been bad enough in the first instance to expose the problem immediately.
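A sketch of the kind of churn the collector hides (again mine, with hypothetical names): both methods below produce the same string, but the first creates a brand-new, ever-larger temporary on every iteration, each one dead almost immediately, while the second grows a single buffer.

```csharp
using System;
using System.Text;

class StringChurn
{
    // Looks fine in a unit test: the GC quietly reclaims the
    // thousands of dead intermediate strings this loop creates.
    static string JoinSlow(string[] parts)
    {
        string result = "";
        foreach (string part in parts)
            result = result + part + ";"; // a new, larger string each time
        return result;
    }

    // Same output, one growable buffer instead of thousands of corpses.
    static string JoinFast(string[] parts)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string part in parts)
            sb.Append(part).Append(';');
        return sb.ToString();
    }

    static void Main()
    {
        string[] parts = new string[10000];
        for (int i = 0; i < parts.Length; i++) parts[i] = i.ToString();
        // Same result either way; the difference is ~10,000 dead
        // temporaries that the collector silently sweeps up.
        Console.WriteLine(JoinSlow(parts) == JoinFast(parts));
    }
}
```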
Dependencies
As long as there have been libraries developers have been taking dependencies on them, but perhaps never so blindly as is the case now.
Now this point again ties back to system complexity, but I wanted to highlight dependencies specially. In the managed world we frequently have complex cross-linkage between assemblies. This isn’t inherently a bad thing; rich interactions between assemblies are very convenient. However, in the unmanaged world using a library (via COM or a DLL) is generally somewhat more explicit. Additionally, there is usually considerable scrutiny before a DLL writer decides it’s a good idea to take a dependency on another DLL.
For reasons I do not understand, this level of scrutiny is entirely absent when using managed code. Platform developers are then stunned, after the fact, to find that their application now depends on, for example, the C# compiler, at run time (!) because they decided to use an XML Serializer which naturally generates dynamic C# code and compiles and loads it at run time.
There are two points to make here. First, I assure you that there are no laws of managed code usage that require an XML Serializer to invoke the C# compiler. The decision to do so was presumably made because it was appropriate for the scenarios in which the serializer was intended to be used. Second, there are many groups, and I do not wish to embarrass them all, who have come very far in their development cycle before they even noticed that the C# compiler was being invoked (in some cases over a dozen times) during their initialization sequence.
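For the curious, the call in question looks entirely innocent at the call site, which is exactly why the dependency goes unnoticed. Here is a sketch using the real XmlSerializer API with a hypothetical Contact type (later SDKs added sgen.exe precisely so that these serialization assemblies could be pre-generated at build time instead):

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical type; any public class with public members will do.
public class Contact
{
    public string Name;
    public string Email;
}

class HiddenDependency
{
    static void Main()
    {
        // This innocent-looking line is where the surprise lives: on the
        // desktop framework of that era it generated C# source for a
        // Contact serializer, invoked the C# compiler, and loaded the
        // resulting assembly, all at run time.
        XmlSerializer serializer = new XmlSerializer(typeof(Contact));

        Contact c = new Contact();
        c.Name = "Alice";
        c.Email = "alice@example.com";

        StringWriter output = new StringWriter();
        serializer.Serialize(output, c);
        Console.WriteLine(output.ToString());
    }
}
```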
In general, all manner of people pick up massively powerful assemblies and do things with them. Sometimes very smart things and sometimes silly things, but rarely in a thoughtful way, having understood the price they would pay. The end result of this is that the dependency graph of typical managed applications is often nearly complete: nearly every assembly depends, directly or indirectly, on nearly every other.
Frugality and Culture
We live in a world where there are theme bitmaps in use that rival the size of the entire Windows 3.0 distribution. Sometimes this breeds a bad attitude. Recently a PM quipped to me, “Who cares, our users will soon have 1G of memory anyway, so whatever.” If that’s an acceptable attitude, we’ve come a long way indeed from the bit-squeezing questions we used to have to answer to earn a job.
It’s dangerous to live in a world that seems roomy because it’s tempting to spread out and take a lot of space without thinking very carefully about whether you’re using the space wisely.
I could be saying the same thing about processor cycles.
When our developers act like resources are in abundance, they build complex systems with complex data structures full of support for special cases. They don’t focus on doing a great job on the core problem; they may not even choose the most expedient solution, because there might be some other solution that’s 10% more general and 25% more elegant, and that one gets chosen even though it is 50% more complex and 100% bigger. We aren’t paying people to build tight simple systems — we reward them for “engineering masterpieces” and everyone wants to be the next one to build his/her magnum opus at our expense.
Nowhere is this better evidenced than in the awesome generality of basic infrastructure pieces like our config file readers (which burn 500k of temporary string storage reading machine.config in order to extract a few lines of options). Once again I’d like to assure you that it is possible to read 3 lines out of a 140k config file in managed code without allocating 500k of string data; however, we don’t seem to be able to do it at this time. Other fine examples include our serialization services (we are now close to having one for every day of the week), and our security evidence/policy system, which probably would have been an enigma to Alan Turing and certainly is to me.
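To show the claim is not hypothetical, here is one way to do it in C# (a sketch of mine; the add/key/value element shapes are modeled on the familiar appSettings style and are otherwise assumptions): a streaming reader that pulls a single setting out of a large XML config file without building a DOM or materializing the file’s full contents as strings.

```csharp
using System;
using System.Xml;

class FrugalConfigRead
{
    // Pull one value out of a large XML config file without building a
    // DOM. String allocation stays roughly proportional to what we
    // actually inspect, not to the size of the file.
    static string ReadSetting(string path, string key)
    {
        using (XmlTextReader reader = new XmlTextReader(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == "add" &&
                    reader.GetAttribute("key") == key)
                {
                    return reader.GetAttribute("value"); // stop early
                }
            }
        }
        return null; // not present
    }

    static void Main(string[] args)
    {
        string v = ReadSetting(args[0], args[1]);
        Console.WriteLine(v != null ? v : "(not set)");
    }
}
```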
We cannot possibly address our performance problems until and unless we start rewarding our developers for making simple, tight systems that solve our customers’ problems with a minimum of code complexity.
Competition
I have a lot of respect for the Linux community. I have even more respect for my friends who worked on Windows XP, and Win2k3. All of those are examples of excellent native code platforms and that is our competition. Our competition is not a lame Java benchmark, and certainly not some other interpreted RAD stuff. We have to compete against the very best that can be built and we have to bring a great deal to the table in that comparison.
This competition goes across every important dimension. When we’re talking about how we start up managed applications, we have to compare that startup time to classic native application startup time because that’s what my mom has right now. When we’re talking about getting personal contact info we should be comparing to the best that Outlook or Outlook Express can do. Where our old native version of a service was poor we should be comparing against what a good one would have been able to achieve, not what our old lame one did.
Generally, we must always be striving for excellence in terms of robustness and performance in all our new code not simply exceeding our old crappy standard.
Peer Review and Excellence
It used to be embarrassing to produce code that was big, and nobody wanted to do it. It seems a lot more fashionable now to make something that’s big and blame it on someone else who will blame it on someone else, who probably will blame it on me or someone on my team because we are the fount of all things managed.
It’s much easier to blame someone else than it is to look hard at your own system. So perhaps the fault is that the tools just aren’t up to the task? Well, while it is true that in some cases it’s still too difficult to diagnose problems in managed code, and it is certainly true that better tools would help, I nonetheless believe we are nowhere near being blocked on tools at this time. It is not the case that we are at the beach looking for grains of gold in the sand; today what we are looking for is a large dump-truck on the beach that has stalled and needs to be towed. When we stop seeing things like a dozen C# compiler invocations, 500k of strings, 33% of all allocations being enumerators on temporary arrays, and other effects of that size, then we can worry about tools for finding the gold dust.
Culturally, where once tight code was a core value of our “hard-core” engineers we now tread carefully so as to not offend anyone if some bad code gets checked in. Reverting to the “hard-core” days is probably unwise but I think there’s some room to still be ruthless about bad coding practices.
Conclusion
There are many factors combining to create risk for managed code in Longhorn. The nimbleness of managed code itself creates a certain inherent risk. The fact that it is a new platform, the overall complexity of the system we are building and what we are building it upon, all contribute to general risk. Compounding this is a lack of cultural focus on excellent performance and simplicity in favor of focus on adding new features and complexity.
I believe that if we do not act to simplify Longhorn as much as we can, and reestablish our core engineering principles of tight frugal code, we will build a system that is not marketable at any price.