
Saturday, March 31, 2012

On exceptional virtues

Some time ago, a friend of mine pointed me to this post by Joel on Software. I usually tend to agree with Joel's opinions, but I couldn't align with this piece, which argues strongly in favor of returning error codes rather than throwing exceptions. So I am going to propose the counterpoint and explain why I'd rather use exceptions.

First, let me get this straight: even though I happen to know C quite well, C++ is a foreign and, frankly, intimidating language to me. So please keep in mind that what I have to say applies mostly to garbage-collected languages such as C# or Java. Languages must be understood as a whole, and some traits can be incompatible; so it goes with exceptions too.

I believe programs which use error codes tend to be more verbose and intricate. Consider these two programs:
try
{
    input = read();
    result = process(input);
    output(result);
}
catch (SomeException)
{
    report_error();
}
and
input = read();
if (input == null)
{
    report_error();
    return;
}
result = process(input);
if (result == null)
{
    report_error();
    return;
}
if (!output(result))
{
    report_error();
}
So which one would you rather have?
It can get even worse, as half of the developer population seems to favor functions with a single exit point. The previous example then becomes:
input = read();
if (input != null)
{
    result = process(input);
    if (result != null)
    {
        if (!output(result))
        {
            report_error();
        }
    }
    else
    {
        report_error();
    }
}
else
{
    report_error();
}
We just made a 10-line function span 20 lines. So is my brain under-performing? Or does using error codes increase the cognitive load? One thing is certain: avoiding exceptions forces you to violate two important programming principles:
  • Don't Repeat Yourself: because there are duplicate versions of the error handling code.
  • Single Responsibility Principle: since the logic of the program is interwoven with the management of exceptional cases.
Some C programmers overcome their ingrained aversion to gotos in order to mimic exceptions, writing code such as this:
  input = read();
  if (input == null) goto error;
  result = process(input);
  if (result == null) goto error;
  if (output(result)) goto end;
error:
  report_error();
end:
  return;
These programmers surely seem to agree with me.

Let us now suppose for a moment that we still go for the lengthy version. After all, having more lines of code makes our productivity statistics look good, doesn't it? Now we are faced with another problem: how are we going to pass back an error code from functions which already return a value? We have several alternatives, each with its own set of unwanted consequences:
  • as we just saw in the previous example, we can use null to encode an error,
  • encode the error codes inside the return value,
  • or use an out parameter to store the return value of the method.
Obviously, I wouldn't advise using null, since forgetting a check for null most probably results in a null pointer exception raised later in the code, and you are back to square one. Additionally, the exact interpretation of null may sometimes be blurry. Let me give you a concrete example from the Java standard API. The interface Map from java.util has a method get which returns the value mapped to a given key, or null if the key is not mapped to any value. However, and herein lies the ambiguity, if the map permits null values, then null may either represent a mapping or no mapping at all. It is then advised to use the method containsKey to distinguish between these cases. What an insecure API! It is safer for the method's designer to either disallow null values, or adopt the solution used by the C# class Dictionary: throw a KeyNotFoundException!
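To make the ambiguity concrete, here is a small Java sketch (class name mine) showing how get alone cannot distinguish a null mapping from a missing key:

```java
import java.util.HashMap;
import java.util.Map;

public class NullAmbiguity {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("key", null); // HashMap permits null values

        // get returns null in both cases: ambiguous!
        System.out.println(map.get("key"));    // null, yet the key IS present
        System.out.println(map.get("absent")); // null, and the key is absent

        // Only containsKey can tell the two cases apart.
        System.out.println(map.containsKey("key"));    // true
        System.out.println(map.containsKey("absent")); // false
    }
}
```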
The C# Dictionary class proposes an alternative way to fetch a value via the TryGetValue method. This method returns a boolean to indicate the presence of a mapping, while an additional out parameter passes the value. This practice has one benefit: it avoids the sometimes (but rarely) prohibitive cost of throwing and catching an exception. I will briefly come back to this point later in this post. However, in general, it is not good practice to modify the state of a parameter in a method which has a return value. Indeed, I believe the separation of methods into either commands or queries, as described in the book Clean Code: A Handbook of Agile Software Craftsmanship, to be sane. A quote from this book goes:
Functions should either do something or answer something but not both. Either your function should change the state of an object, or it should return some information about that object.
Sometimes the error code needs to carry more information than just a boolean state. Some coders find it smart to encode the error state in the same type as the one used to store correct values. For instance, if a method normally returns positive integers, then negative integers could be used for error codes. This is a terrible practice, which more often than not leads the caller to not handle the error code at all.
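As an illustration (the method and values here are made up), this is the kind of silent failure that encoding invites:

```java
public class ErrorCodeHazard {
    // Error encoded in the return value: -1 means "not found".
    static int indexOfErrorCode(int[] values, int target) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == target) return i;
        }
        return -1; // easy for the caller to forget to check
    }

    public static void main(String[] args) {
        int[] values = {10, 20, 30};
        // The caller forgets the check and happily uses -1 as if it
        // were a valid index: the error is silently swallowed.
        int index = indexOfErrorCode(values, 99);
        System.out.println("index = " + index); // prints index = -1
    }
}
```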
Contrast these convoluted solutions with the simplicity of wrapping any error information into an exception. Plus you get the stack trace for free!

Finally, error codes must necessarily be dealt with by the caller. Sometimes the caller does not have anything reasonable to do, and so she must bear the burden of propagating errors upwards. Not only that, she must also be careful not to forget to handle any error code. Exceptions, instead, automatically climb the call stack until they are caught. In the worst case, all exceptions may end up in a catch-all at the top level of the application. Also, in Java, exceptions which do not inherit from RuntimeException must be declared in the method signature. No more silly mistakes!
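For Java readers, a minimal sketch of a checked exception at work (the exception and method names are mine):

```java
// A checked exception: callers must either catch it or declare it.
class ConfigException extends Exception {
    ConfigException(String message) { super(message); }
}

public class CheckedDemo {
    // The throws clause forces every caller to deal with the error:
    // the compiler will not let a caller silently ignore it.
    static String readSetting(String name) throws ConfigException {
        if (name.isEmpty()) throw new ConfigException("empty setting name");
        return "value-of-" + name;
    }

    public static void main(String[] args) {
        try {
            System.out.println(readSetting("timeout"));
        } catch (ConfigException e) {
            // The exception climbed the stack; one handler suffices.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```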

There may be only one reason to prefer error codes over exceptions, and that is performance. Even though using exceptions does not change the complexity of your algorithm, they are said to be very costly in both C# and Java. Well, let us put hard numbers on this belief.
I wrote two versions of a program which tries to retrieve several values from an empty dictionary. The first version throws exceptions:
Dictionary<int, int> dictionary = new Dictionary<int, int>();
int count = 0;
for (int i = 0; i < N; i++)
{
    int result;
    try
    {
        result = dictionary[i];
    }
    catch (KeyNotFoundException)
    {
        result = -1;
    }
    count += result;
}
Whereas the second program returns error codes:
Dictionary<int, int> dictionary = new Dictionary<int, int>();
int count = 0;
for (int i = 0; i < N; i++)
{
    int result;
    if (!dictionary.TryGetValue(i, out result))
    {
        result = -1;
    }
    count += result;
}
I ran both programs with the loop upper bound N successively equal to 1, 2, 4 and 8 million! Here are the durations of each run in milliseconds:

N (in million)    1        2        4        8
Error codes       46       78       202      343
Exceptions        111774   221208   442057   882788

Without any doubt, error codes win the speed test, being roughly 2500 times faster than throwing and catching exceptions. However, keep in mind we are experimenting with exceptions in the millions. A quarter of an hour to process 8 million exceptions does not seem outrageous. So unless you are in a case where exceptions are both frequent and on a critical path of your program, you shouldn't bother. As always, premature optimization is the root of all evil.
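If you want to reproduce a similar measurement in Java, here is a rough analogue of the two C# programs above. Java's HashMap returns null rather than throwing on a missing key, so the exception path is simulated with Optional.orElseThrow; absolute timings will of course vary with your machine and JVM:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

public class ExceptionCost {
    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<>();
        int n = 1_000_000;

        // Error-code style: a missing key simply yields null.
        long start = System.nanoTime();
        long count = 0;
        for (int i = 0; i < n; i++) {
            Integer result = map.get(i);
            count += (result == null) ? -1 : result;
        }
        System.out.printf("error codes: %d ms (count=%d)%n",
                (System.nanoTime() - start) / 1_000_000, count);

        // Exception style: simulate a lookup that throws on a missing key.
        start = System.nanoTime();
        count = 0;
        for (int i = 0; i < n; i++) {
            int result;
            try {
                result = Optional.ofNullable(map.get(i))
                                 .orElseThrow(NoSuchElementException::new);
            } catch (NoSuchElementException e) {
                result = -1;
            }
            count += result;
        }
        System.out.printf("exceptions:  %d ms (count=%d)%n",
                (System.nanoTime() - start) / 1_000_000, count);
    }
}
```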

To conclude, I hope I convinced you that using exceptions lets you avoid some code duplication, enables separation of concerns, improves methods API and decreases the risk of forgetting error cases.

To sum up, here are the take-away lessons from this article:
  • always prefer exceptions to return codes,
  • in Java, try to avoid RuntimeExceptions,
  • optimize exceptions out only if really necessary.

Dali, The Temptation of St. Anthony
or exceptions propagating up the program?

Thursday, June 2, 2011

Testing internal classes: quality and minimal interface at the same time

Today's entry is a brief one. I just found a very satisfactory answer to one of my everlasting questions about non-regression testing. Namely, as stated in this previous post, how is it possible to test internal classes while keeping all NUnit test code in a separate project?
Well, there is an attribute, InternalsVisibleTo, from System.Runtime.CompilerServices that lets you do just that. Let's say your test project is called TestSuite; then all you need to do is add the following line to AssemblyInfo.cs in the project under test:
[assembly: InternalsVisibleTo("TestSuite")]
Now all internal classes and methods are visible to the test methods in TestSuite. So you can do deep non-regression testing of an assembly and still present a minimal interface to your users. Pretty neat, isn't it?

If I am not mistaken, this problem does not arise in Java, where the notion equivalent to assembly-internal is package-private. So even though the JUnit code is in a distinct project, it suffices to declare it as part of the same package as the project under test in order to gain full access to all package-private classes and methods.
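A minimal sketch of the Java side of this idea (class names mine; in a real project the test class would live in a separate JUnit source tree declaring the same package):

```java
// In Java, a class with no access modifier is package-private:
// it is visible only to classes declared in the same package.
class Parser {
    int countTokens(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

// A test class in the same package (even one compiled from another
// source tree, as JUnit projects typically are) sees Parser directly.
public class ParserTest {
    public static void main(String[] args) {
        Parser parser = new Parser(); // accessible: same package
        System.out.println(parser.countTokens("a b c")); // prints 3
    }
}
```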

As often, stackoverflow has it all: this answer originates from there.

René Magritte, Decalcomania

Wednesday, March 9, 2011

Coding rules for clean and robust C# code

Introduction
This post presents my coding standards for writing clean and consequently robust C# code.
As I stated in a previous post, more time is spent understanding, debugging and reading code than actually writing it. So it is often worthwhile to spend some effort writing clear code upfront.

These standards have thus two main objectives, namely to:
  • set a limit on code complexity, be it for its size or shape, so that all coding tasks always remain feasible,
  • establish a normalized way of writing code, so that anybody can quickly understand any part of the project.
In an ideal world, there would be only one best way of writing any piece of code, and the meaning of the code would be plain obvious at first read. Following the rules listed in this post is, in my opinion, a step towards this ideal.

I am not claiming this is the only valid set of coding rules, or even that it is complete. However, I have seen many ways to fail a software project and following these rules surely limits risks.
Even though some rules may seem arbitrary, they are all motivated by past, sometimes unfortunate, experiences. So if you do not understand the rationale for some rule, please let me know and I will try to write further explanatory posts.
Lastly, I tried to order the rules by importance. Very often, coding standards are so long that nobody reads them, let alone applies them. So if you feel this is too much, try to follow the first few rules only. And maybe, in the future, some tool will be able to check all these rules automatically.

Driving principles
First, here are a few general and driving principles:
  • Keep it simple: try to implement the simplest solution that solves the requirements.
  • Do not over-design: in other words, do not code for yet nonexistent but potentially future requirements.
  • Do not optimize early: performance problems are solved after they are spotted, not when they are imagined.
  • Remove known bugs before coding new functionalities.
  • Write at least one automatic non-regression test before fixing any bug. This is really the minimum. Writing tests before coding (full-fledged TDD) is best.
  • Refactor and clean up code often.
The coding priorities are:
  • robustness: the code does not unexpectedly stop,
  • correctness: the code does what it is expected to do,
  • performance: the code is sufficiently efficient in both time and space.
Bojagi, or the minimality of patterns

Size limitations

The cost of all software development tasks grows, non-linearly, with the size of the code. So the most important rule is to try to limit the line count. Here are some limits I like to apply:
  • 1 class per file,
  • less than 20,000 lines of code per solution (if more, split out libraries),
  • less than 1000 lines per class,
  • less than 100 lines per method,
  • less than 150 characters per line,
  • less than 10 fields per class,
  • less than 4 arguments per method.
Please keep in mind these are approximate values that give an idea of the optimal scale for each category.

Forbidden language constructs
Here are language constructs that are absolutely forbidden:
  • Static methods, except for Main.
  • Static fields (unless they are private readonly constants).
  • Nested classes. Why hide some code that could be reused? Also, it is better to have the whole complexity of your architecture apparent.
  • Goto statements.
  • Preprocessor directives, in particular regions (#region, #endregion) and conditional code (#if DEBUG).
  • The ref method parameter.
  • The new method modifier.
  • Operator overloading.
  • Dead code.
  • Obsolete code that is commented out. After all, you should be using a powerful source control system such as Mercurial (read this and that, if you don't know what I am talking about).
  • The ternary conditional expression c ? e1 : e2.

Strongly discouraged language constructs

Here are language constructs which should be used sparingly:
  • The null value can almost always be avoided. In particular, you should never pass the null value as an argument. It is better to create another variant of the method. Returning null can be advantageously replaced by throwing an exception. In other cases, you can use the null object pattern. When null is used to encode a special case, then document it explicitly.
  • Casts are forbidden in general. They may be necessary to implement powerful functionalities, such as dynamic class loading. But then, they must be hidden in thoroughly tested libraries. Their use must be motivated and documented.
  • Generic classes should rarely be defined, even though their use is encouraged. Use generic classes for only one thing: when you need a collection of elements all of the same type. This practice avoids unnecessary casts. Any other need should be carefully evaluated. In particular, powerful à la C++ tricks are banned.
  • Comments should be as few as possible. Special one-line comments starting with TODO:, to signal minor tasks or open questions, are allowed. Nothing is worse than paraphrasing comments such as:
    // Creates an instance of MyClass
    this.myField = new MyClass();
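To illustrate the null object pattern mentioned above, here is a small Java sketch (the interface and class names are mine): instead of returning null when a feature is disabled, return a do-nothing implementation, so callers never need a null check.

```java
interface Logger {
    void log(String message);
}

// The null object: a do-nothing implementation standing in for null.
class NullLogger implements Logger {
    public void log(String message) { /* intentionally ignore */ }
}

class ConsoleLogger implements Logger {
    public void log(String message) { System.out.println(message); }
}

public class NullObjectDemo {
    // Instead of returning null when logging is disabled,
    // return a null object: callers always get a usable Logger.
    static Logger loggerFor(boolean verbose) {
        return verbose ? new ConsoleLogger() : new NullLogger();
    }

    public static void main(String[] args) {
        Logger logger = loggerFor(false);
        logger.log("silently discarded, no NullPointerException risk");
    }
}
```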

Reducing dependencies and making them explicit

  • Use the most restrictive visibility for classes: internal, then public.
  • Similarly, use the most restrictive visibility for fields, properties and functions: private, then internal protected, then internal, then protected, then public.
  • Minimize using directives. Put them all at the start of the file rather than fully qualifying class names in the code.
  • Access fields through this. rather than prefixing them with m_ or _.
  • Try to abide by the law of Demeter: avoid successive fields accesses in the same statement.
  • Minimize and comment the use of delegates and events. They make the control flow harder to follow!
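The law of Demeter item above can be sketched like this (hypothetical classes): instead of chaining accesses such as customer.getAddress().getCity(), let each object delegate to its own parts.

```java
class Address {
    private final String city;
    Address(String city) { this.city = city; }
    String getCity() { return city; }
}

class Customer {
    private final Address address;
    Customer(Address address) { this.address = address; }
    // Delegate instead of exposing the Address object:
    // callers stay ignorant of Customer's internal structure.
    String getCity() { return address.getCity(); }
}

public class DemeterDemo {
    public static void main(String[] args) {
        Customer customer = new Customer(new Address("Paris"));
        // One hop only: no customer.getAddress().getCity() chain.
        System.out.println(customer.getCity()); // prints Paris
    }
}
```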

Structuring statements
  • Initialize fields in the constructors rather than next to their declaration.
  • Declare all objects that implement the IDisposable interface with a using statement. On rare occasions this may not be possible, in which case you should document how the object will be disposed. Classes with the IDisposable interface include, for instance, streams and Windows forms.
  • Boolean conditions should not have any side-effects.
  • Reduce nesting levels. In particular, prefer short conditions and early exits, so that the following code:
    if (x != null && y != null)
    {
    x.DoSomething(y.GetSomething);
    }
    is rewritten into:
    if (x == null) return;
    if (y == null) return;
    x.DoSomething(y.GetSomething);
  • Constants must be declared const whenever possible, static readonly otherwise.
  • Do not use error codes; always prefer exceptions.
  • Ensure that all exceptions get caught at some point. In the worst case, set a general exception handling mechanism at the root of the program. Note this may not be that straightforward because of threads and the message loop of graphical applications.
  • Adding a logger is a good idea before shipping code.

Process rules
Bug tracker and project management
  • A bug-tracker such as Redmine should be used.
  • Tickets must be solved in the order of priority.
  • Once a ticket is solved, it should be first sent back to its author, who is the only one authorized to validate and close it.
Rules for commit messages
  • Use a version control system such as Mercurial.
  • All commit messages are prefixed by one of the following:
    [add]: adding of a new feature,
    [fix]: fixing a bug (optimizations fall under this category),
    [clean]: cleaning up code,
    [doc]: for any comment,
    [spec]: adding a new test
    [merge].
  • Whenever possible, it is also advised to indicate a ticket number in the commit message.
Organisation of a solution
  • Every Mercurial repository stores one solution.
  • Every solution has a project that produces either a library (.dll) or an executable (.exe).
  • Every solution has a project PrePush that contains pre-push tests, and a project PostPush for lengthy tests.
  • The central repository compiles and executes all pre-push tests as a hook before accepting any push.
  • A continuous integration system such as Hudson executes both pre-push and post-push tests. It also triggers the check of downstream solutions.
  • Every solution has a version number, preferably generated from the source control system number.
  • For library projects, an automatically generated documentation, with a tool such as Doxygen, is advised.
Non-regression tests
  • NUnit non-regression tests are mandatory.
  • The minimum consists in writing a test before fixing any bug. That way at least the same mistake is not made twice.
  • Better yet, apply TDD. That is, write tests before even starting to write code.

Syntactic conventions
A project should follow syntactic conventions, so that reading the code is easier for everybody. I believe that code should be light, succinct and go to the point. So I tried to choose conventions that remove any unnecessary hindrance to one's understanding.
Note that limiting complexity, compartmentalizing code and writing automatic tests are much more important than focusing on the small details of instruction syntax. So the rules stated in this section of the coding standards are really the least important. However, surprisingly, most coding standards seem to contain several (boring) pages of such rules. Note also that most of these rules can be checked automatically by fairly simple tools.

Naming conventions
  • Use camel case; underscores "_" are forbidden except in test suites.
  • Project, namespace, class, interface, struct, method and properties names start with a capital letter.
  • Fields and variables start with a lower case.
  • Interfaces start with I.
    Correct:
    struct Data;
    int bufferSize;
    class HTMLDocument;
    String MimeType();
    Incorrect:
    struct data;
    int buffer_size;
    class HtmlDocument;
    String mimeType();
  • Avoid abbreviations (except, maybe, for loop counters and X-Y coordinates).
    Correct:
    int characterSize;
    int length;
    int index;
    Incorrect:
    int charSize;
    int len;
    int idx;
  • Getters should have the same name as the field they access, except that the first letter is capitalized.
    Correct:
    int Count { get { return this.count; } }
    Incorrect:
    int getCount { get { return this.count; } }
  • Prefix boolean variables with is, has...
    Correct:
    bool isValid;
    Incorrect:
    bool valid;
  • Prefix event names by Event,
  • Prefix methods that handle an event with On,
  • Prefix delegate types by EventHandler,
  • Prefix methods and properties created for test purposes only with Test,
  • Prefix test methods that trigger an event with TestTrigger.
  • As much as possible, method names should contain a verb. Try using few different verbs. Avoid fancy verbs whose meaning is not clear to everybody. Try using the same verb for the same concept.
  • Prefix field accesses by this., rather than prefixing fields name with "_", or "m_".
Spacing
Care should be taken in presenting statements. A code with irregular layout looks messy. It is an indication of the poor care that was put into the code. And, according to the broken windows theory, it encourages further degradations from other coders.

Here are some of my preferred layout conventions:
  • Each instruction, even short ones, has its own line.
    Correct:
    x++;
    y++;
    Incorrect:
    x++; y++;
  • Braces have their own line. This rule applies regardless of the element that the braces enclose: namespace, class, method, property, conditionals... Is this convention better than the Java one, where the opening brace is on the same line as the keyword? Frankly, it doesn't matter much, as long as you are consistent. But if you really ask me, having the braces on their own lines makes it easier to visually identify blocks; plus, an if statement can quickly be suppressed with a single one-line comment.
    Correct:
    void Execute()
    {
    DoIt();
    }
    if(condition)
    {
    DoTrue();
    }
    else
    {
    DoFalse();
    }
    Incorrect:
    void Execute() {
    DoIt();
    }
    if (condition) {
    DoTrue();
    }
    else {
    DoFalse();
    }
    if (condition)
    DoTrue();
    else
    DoFalse();
    if (condition) DoTrue(); else DoFalse();
    if (condition) DoTrue(); else {
    DoFalse();
    }
  • For very short bodies, such as those of getters, everything can be put on the same line. In this case, spaces around the braces are mandatory.
    Example:
    int GetInfo { get { return this.info; } }
  • For if statements, if the body statement holds entirely on one line and there is no else, then the braces are omitted.
    Examples:
    if (condition) return;
    if (condition) break;
    if (condition) x = 0;
    if (condition) continue;
  • Put a space after keywords if, while, switch and foreach.
    Correct:
    if (condition)
    {
    DoIt();
    }
    Incorrect:
    if(condition) {
    DoIt();
    }
  • Boolean expressions that do not hold on one line should have their operators aligned to the left.
    Correct:
    if (b1
    || b2
    || b3)
    {
    }
    Incorrect:
    if (b1 ||
    b2 ||
    b3)
    {
    }
  • Avoid spaces around unary operators.
    Correct:
    i++;
    Incorrect:
    i ++;
  • Put spaces around binary operators.
    Correct:
    y = (m * x) + b;
    c = a | b;

    Incorrect:
    y=m*x+b;
    c = a|b;
  • Put no space before and one space after a comma, colon or semi-colon.
    Correct:
    DoSomething(a, b);
    class A: I
    {
    }
    for (int i = 0; i < 10; i++)
    {
    }
    Incorrect:
    DoSomething(a,b);
    class A:I
    {
    }
    for (int i = 0 ; i < 10 ; i++)
    {
    }
  • Avoid spaces between the name of a function and its opening parenthesis or between a parenthesis and an argument.
    Correct:
    DoIt(a, b);
    Incorrect:
    DoIt (a, b);
    DoIt( a, b );

Questions
Here are a few things I would like to ask you:
  • Does your project have coding standards?
  • Are they known and followed by the team members?
  • Do you think it is a good idea to require all non-negative integer values to have the type uint? I noticed that, in practice, the type uint seems to be used very rarely.
  • What about software costs estimation models such as COCOMO? Can these kind of models bring any value to the development process? How?

Sunday, July 4, 2010

Beautiful code: a sense of direction

Whenever I hear someone boast about the huge number of lines he produced for his last project, I silently rejoice at not having to take over his code. To be honest, like most industry programmers, I was often asked to expand code I did not write in the first place. Maybe you were too. If so, you must know how often one can experience moments of profound despair, poring over endless lines of incomprehensible code. For those of you lucky enough to ignore this feeling, let me just say that, every single time, I dug up countless skeletons, a lot more than the ones hiding in the closet of your most crooked politician. Sometimes I believe that funambulism combined with plate spinning might even be easier. A single move and everything collapses. Invariants break, exceptions fly and your manager loses numerous days of life expectancy.

Working on someone else's code, harder than jultagi?

Let us now be honest and consider the reverse situation. I vividly remember when one of my coworkers confessed, slightly embarrassed, what he really thought about my code. He found it simply unintelligible. I was taken aback! Until then, I had no idea that, what I took great pride in, could be so unpleasant for someone else. You may laugh at me, but frankly, do you still easily understand code you wrote only 6 months ago?

In fact, writing a piece of code is an order of magnitude easier than reading it. In that respect, some languages fare worse than others: for instance, Perl is renowned for being write-only. Moreover, beauty is said to lie in the eye of the beholder. So objective beauty, even for software, must not exist! For a very long time, I thought so. And it was extremely frustrating. I was not able to rationally explain why some piece of code did not feel right. Neither could I show other programmers how to improve their code writing skills. My only alternative was to lead by example or resort to arguments from authority.

Now, I believe that code beauty rests on only 3 principles. They are, if you will, the golden ratios of software. Despite their relative simplicity, they are surprisingly rarely followed by programmers, even advanced ones. However, if applied systematically, they invariably lead you to beautifully crafted pieces of code. With them, I gained a sense of direction that I hope to convey in this post.

A golden ratio for software?

My three golden principles of code beauty are:
  • Less is more,
  • Local is manageable,
  • Innovation is risk.
Less is more, often referred to as the KISS rule (Keep it simple, stupid!), means you should always try to minimize the complexity of your programs by decreasing the number of lines, classes, fields, methods, parameters... Good looking software is fit! The rationale for this principle is easy to understand: how much longer does it take you to read 1000 lines of code rather than 100 lines? Probably 10 times longer, provided you never get tired. This gets worse if you have to understand, debug, rewrite, and cover with tests 1000 lines of code rather than just 100. There is another, less considered, aspect of this principle: smaller programs have less room for irregularities. Copy-pasted code, duplicated states, complex logic are less likely. So, mechanically, the same amount of tests have a higher coverage, corner cases are fewer and bugs do not hide. Reducing code size is bug hunting! At last, for those of you who, like me, are fond of arguments by passing to the limit: an empty piece of code can simply not crash!

Local is manageable, also known as the separation of concerns principle, means you should structure your code in classes of limited size. At the same time, coupling between classes should be minimized. In C#, this implies that method access modifiers are chosen in this order: private first, then internal protected, internal, protected and finally public. Chinese junks were probably the first ships divided into watertight compartments. If the hull was damaged in one place, these subdivisions would circumscribe the flooding and help prevent the ship from sinking. In the same way, software compartmentalization limits the propagation of a bug's impact. Code modifications can be performed locally, and small parts replaced easily. In great software, richness of behaviour is achieved by the combination of multiple small and simple components, rather than by the gradual stratification of code into a monolithic and complex ensemble.

Chinese junk: compartmentalization at work!

In this age of unlimited belief in the virtues of progress, you probably won't often hear the last principle: Innovation is risk. However it should be common sense that every technology comes with its own risks. In that respect software is no different. If I can implement all the required functionalities with some integer fields, classes and methods, I am the happiest man. I do not enjoy worrying about when to release files, how floating points are rounded, which method is called through a delegate or whether my threads won't deadlock... Most programmers are immediately aware of the benefits of advanced language constructs. But few, myself included, really understand how to avoid misuse. Manipulate gadgets with care and maintain a sense of mistrust towards novelty!

I find these three principles particularly relevant because of their adequacy with some characteristics of the human mind, namely:
  • the inability to keep track of the state of more than a few items at the same time (due to the limits of our working memory),
  • the aptitude to consider a system at various scales,
  • the tendency to replicate similar behavior and to hold default positions.
To sum up, between several pieces of code that perform the exact same task, I prefer the smallest, most structured and conventional one. So I try to write code that complies with these principles. It takes admittedly a bit longer than just writing something that works. But because I have learned how hard it can be to read code, I know that the gain in readability largely and quickly pays back.

As you may know, writing crystal clear code right away is almost impossible. More importantly, software has a natural tendency to grow. So much so that coding can be considered a constant fight against entropy! To avoid code decay, regular refactoring is recommended. Refactoring techniques are code transformations which preserve the external behavior of programs. Together with non-regression tests, they are a fundamental tool to clean up code without breaking it. Almost every refactoring has its inverse transformation. So, without a clear objective in mind, it can be confusing to know which one to choose. Fortunately, when guided by the three aforementioned principles, this dilemma disappears. Let me end this post with a list of a few refactoring techniques. If you want to explore this topic in more depth, SourceMaking is a good place to start.

As you will see, most refactoring techniques are straightforward. But, as I like to say, there are no small simplifications. Small steps have a low cost and can be easily made. Yet, they often open up opportunities for bigger transformations. Also consider the fact that any major code change can always be broken down into a succession of very small steps.

First, here are a few refactorings that reduce the level of access modifiers:
  • if a public method is never called outside of its assembly, then make it internal,
  • if an internal method is never called outside of its subclasses, then make it internal protected,
  • if an internal method is never called outside of its class, then make it private,
  • if a private method is never called, then remove it,
  • make all fields private and create setters/getters whenever necessary.
Then, try to decrease the number of fields and variables by defining them at the appropriate scope. Here are some code transformations to do so:
  • if a field is used in only one method, then make it a local variable of the method,
  • if a local variable is assigned once and then immediately used, then remove it,
  • if the same value is passed around as a parameter of several methods, then introduce a field to hold it. This may be a good indication that the class could be split in two: one class to handle all computations related to this value, and another for the rest.
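As a small sketch of the first rule (the class names are made up): the field buffer below is only ever used by Format, so its scope can be narrowed to the method:

```csharp
using System.Text;

// Before: buffer is a field, although only Format uses it.
class ReportBefore
{
    private StringBuilder buffer;

    public string Format(string[] lines)
    {
        this.buffer = new StringBuilder();
        foreach (string line in lines) this.buffer.Append(line).Append(' ');
        return this.buffer.ToString().Trim();
    }
}

// After: buffer becomes a local variable of Format.
class ReportAfter
{
    public string Format(string[] lines)
    {
        StringBuilder buffer = new StringBuilder();
        foreach (string line in lines) buffer.Append(line).Append(' ');
        return buffer.ToString().Trim();
    }
}
```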
Reduce as much as possible the overall complexity of your classes, roughly evaluated by their number of lines of code. To do this, start by successively reviewing each field. For this task, an IDE able to automatically find all occurrences of a variable helps. Fields come in two flavours:
  • some are initialized in the class constructors and then never modified. This is in particular the case for all fields in "functional" objects. These fields should be made readonly,
  • others hold the state of the object and are constantly modified. Remove any such field when its value can be obtained by a computation from the other fields of the class, and replace it by a getter. This simplification removes the difficulty of preserving complex invariants between different pieces of data.
    To make things more concrete, here is a first example:
    class Point
    {
        int x;
        int y;
        int distance;
    
        public void ShiftHorizontally(int deltaX)
        {
            this.x += deltaX;
            this.distance = this.x*this.x + this.y*this.y;
        }
    
        public void ShiftVertically(int deltaY)
        {
            this.y += deltaY;
            this.distance = this.x*this.x + this.y*this.y;
        }
    }
    which could be rewritten like this:
    class Point
    {
        int x;
        int y;
        int Distance
        {
            get
            {
                return this.x*this.x + this.y*this.y;
            }
        }
    
        public void ShiftHorizontally(int deltaX)
        {
            this.x += deltaX;
        }
    
        public void ShiftVertically(int deltaY)
        {
            this.y += deltaY;
        }
    }
    You can also apply this refactoring when the same value is reachable through different fields, like the field hasRichTaylor in the following class:
    class Person
    {
        int wealth;
        bool IsRich
        {
            get { return this.wealth > 1000000000; }
        }
        Person taylor;
        bool hasRichTaylor;
    
        public void ChangeTaylor(Person taylor)
        {
            this.taylor = taylor;
            this.hasRichTaylor = taylor.IsRich;
        }
    }
    Class Person is safer written this way:
    class Person
    {
        int wealth;
        bool IsRich
        {
            get { return this.wealth > 1000000000; }
        }
        Person taylor;
        bool HasRichTaylor
        {
            get { return this.taylor.IsRich; }
        }
    
        public void ChangeTaylor(Person taylor)
        {
            this.taylor = taylor;
        }
    }
  • Sometimes, you can decrease the number of occurrences of a field by replacing several method calls by a single call to a richer method (for instance, prefer one call to AddRange over several calls to Add).
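    For instance, with a List<int> (the Basket class below is a made-up example):

```csharp
using System.Collections.Generic;

class Basket
{
    private List<int> items = new List<int>();

    // Before: three occurrences of the field items.
    public void AddAllBefore(int a, int b, int c)
    {
        this.items.Add(a);
        this.items.Add(b);
        this.items.Add(c);
    }

    // After: a single occurrence, thanks to the richer AddRange method.
    public void AddAllAfter(int a, int b, int c)
    {
        this.items.AddRange(new[] { a, b, c });
    }

    public int Count { get { return this.items.Count; } }
}
```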
  • Finally, I like to use auto-implemented properties rather than explicit backing fields whenever possible. I would write:
    class Tree
    {
        public int Height
        {
            private set;
            get;
        }
    }
    rather than:
    class Tree
    {
        private int height;
    
        public int Height
        {
            get { return this.height; }
        }
    }
    Other than the fact that in the second solution, the getter Height sometimes ends up far from the field height, this is admittedly a matter of taste...
Moving code between methods is another extremely effective way to reduce class complexity. For instance, if a method is called only once, then obviously inline it. Don't worry about the fact that you may later need this method elsewhere. For now, just reduce your code size. If needed, you can always apply the inverse transformation, which consists in creating a new method to hold a piece of code that is duplicated in two or more places. There are several variants of this refactoring:
  • if you access a field through a chain of fields, such as a.b.c, then define a direct getter C,
  • if you evaluate several times the same expression, such as this.LineHeight*this.lines, then define a getter this.Height,
  • if you have two loops with the same body but different ending conditions, then make a method with the loop boundary as parameter.
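The last variant can be sketched as follows (hypothetical methods summing a prefix of an array):

```csharp
class Sums
{
    // Before: two loops with the same body but different ending conditions.
    public int SumFirstTen(int[] data)
    {
        int total = 0;
        for (int i = 0; i < 10; i++) total += data[i];
        return total;
    }

    public int SumFirstHundred(int[] data)
    {
        int total = 0;
        for (int i = 0; i < 100; i++) total += data[i];
        return total;
    }

    // After: one method with the loop boundary as a parameter.
    public int SumFirst(int[] data, int count)
    {
        int total = 0;
        for (int i = 0; i < count; i++) total += data[i];
        return total;
    }
}
```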
Whenever you have several methods with the same name, but different parameters, you should try to make one method do all the work and all the other ones call it. This refactoring is particularly relevant for class constructors. Note that, if you can't do it easily, it may be an indication that the several versions of the method do not all perform a similar task and some of them should be renamed.
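A sketch of this constructor chaining (the Interval class is invented for the occasion):

```csharp
class Interval
{
    public readonly int Low;
    public readonly int High;

    // The most general constructor does all the work...
    public Interval(int low, int high)
    {
        this.Low = low;
        this.High = high;
    }

    // ...and the other overloads simply delegate to it.
    public Interval(int high) : this(0, high) { }

    public Interval() : this(0, 0) { }
}
```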
At last, moving code from several callers into the body of the called method is a very effective code reduction. There are many variants of this refactoring, but the idea is always the same:
  • if all calls to f are followed by the same block of instructions, then push this block down into the body of f. Sometimes, you may have to add some additional parameters to f,
  • if every creation of an instance of class A is followed by a call to some method of A, then call this method directly from the constructor of A,
  • if method g is always called just after method f, then merge both methods and their parameters together,
  • if the result of a method f is always used as the argument of a method g (in other words, you have g(f(x))), then inline f inside g and change the signature of g,
  • if a method g is always called with the same constant parameter g(c), then remove the parameter and push the constant down inside g,
  • if a method g always takes a newly created object as argument: g(new MyObject(x)), this is an indication that the signature of g is not well-chosen, and the object should rather be created inside g.
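For instance, the constant-parameter variant might look like this (names are hypothetical):

```csharp
static class Joiner
{
    // Before: every call site passes the same constant separator,
    // as in JoinBefore(words, ", ").
    public static string JoinBefore(string[] words, string separator)
    {
        return string.Join(separator, words);
    }

    // After: the constant is pushed down inside the method.
    public static string JoinWithComma(string[] words)
    {
        return string.Join(", ", words);
    }
}
```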
In object-oriented programming, conditionals are usually few. Here are some techniques to decrease the conditional complexity of your code, roughly measured by its width, i.e. the maximum level of nesting:
  • try to avoid the null value. There are very few cases when you really need it. If you ensure your variables are always initialized, null testing can be removed altogether. In a later post, I will explain why I am generally not in favor of defensive programming,
  • rather than checking that a method argument is valid before performing some computation, it is nicer to filter out incorrect argument values and return early. More concretely, the following code:
    int result = 0;
    List<int> content = this.CurrentData;
    if (content != null)
    {
        if (content.Count > 0)
        {
            result = content[0];
        }
    }
    return result;
    is rewritten into:
    if (this.CurrentData == null) return 0;
    if (this.CurrentData.Count <= 0) return 0; 
    return this.CurrentData[0];
  • obviously, when the true branch of an if then else ends with a breaking instruction, the else branch is not necessary:
    if (condition)
    {
        // do something
        continue; // or break; or return;
    }
    else
    {
        // do something else
    }
    becomes:
    if (condition)
    {
        // do something
        continue; // or break; or return;
    }
    // do something else
  • when two branches of an if then else have some code in common, try to move this code out, either before or after the conditional block. Example:
    if (condition)
    {
        f(a, b, e1);
    }
    else
    {
        f(a, b, e2);
    }
    becomes:
    int x;
    if (condition)
    {
        x = e1;
    }
    else
    {
        x = e2;
    }
    f(a, b, x);
  • if a condition inside a loop does not depend on the loop iteration, then try to put it before the loop,
  • when you have a sequence with a lot of if else if else if else, try to use a switch statement instead,
  • Finally, always prefer an enum over an integer or a string: the risk of forgetting to handle a case is lower.
The presence of an enum field and several switch instructions is a good indication that a class could be broken down into a hierarchy of smaller classes with a common abstract ancestor.
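A minimal sketch of this transformation (the shape classes are invented for the occasion):

```csharp
using System;

// Before: an enum field and a switch in every method.
enum ShapeKind { Circle, Square }

class ShapeBefore
{
    public ShapeKind Kind;
    public double Size;

    public double Area()
    {
        switch (this.Kind)
        {
            case ShapeKind.Circle: return Math.PI * this.Size * this.Size;
            default: return this.Size * this.Size;
        }
    }
}

// After: one subclass per enum case; the switch disappears.
abstract class Shape
{
    public double Size;
    public abstract double Area();
}

class Circle : Shape
{
    public override double Area() { return Math.PI * this.Size * this.Size; }
}

class Square : Shape
{
    public override double Area() { return this.Size * this.Size; }
}
```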
Breaking down a large class undoubtedly improves the overall architecture of your program. However, determining which fields and methods to extract can be a hard task. Try to spot fields that tend to be used together across methods.
Get rid, as much as possible, of static fields and methods. It is always possible to write a whole program with only static methods. But then, it is easy to lose the purpose of each class. Since the class hierarchy is not driven by the data anymore, it is harder to feel the meaning and responsibility of each class. The tendency to make larger classes increases. Data tends to be decoupled from its processing code. The number of parameters tends to grow. And static fields unnecessarily use up memory... So every time you see a static method, try to move it inside the class of one of its parameters.
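Here is a sketch of that last move, with a hypothetical Rectangle class:

```csharp
using System;

class Rectangle
{
    public double Width;
    public double Height;

    // After the refactoring, the former static helper
    // GeometryHelpers.Diagonal(Rectangle r) lives inside
    // the class of its parameter.
    public double Diagonal()
    {
        return Math.Sqrt(this.Width * this.Width + this.Height * this.Height);
    }
}
```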

Applying these refactorings may not always be obvious. You will often need to first slightly bend the code to your will. Do not hesitate to reorder two instructions, remove a spurious test, or add some instructions. By doing this you will increase code regularity and often find corner case bugs. Besides, please note that I am not advocating reducing code by shortening variable names, or by removing information that is merely implicit. On the contrary, I believe variable and method names are the best documentation and should be carefully chosen. Also, even though unnecessary, I prefer to explicitly state when fields or methods are private, when classes are internal, and to prefix references to class variables and methods by this.

In this post, I presented several refactoring techniques. More importantly, I presented my canons for code beauty in the form of three principles:
  • Less is more,
  • Local is manageable,
  • Innovation is risk.
These principles give you a sense of direction when simplifying code. Because I believe a piece of software is either simple and easy to understand or wrong and bug-ridden, I perform refactoring regularly. It has become a pleasant habit and a soothing exercise. This practice keeps code complexity in check. It can also be helpful to continuously measure the evolution of code complexity. And I will cover this point in a later post. However, you should not worry about finishing code clean up quickly. Any code has quite some room for improvement. So it takes time and a strict feature freeze to achieve ultimate code beauty. As the famous verse by Boileau says:
Hasten slowly, and without losing heart,
Put your work twenty times upon the anvil.

Refactoring at work!

Here are my questions for today:
  • What principles lead your code developments?
  • Do you often perform refactoring, or do you avoid the risk of breaking code that works and would you rather patch?
  • I know my list of refactoring techniques is by no means exhaustive, what other fundamental refactorings do you like to apply?
  • Do you know of any automatic tool that either performs simplifications, or lets you know of possible code simplifications?

Sunday, May 9, 2010

Non-regression tests: camming devices for humble programmers

More than once I have been surprised by the scarcity, or even complete lack, of tests in both proprietary and open source software. After years of coding, I have learned maybe only one thing: we inescapably make mistakes. For some reason, somehow, a certain number of bugs ends up in our code:
  • maybe the specification was fuzzy,
  • or some library did not behave as documented,
  • or a corner case could hardly be anticipated...
Mistakes are simply unavoidable. But isn't it just plain stupid to make the same mistake twice? Fortunately, non-regression tests ensure that any bug previously found and fixed is permanently removed. Similarly to camming devices in climbing, they save your progress and sometimes... your life.


Without or with non-regression tests?

In this post, I explain on a concrete example how to write non-regression tests, how to add them and when. All examples are written in C#, a language I am particularly fond of. Hopefully, readers knowledgeable in any imperative language such as C, C++ or Java should have no major difficulties understanding the code. I am using SharpDevelop, an open source IDE for C# and NUnit, an open source testing framework for any .NET language, including C#. NUnit is integrated by default in SharpDevelop. Alternatively, it can also run as a standalone program.

As a starting point, let us consider a simple, hypothetical coding situation. We are working on a word manipulation library. The library exports a unique namespace WordManipulations, which contains a class Sentence. This class wraps a string and provides some simple text manipulation operations. The first step consists in adding a new project, called Tests, dedicated to non-regression tests. It references both the dynamic linked library (DLL) nunit.framework and the project WordManipulations. We then create a class Suite00_Sentence in order to group all tests related to class Sentence. We place the attribute [TestFixture] to signal to NUnit that it is a test suite. Our first test method will be called Test000_CreationFromString and is similarly tagged by the NUnit attribute [Test]:
using NUnit.Framework;
using WordManipulations;

namespace Tests
{
    [TestFixture]
    public class Suite00_Sentence
    {
        [Test]
        public void Test000_CreationFromString()
        {
        }
    }
}
At this point, the organisation of the whole solution should look like this:
  • WordManipulations
    • Sentence
  • Tests
    • Suite00_Sentence
      • Test000_CreationFromString

Let us now check everything compiles nicely and our first test executes correctly. In SharpDevelop, simply click on the play button of the unit testing side-panel. If you like it better, you can also directly use the NUnit interface. To do so, first compile the project Tests. Then go to directory Tests/bin/Debug in the Windows file explorer and double-click on file Tests.dll. This should invoke NUnit. From there, simply click Run. In both cases, a green light should indicate the test suite ran successfully.

For the next step, let us fill in the body of Test000_CreationFromString. It specifies that an instance of class Sentence can be built from a string:
public void Test000_CreationFromString()
{
    Sentence empty = new Sentence("");
}
For this test to succeed, we implement the corresponding constructor in class Sentence as follows:
public class Sentence
{
    private string content;

    public Sentence(string content)
    {
        this.content = content;
    }
}
Let us go a bit further. Say we want a method LastWord to return the index of the last word in a sentence. An additional test simulates the most simple use case of LastWord. Namely, when the input is "a b", then the expected output should be 2:
[Test]
public void Test001_LastWord()
{
    Sentence input = new Sentence("a b");
    int result = input.LastWord();
    Assert.AreEqual(2, result);
}
Class Assert from the NUnit framework provides several ways to check the expected output of tests.

An implementation of LastWord that validates this test could be for instance:
public int LastWord()
{
    int i = this.content.Length - 1;
    char currentChar = this.content[i];
    while (currentChar != ' ')
    {
        i--;
        currentChar = this.content[i];
    }
    return (i + 1);
}
However, soon somebody named Mary Poppins reports an array out of bounds on "Supercalifragilisticexpialidocious". We immediately write a non-regression test that reproduces the bug:
[Test]
public void Test002_LastWordShouldNotFailOnSingleWordSentence()
{
    Sentence input = new Sentence("Supercalifragilisticexpialidocious");
    input.LastWord();
}
Before starting a long debugging session, we simplify the test to the bone. It turns out that the current version of LastWord even fails on an empty string. So we modify our test accordingly:
[Test]
public void Test002_LastWordShouldNotFailOnEmptySentence()
{
    Sentence input = new Sentence("");
    input.LastWord();
}
To pass this test, we rewrite LastWord:
public int LastWord()
{
    int i = this.content.Length;
    do
    {
        i--;
    } while ((i > 0) && (this.content[i] != ' '));
    return (i + 1);
}
We then run all the tests written so far. They all succeed, so we can go back to the initial bug report. The method seems not to raise any exception anymore on a single word sentence. However, it does not seem to return the correct value either. So we add another test:
[Test]
public void Test003_LastWordOnSingleWordSentence()
{
    Sentence input = new Sentence("a");
    int result = input.LastWord();
    Assert.AreEqual(0, result);
}
We modify our code once more, and hopefully one last time:
public int LastWord()
{
    for (int i = this.content.Length - 1; i > 0; i--)
    {
        if (this.content[i] == ' ') return (i + 1);
    }
    return 0;
}
In the end, we produced four test cases and a much clearer source code than the one we started with. As you may have guessed, I am particularly fond of early returns. I actually believe, there is still (at least) one bug. Can you find it?

Let me draw some general rules from this example and lay out the methodology of non-regression testing. A non-regression test is a piece of code that checks for the absence of a particular bug. It should fail in the presence of the bug and succeed in its absence. Tests have various purposes:
  • robustness tests check the program does not stop abruptly,
  • functional tests check the program computes the expected value,
  • performance tests check the program runs within its allocated time and memory budget.
A test is usually composed of three main phases:
  • a prelude sets up the context and/or prepares the input,
  • an action triggers the bug,
  • an assertion checks the expected behaviour.
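Applied to the Sentence class above, a test exhibiting all three phases might look like this (the test name and expected value are mine, not from the original suite):

```csharp
[Test]
public void Test004_LastWordOnTwoWordSentence()
{
    // Prelude: set up the input.
    Sentence input = new Sentence("hello world");
    // Action: trigger the behaviour under test.
    int result = input.LastWord();
    // Assertion: check the expected value ("world" starts at index 6).
    Assert.AreEqual(6, result);
}
```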
In many cases, the prelude or assertion may not be necessary. In particular, robustness tests obviously do not need any assertion. Tests should be made as concise and simple as possible. A collection of small tests is generally preferable to one large test crowded with assertions. Tests should be independent of each other and could theoretically be run in any order. However, ordering tests by their date of introduction documents the software building process. Earlier tests tend to break less often as the software matures. In addition, ordering tests from simplest to most complex greatly speeds up later debugging. By the same logic, any test that fails should be duplicated and reduced to the smallest prefix that breaks before debugging. Since NUnit runs tests in alphabetical order, these principles lead to the following organization and naming conventions:
  • Test suites should be named after the class they target and prefixed by "SuiteXX_", where XX stands for a two-digit number,
  • Test methods should be named after the behaviour they check and prefixed by "TestXXX_", where XXX stands for a three-digit number.
There are various occasions to add tests. The most important rule is to add at least one test every time a new bug is found. Writing tests before coding is also strongly recommended. In a way these tests work as executable specifications. Among others, they help:
  • evaluate the quality of classes' external interfaces,
  • choose the role of each method,
  • explore scenarios previously identified during design.
In contrast to textual documentation, which may not be updated as fast as the code changes, executable documentation of this kind always remains true. A few other occasions I can think of off the top of my head are:
  • during code refactoring: when you fear breaking some invariant, first write some tests,
  • during code learning/review: when you must work on a particularly obscure piece of software, write some tests for every bit of understanding you painfully acquire,
  • code quality ramp up: use the report of a code coverage tool to write some tests.
That said, writing numerous redundant tests out of thin air in order to reach some test count target is simply useless. Coverage of real situations will remain low.

When following a non-regression methodology, every bug becomes the occasion to improve both specification and code quality... forever. Thus, after a while, one begins to appreciate bugs for their true worth. In addition to this profound psychological twist, non-regression testing has other impacts on the overall development process. Since at all times, the project runs without breaking its past behaviour, coding becomes a matter of incremental refinement. Project risk is managed almost to the point that the software may be wrapped and shipped any minute.


Centre Pompidou, and it is not a Fluid Catalytic Cracking Unit

On the other hand, non-regression testing also imprints its particular coding style. A bit like structural expressionist buildings, software produced with this methodology tends to expose its internal structure. In fact, to test the lower layers of a software, it is often necessary to be able to call methods, trigger events and check the value of fields that are private. In order to distinguish these probes from standard methods, I respectively name them TestRunMethodName, TestTriggerEventName and TestGetFieldName. However, the visibility of many classes, which should ideally be internal, needs to be elevated to public in order to become testable. This tends to burden a namespace unnecessarily. If you have an elegant solution to this problem, I would really like to hear about it!

To summarize this post:
  • either during feature addition or debugging, tests should be written before coding,
  • tests should be as simple and concise as possible,
  • as I like to repeat ad nauseam to my team members:
1 bug => 1 test

In some later post, I plan to talk about increasing test coverage with PartCover, explain how to systematically run non-regression tests before any change to a central Mercurial repository, and debunk the myth of untestable graphical interfaces.

My questions for you today are:
  • Do you do non-regression testing?
  • What are your practices like?
  • Which unit testing framework do you use?
  • Do you have any idea how to test classes without making them public?