Thursday, July 12, 2012

A case against static


Introduction

Some people believe that every construct of a programming language has its raison d'être. I am of the opposite school of thought: I strive to write code using extremely reduced, simple and coherent subsets of the language, because I know this practice is a major factor of code quality. If I were given the choice to remove only one keyword from the Java programming language, I would have no hesitation whatsoever: it would be the static keyword.

I will not dwell on a detailed explanation of the meaning of the static keyword; suffice it to say that it comes in four flavours: static variables, static methods, static classes and static initialization blocks. Static variables are associated with the class in which they are declared (contrast this with standard fields, which necessarily belong to an instance of the class). They can be roughly understood as global variables (with various degrees of visibility). Similarly, static methods do not need any class instance to be called, but cannot access non-static fields. Static (necessarily inner) classes are somewhat a different story which I prefer not to tell today. (In short, I despise inner classes and ban them in my coding standards.) Finally, static initialization blocks are not really worth mentioning: they are simply bug nests to avoid at all costs.

First let me stress the fact that, from a theoretical point of view at least, any program can be written with no static variables or methods at all (except for the entry point, which is by definition necessarily static). The proof is easy to sketch: group all static variables and methods as non-static members of one huge class. Instantiate this class once at the beginning of the program and then pass this instance all over the place for the other classes to use.
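This sketch can be made concrete with a minimal example (all names here are hypothetical, chosen only for illustration): the former static state lives in one Context object created at startup and passed down explicitly.

```java
// Hypothetical sketch: former static state gathered in one instance.
class Context {
    private int counter = 0;            // was: a static variable
    int nextId() { return ++counter; }  // was: a static method
}

class Worker {
    private final Context context;
    Worker(Context context) { this.context = context; } // dependency made explicit
    int work() { return context.nextId(); }
}

class StaticFreeDemo {
    public static void main(String[] args) {
        Context context = new Context();  // created once, at the entry point
        Worker a = new Worker(context);   // then passed all over the place
        Worker b = new Worker(context);
        System.out.println(a.work());     // prints 1
        System.out.println(b.work());     // prints 2: both share the same state
    }
}
```

Both workers observe the same counter, exactly as they would with a static variable, but the sharing is now visible in their constructor signatures.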
Conversely, note that a Java program with only static variables and methods looks like a C program with modules, but without pointers, structures or unions. Pointers to structures can be somewhat emulated by classes without any methods. Hardcore C programmers must now be gritting their teeth: I am well aware that Java does not allow for the power of low-level operations such as pointer arithmetic or arbitrary casts (for example from structure to array of characters). Unions, however, can be recovered with some clever tricks using inheritance.
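One such trick can be sketched as follows (the names are mine, purely for illustration): a C union of an int and a string becomes a small class hierarchy with one subclass per variant, instanceof playing the role of the tag field of a tagged union.

```java
// Hypothetical sketch of a C-style tagged union emulated with inheritance.
abstract class IntOrString { }

class IntCase extends IntOrString {
    final int value;
    IntCase(int value) { this.value = value; }
}

class StringCase extends IntOrString {
    final String value;
    StringCase(String value) { this.value = value; }
}

class UnionDemo {
    // instanceof checks replace the explicit tag field of the C idiom
    static String describe(IntOrString u) {
        if (u instanceof IntCase) return "int: " + ((IntCase) u).value;
        return "string: " + ((StringCase) u).value;
    }

    public static void main(String[] args) {
        System.out.println(describe(new IntCase(42)));        // prints "int: 42"
        System.out.println(describe(new StringCase("hello"))); // prints "string: hello"
    }
}
```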

This brief language analysis underlines the fact that really two orthogonal ways of organizing code are competing within Java:
  • pure modular procedural programming (only static methods, and non-static variables in lightweight data objects),
  • pure dependency-injected object orientation (no static variables or methods).

In practice, most programs I have encountered are a mix of these two relatively incompatible paradigms. And it shows. So I am going to make the case that writing within the pure object-oriented fragment of Java is more productive and leads to more robust code.

Problem statement

Let me start by examining some very small code snippets. Although simple, these examples are representative of the use of static methods in real-world programs.
The most extensive use of static methods is to make a service accessible from any place in the code:
Service.perform();
On the surface, nothing wrong here. However, the truth is that real programs are rarely that simple. In most concrete cases, calling a static method such as perform will simply not work. The method will throw some exception, because some internal structure needs to be initialized first. To do so, one needs to call yet another static method beforehand:
Service.initialize();
Service.perform();
This is not the end of the story yet. Method initialize will usually be called once during the setup phase of the application (or maybe worse from another class's initialize method), whereas perform gets called several times, wherever it is required. The two methods, although logically bound, are thus syntactically far apart in the code:
// initialization phase
Service.initialize();
...
// application body
Service.perform();
You may think it is not a big deal: browsing the documentation of class Service may be sufficient to quickly understand its correct usage (even though documentation has an awkward tendency to age very quickly).
But, let us now go just one step further. Consider this piece of code:
Treatment treatment = new Treatment();
treatment.process();
There is a trap in this code, hidden from the eyes of the unaware reader. What matters here is that somewhere down in the body of method process lies a call to the same static method Service.perform. This means that method process can be used only under the condition that Service has been correctly initialized.
Service.initialize();
...
Treatment treatment = new Treatment();
treatment.process();
Now multiply this pattern many fold, spread it all over your application, and you have got yourself a maintenance nightmare. Herein lies the core of the problem with static: the introduction of hidden dependencies, also called hidden temporal coupling. It is called temporal because Service.initialize must be called before perform, and hidden because no indication of this constraint is visible in the signature of either the constructor of Treatment or its method process. As a side note, the book Clean Code: A Handbook of Agile Software Craftsmanship lists hidden temporal coupling among its code smells.

Medusa by Gian Lorenzo Bernini, 1630
Do not let it petrify your code base!
 

Solution

Now contrast the previous code with the pure object oriented alternative. Instead of calling two static methods, one must first create an instance of the Service. Then method perform can be later called on this instance:
Service serviceProvider = new Service();
...
serviceProvider.perform();
The instance of Service constitutes a tangible proof that initialization correctly took place and a guarantee that method perform may be safely called.
Of course, now the programmer must make the effort to propagate this instance everywhere he wants to use the service. For instance, the serviceProvider could be passed as a parameter of the Treatment constructor:
Treatment treatment = new Treatment(serviceProvider);
treatment.process();
The dependency is made clear by the signature of the Treatment constructor. At first, this coding style may seem more demanding to some. However, it is really more relaxing: there are no surprises. The programmer can safely rely on the signatures of constructors and methods to enforce and document all dependencies.

Now that I have presented the problem and its solution, let me list the multiple reasons why static, and in particular hidden temporal coupling, is bad for your health.

Perseus slaying Medusa
by Laurent-Honoré Marqueste, 1876

The several evils of static

Steep learning curve

Of all the problems related to static, this may seem, in theory, the least worrisome. However, in practice, it will probably consume developers' time the most. Whenever a coder needs to work on a piece of code he did not write in the first place, he will have a hard time discovering the implicit dependencies. Unless the code is very well documented (and the documentation is kept up to date), he will probably need to ask someone more experienced which static methods must be magically called. He will not be able to rely on automatic completion either, since static methods are not carried by the data they operate on.

Costly component reuse

Calling static methods makes code reuse very costly. But that becomes noticeable only at the last minute, when actually trying to extract a piece of code and incorporate it elsewhere. You will have a hard time getting the result to compile and execute correctly. Only then will it become clear how heavily the displaced piece of code relied on hidden dependencies. Every single static call must be painstakingly tracked down and replaced by some alternative provided by the new execution context. Another solution may seem faster to implement but is even less acceptable: dragging all the dependencies into the new context. To state it bluntly, highly coupled code cannot be cheaply reused.

Increased risks of memory leaks

The deallocation of static variables (and everything that may transitively hold onto them) has to be explicitly taken care of. Programmers familiar with garbage-collected languages (myself included) tend to easily forget this issue. In our defense, too many details must be handled:
  • to write a method which releases memory,
  • not to forget calling it everywhere it is needed.
By contrast, the memory graph held by non-static variables simply becomes collectable when program execution leaves their scope. If the documentation is not accurate, programmers may be totally unaware that a static method allocates permanent memory, which can easily lead to memory leaks. In this case again, the API lies!
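As a minimal illustration (class names hypothetical), a static collection silently retains memory for the whole program lifetime, while its instance-based counterpart becomes collectable together with its owner:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a static cache that is never cleared.
class LeakyService {
    private static final List<byte[]> cache = new ArrayList<>();

    static void perform(byte[] payload) {
        cache.add(payload); // retained until the JVM exits, unless explicitly released
    }

    static int retainedCount() { return cache.size(); }
}

// The instance-based variant: the whole cache is freed once the instance is unreachable.
class ScopedService {
    private final List<byte[]> cache = new ArrayList<>();
    void perform(byte[] payload) { cache.add(payload); }
    int retainedCount() { return cache.size(); }
}

class LeakDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) LeakyService.perform(new byte[1024]);
        System.out.println(LeakyService.retainedCount()); // 3, and it stays 3 forever

        ScopedService scoped = new ScopedService();
        scoped.perform(new byte[1024]);
        // when 'scoped' goes out of scope, its cache becomes garbage collectable
    }
}
```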

Testability issues

Unless you still apply software development techniques from the previous century, you write unit tests. Static methods and variables make unit tests a whole lot harder to write:
  • dependencies must be discovered in order to correctly set the initial state,
  • memory not released in a test may impact the next one.
Worst of all, static methods do not provide seams. To write unit tests, it is often necessary to replace dependencies with fake implementations. For instance, suppose you are testing a piece of code which emits orders to a printer:
Component c = new Component();
c.process();
Suppose then that method process performs the printing order with the static call sendPage:
void process() {
    Result result = this.doSomething();
    Page pageToPrint = this.presentResult(result);
    Printer.sendPage(pageToPrint);
}
You would rather not empty another printer ink cartridge every time the test suite is executed. However, there is no easy way to change the behavior of method sendPage for the duration of the test only. One way, which I clearly do not recommend, would be to add yet another static method setImplementation to class Printer. Then the test would go like this:
FakePrinter fakePrinter = new FakePrinter();
Printer.setImplementation(fakePrinter);
Component c = new Component();
c.process();
The much more straightforward solution is to have the constructor for Component depend explicitly on an interface of a printer, which may either be the real Printer (in production code) or a fake (in test code).
FakePrinter fakePrinter = new FakePrinter();
Component c = new Component(fakePrinter);
c.process();
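Such a seam could be sketched like this (the interface and class names are hypothetical, not taken from any real API):

```java
// Hypothetical seam: the component depends on an abstraction, not on a static Printer.
interface PagePrinter {
    void sendPage(String page);
}

class FakePagePrinter implements PagePrinter {
    String lastPage; // recorded so the test can inspect what would have been printed
    public void sendPage(String page) { lastPage = page; }
}

class PrintingComponent {
    private final PagePrinter printer;
    PrintingComponent(PagePrinter printer) { this.printer = printer; } // dependency is explicit

    void process() {
        String result = "result";            // stands in for doSomething()
        printer.sendPage("page: " + result); // no ink wasted when printer is a fake
    }
}

class SeamDemo {
    public static void main(String[] args) {
        FakePagePrinter fake = new FakePagePrinter();
        new PrintingComponent(fake).process();
        System.out.println(fake.lastPage); // prints "page: result"
    }
}
```

In production code, the same PrintingComponent would simply receive an implementation that drives the real device.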
Other examples in the same vein could include a logger whose state you would like to check, the queue of a thread runner which you would like not to fill, files which you would rather not create...

Some hard to track bugs

Static methods and variables are a source of bugs of the hard kind. Let me illustrate with a real case I once stumbled upon. The application had a configuration service implemented with static methods. In order to retrieve the string value of a property, one would call:
String Configuration.getValue(String key);
The service also had an initialization method:
void Configuration.initialize(InputStream file);
Initialization would read all the configuration key-value pairs present in the input stream and fill a hash table. Calling getValue after initialization would return the property configured by the user. However, calling getValue before initialization would return some default value (most of the time adequate, but possibly different from the user's wish). Obviously, method Configuration.getValue was called all over the place, even in the program initialization phase. So, after some code refactoring, I unknowingly moved a call to getValue before the initialization phase. This bug was found very late, because no regression test covered this particular value and everything seemed to work fine with the default value. It also took some time to pinpoint the root cause of the problem.
Without static, this problem can simply not arise:
Configuration configuration = new Configuration(file);
...
configuration.getValue("some-key");
Simply because one must first hold an instance of Configuration in order to read any configuration value. And the configuration file is necessarily read when calling the constructor. This category of bugs is a classic consequence of hidden temporal coupling.
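A minimal sketch of such an instance-based configuration, here backed by java.util.Properties (the class name is hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical sketch: the constructor reads the stream, so merely holding
// an instance proves the configuration was loaded.
class AppConfiguration {
    private final Properties properties = new Properties();

    AppConfiguration(InputStream stream) throws IOException {
        properties.load(stream); // initialization cannot be skipped or reordered
    }

    String getValue(String key) {
        return properties.getProperty(key); // null only when the key is absent
    }

    public static void main(String[] args) throws IOException {
        InputStream stream =
            new ByteArrayInputStream("some-key=some-value\n".getBytes());
        AppConfiguration configuration = new AppConfiguration(stream);
        System.out.println(configuration.getValue("some-key")); // prints "some-value"
    }
}
```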
Similarly, memory leaks may also cause costly bugs, found late in the development cycle.

Architecture erosion

Static methods and variables are by their very nature global: they can be easily accessed from anywhere in the application. Pressed by time, developers may be tempted to use these handy static methods without paying their true cost upfront: carefully thinking about the overall architectural logic. Doing so, they introduce additional, hidden, dependencies. The application architecture quickly decays.

Unnecessary coupling

What you don't see doesn't bother you... until it hurts you. Hidden coupling is bad precisely because it is hidden: you won't spend time auditing dependencies and cleaning them up. With time, unnecessary coupling will undoubtedly increase without you even taking notice. So removing static methods should be a top priority. It will take time. You will discover unexpected, sometimes frightening links in your application. But at least, once dependencies are explicit, you can work on them: move them around, remove some, divide others... In the end you will get minimally coupled, tight and focused pieces of code.

Static propagates static

Finally, I am under the impression that static leads to more static. This may be caused by the fact that static methods cannot call non-static methods or access instance variables. So when a developer needs to extend the behavior of a static method, he may feel stuck. Instead of trying to remove the static method, he may choose the path of least resistance and simply add more static methods or fields. He will then gradually encounter more difficulties writing truly object-oriented code. For instance, inheritance will not be possible. He will make more data public, losing encapsulation. He will write more code like this:
A.fill(b);
A.process(b);
A.print(b);
At this point, he ends up trapped in a C style of programming where objects are used as passive data structures.

Acceptable uses of static

For the sake of balance, there are, in theory, some harmless uses of static. They all obey two conditions:
  • no hidden temporal coupling,
  • no global mutable state. 
In other words: stateless. Let me list all the examples I can think of.

Constants

When final, static variables are acceptable. Strings and integers fall into this category. However, non-literal final data structures (such as hash tables) do not, since their content varies throughout the life of the application. Hand-crafted enums, implemented as several constant objects, are also valid. Loggers may be admissible, even though static loggers become a problem as soon as you wish to mock them for testing purposes.
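A short illustration of the boundary (names hypothetical): the first two constants below are harmless, while the final hash table is global mutable state in disguise:

```java
import java.util.HashMap;
import java.util.Map;

class Constants {
    // Acceptable: immutable literal constants.
    static final String APPLICATION_NAME = "demo";
    static final int MAX_RETRIES = 3;

    // Not acceptable: final only pins the reference; the content stays mutable.
    static final Map<String, String> CACHE = new HashMap<>();
}

class ConstantsDemo {
    public static void main(String[] args) {
        Constants.CACHE.put("key", "value"); // compiles fine: global mutable state
        System.out.println(Constants.MAX_RETRIES); // prints 3
    }
}
```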

Fresh results

Static methods which return a new result every time are benign. They often provide alternative constructors. For instance, a Matrix class may have a default constructor producing only zeros, as well as a static method identity to build a matrix with its diagonal filled with ones. Careful though, because singletons with global mutable state are only one step away. So, in this particular case, I would rather have an instance of a MatrixFactory with a non-static buildIdentityMatrix method.
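The factory alternative might look like this (a sketch with hypothetical names, not a definitive design):

```java
// Hypothetical sketch of the factory alternative to a static Matrix.identity.
class Matrix {
    final double[][] cells;
    Matrix(int size) { cells = new double[size][size]; } // all zeros by default
}

class MatrixFactory {
    // Returns a fresh matrix on every call: no shared state, no hidden singleton.
    Matrix buildIdentityMatrix(int size) {
        Matrix matrix = new Matrix(size);
        for (int i = 0; i < size; i++) matrix.cells[i][i] = 1.0;
        return matrix;
    }
}

class MatrixDemo {
    public static void main(String[] args) {
        Matrix identity = new MatrixFactory().buildIdentityMatrix(2);
        System.out.println(identity.cells[0][0] + " " + identity.cells[0][1]); // 1.0 0.0
    }
}
```

Being an ordinary instance, the factory can also be replaced by a fake in tests, which the static method cannot.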
By the way, about singletons, Misko Hevery wrote a very well-thought-out and exhaustive piece underlining their dangers here.

Pure methods

Static methods which work only on the state carried by their arguments are also, in theory, non-lethal. A nice concrete example is the set of assertion methods provided by the JUnit framework (for instance Assert.assertEquals). However, most of the time, having such methods is a sign of bad design: the method should be carried by the object it modifies. The only acceptable exception could be final (sealed in C#) classes, whose behavior cannot be modified by inheritance.
But even in this case, I would rather build a manager or encapsulate than add static methods. In C#, there is also always the solution of extension methods. But are they a good thing? For lack of experience with this language construct, I haven't made up my mind yet.

Program entry point

Whether you like it or not, you can't escape the fact that the program entry point in either C# or Java is a static method!

Head of Medusa
by Peter Paul Rubens, 1617
Let her rest in peace!

Conclusion

With the notable exceptions of the program entry point and literal constants, all uses of the static keyword should be banned. At first, static methods and variables may seem convenient, especially for lazy programmers. But the cost is simply too high: hidden temporal coupling will rot away your program. The consequences range from a steep learning curve, decreased reusability and poor testability to rigid design. The alternative is to explicitly trace dependencies with object instances propagated through constructor or method parameters. In return, the signature of each class naturally documents all its dependencies.

Wednesday, May 2, 2012

Automatic tests and system library: a short riddle

Today, let me offer a slightly different kind of post. For once, I will not give any advice but rather propose a riddle for you to solve.

As I have already mentioned in a previous post, regularly running automatic test suites (à la JUnit) is a guarantee against program decay. With a continuous integration server such as Jenkins, reports periodically built from test results present an up-to-date status of the code condition. Test suites are a barrier against regression, thus enabling sometimes extremely aggressive code refactoring. Some teams even use automatic tests as a non-perishable form of documentation, or alternatively as executable specifications. The cost of the manual verification phase can also be reduced by progressively converting the most repetitive testing scenarios into automatic tests. In short, automatic tests let you attain a surprisingly high level of software quality.

So it is considered good practice to add at least one automatic test for every bug found and fixed. Adopting test-driven development may be even more rewarding, but that is another story. However, making a manual verification scenario totally automatic may sometimes prove particularly tricky. This is, in particular, the case for any piece of software which depends on some low-level library over which the programmer has no control whatsoever. Herein lies the crux of today's riddle: how to write automatic tests for code relying on non-modifiable external libraries.

Let us consider a concrete Java example:
public class Program {
    public static void main(String[] arguments) throws Exception {
        if (arguments.length < 1) return;
        Program program = new Program();
        program.process(arguments[0]);
    }

    public void process(String fileName) throws Exception {
        ...
        FileInputStream file = new FileInputStream(fileName);
        ...
        int value = file.read();
        ...
        file.close();
    }
}

During its execution (method process), this program opens the file (new FileInputStream(fileName)) whose name is passed as argument to the entry point (method main). It then reads some content from this file (int value = file.read()).
Let us suppose the validation team found the following problem: during one of its runs, the program brutally stopped. The call to method read unexpectedly threw a java.io.IOException. This may happen, for instance, when accessing a remote file system which suddenly becomes unavailable because of a severed cable. This is clearly an execution context which is difficult to systematically reproduce at each run of an automatic test.

How would you write an automatic non-regression test which replays this scenario?


Oedipus and the Sphinx
by Ingres, 1808
by Gustave Moreau, 1864
by Salvador Dali, 1960



Saturday, March 31, 2012

On exceptional virtues

Some time ago, a friend of mine pointed me to this post by Joel on Software. I usually tend to agree with Joel's opinions. But I couldn't align with this piece, which is strongly in favor of returning error codes rather than throwing exceptions. So I am going to propose the counterpoint and explain why I'd rather use exceptions.

First, let me get this straight: even though I happen to know C quite well, C++ on the contrary is a foreign and feared programming language to me. So please keep in mind that what I have to say applies mostly to garbage-collected languages such as C# or Java. Languages must be understood as a whole, and some traits can be incompatible. And so it goes with exceptions too.

I believe programs which use error codes tend to be more verbose and intricate. Consider these two programs:
try
{
    input = read();
    result = process(input);
    output(result);
}
catch (SomeException)
{
    report_error();
}
and
input = read();
if (input == null)
{
    report_error();
    return;
}
result = process(input);
if (result == null)
{
    report_error();
    return;
}
if (!output(result))
{
    report_error();
}
So which one would you rather have?
It can even get worse, as half the population of developers seems to favor functions with a single exit point. The previous example then becomes:
input = read();
if (input != null)
{
    result = process(input);
    if (result != null)
    {
        if (!output(result))
        {
            report_error();
        }
    }
    else
    {
        report_error();
    }
}
else
{
    report_error();
}
We just made a 10-line function span over 20 lines. So is my brain under-performing? Or does using error codes increase the cognitive load? One thing is for certain: avoiding exceptions forces you to violate two important programming principles:
  • Don't Repeat Yourself: because there are duplicate versions of the error handling code.
  • Single Responsibility Principle: since the logic of the program is interwoven with the management of exceptional cases.
Some C programmers overcome their inbred aversion to gotos in order to mimic exceptions and write code such as this:
  input = read();
  if (input == null) goto error;
  result = process(input);
  if (result == null) goto error;
  if (output(result)) goto end;
error:
  report_error();
end:
  return;
These programmers surely seem to agree with me.

Let us now suppose for a moment that we still go for the lengthy version. After all, having more lines of code makes our productivity statistics look good, doesn't it? Now we are faced with another problem: how are we going to return an error code from functions which already return a value? We have several alternatives, each with its own set of unwanted consequences:
  • as we just saw in the previous example, we can use null to encode an error,
  • encode the error codes inside the return value,
  • or use an out parameter to store the return value of the method.
Obviously, I wouldn't advise using null, since forgetting a check for null most probably results in a null pointer exception raised later in the code. And you are back to square one. Additionally, the exact interpretation of null may sometimes be blurry. Let me give you a concrete example from the Java standard API. Interface Map from java.util has a method get which returns the value mapped to a given key, or null if the key is not mapped to any value. However, and herein lies the ambiguity, if the map permits null values, then null may either represent a mapping or no mapping at all. It is then advised to use method containsKey to distinguish between these cases. What an insecure API! It is safer for the method's user to either disallow null values, or adopt the solution used by the C# class Dictionary: raise a KeyNotFoundException!
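The ambiguity can be demonstrated in a few lines:

```java
import java.util.HashMap;
import java.util.Map;

// Demonstrates the ambiguity of Map.get returning null.
class MapNullDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("present", null); // HashMap allows mapping a key to null

        // Both calls return null, for two different reasons:
        System.out.println(map.get("present")); // null: the key is mapped to null
        System.out.println(map.get("absent"));  // null: the key is not mapped at all

        // Only containsKey can tell the two cases apart:
        System.out.println(map.containsKey("present")); // true
        System.out.println(map.containsKey("absent"));  // false
    }
}
```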
The C# Dictionary interface proposes an alternative way to fetch a value via the TryGetValue method. This method returns a boolean to indicate the presence of a mapping and uses an additional out parameter to pass the value. This practice has its benefit: it avoids the sometimes (but rarely) prohibitive cost of throwing and catching an exception. I will briefly come back to this point later in this post. However, in general, it is not good practice to modify the state of a parameter in a method which has a return value. Indeed, I believe the separation of methods into either commands or queries, as described in the book Clean Code: A Handbook of Agile Software Craftsmanship, to be sane. A quote from this book goes:
Functions should either do something or answer something but not both. Either your function should change the state of an object, or it should return some information about that object.
Sometimes the error code needs to contain more information than just a boolean state. Some coders find it smart to encode the error state in the same type as the one used to store correct values. For instance, if a method normally returns positive integers, then negative integers could be used as error codes. This is a terrible practice, which more often than not leads the caller to not handle the error code at all.
Contrast these convoluted solutions with the simplicity of wrapping any error information into an exception. Plus you get the stack trace for free!
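For instance, a custom exception can carry any extra error detail along with the message and the stack trace (a hypothetical sketch, names invented for illustration):

```java
// Hypothetical sketch: all error context travels inside one exception object.
class ProcessingException extends Exception {
    final int errorCode; // any extra detail rides along with the message

    ProcessingException(String message, int errorCode) {
        super(message);
        this.errorCode = errorCode;
    }
}

class WrapDemo {
    static int process(int input) throws ProcessingException {
        if (input < 0) throw new ProcessingException("negative input: " + input, 42);
        return input * 2;
    }

    public static void main(String[] args) {
        try {
            process(-1);
        } catch (ProcessingException e) {
            // message, error code and stack trace are all available in one place
            System.out.println(e.getMessage() + " (code " + e.errorCode + ")");
        }
    }
}
```

The return type of process stays clean: no magic negative values, no out parameters.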

At last, error codes must necessarily be dealt with by the caller. Sometimes the caller does not have anything reasonable to do, and so she must bear the burden of propagating errors upwards. Not only that, she must also be careful not to forget handling any error code. Exceptions, instead, automatically climb the call stack until they are caught. In the worst case, all exceptions may end up in a catch-all at the top level of the application. Also, in Java, exceptions which do not inherit from RuntimeException must be declared in the method signature. No more silly mistakes!

There may be only one reason to prefer error codes over exceptions, and that is performance. Even though using exceptions does not change the complexity of your algorithm, they are said to be very costly in both C# and Java. Well, let us put hard numbers on this belief.
I wrote two versions of a program which tries to retrieve several values from an empty dictionary. The first version throws exceptions:
Dictionary<int, int> dictionary = new Dictionary<int, int>();
int count = 0;
for (int i = 0; i < N; i++)
{
    int result;
    try
    {
        result = dictionary[i];
    }
    catch (KeyNotFoundException)
    {
        result = -1;
    }
    count += result;
}
Whereas the second program returns error codes:
Dictionary<int, int> dictionary = new Dictionary<int, int>();
int count = 0;
for (int i = 0; i < N; i++)
{
    int result;
    if (!dictionary.TryGetValue(i, out result))
    {
        result = -1;
    }
    count += result;
}
I ran both programs with the loop upper bound N successively equal to 1, 2, 4 and 8 million! Here are the durations of each run in milliseconds:

N (in million)        1        2        4        8
Error codes          46       78      202      343
Exceptions       111774   221208   442057   882788

Without any doubt, error codes win the speed test, being roughly 2500 times faster than throwing and catching exceptions. However, keep in mind we are experimenting with exceptions by the million: a quarter of an hour to process 8 million exceptions does not seem outrageous. So unless you are in a case where exceptions are both frequent and on a critical path of your program, you shouldn't bother. As always, premature optimization is the root of all evil.

To conclude, I hope I convinced you that using exceptions lets you avoid some code duplication, enables separation of concerns, improves methods API and decreases the risk of forgetting error cases.

To sum up, here are the take-away lessons from this article:
  • always prefer exceptions to return codes,
  • in Java, try to avoid RuntimeExceptions,
  • optimize exceptions out only if really necessary.

Dali, The Temptation of St. Anthony
or exceptions propagating up the program?

Sunday, October 30, 2011

Jenkins: trend is destiny

Introduction
Every project is driven by the strong demand for more features. Be it the evil manager (who juggles three mobile phones to extend unattainable deadlines), the evil boss (who signs your pay-check), or the evil customer (who ultimately funds the project), they all want more and different. Left to these forces alone, any project rapidly decays into contradictory layers of increasing complexity. Fortunately, as experienced developers we all know better: we constantly capitalize on our progress with non-regression tests and fight entropy through code refactoring.
It is a good thing to know where to go; it is even better to know where we stand. And this post is just about that: how to quickly grasp and continuously monitor the global state of your software project. To do so, I will show you how to configure a job in Jenkins. Jenkins is an open-source continuous integration server. In more mundane terms, it allows you to:
  • periodically run tasks on the latest version of the code base,
  • access the results through a web interface. 
Here are some of the things Jenkins can do for you:
  • build the project,
  • run all non-regression test suites,
  • run static analysis tools,
  • present tool results,
  • plot metric trends (line count, test count, warning count...),
  • package the product.
So at all times you get a synthetic view of your project, its general health and the progress being made. Since Jenkins can even package your software into a product, you can almost ship at any time! As I hope to demonstrate in this tutorial, Jenkins has a low initial investment cost (which can moreover be paid gradually) and then runs all by itself. It is for the most part extremely intuitive.

Tutorial
I will be working with Ubuntu (Oneiric Ocelot). Since Jenkins is a Java application, most of the tutorial should still apply to other platforms. However, this post assumes you know how to install software packages for your distribution (with apt-get, aptitude or synaptic, for instance). The easiest way to install Jenkins is to get it from the Ubuntu Software Center. With older versions of Ubuntu, you may need to first add the Jenkins repository to your software sources and then install with aptitude:
wget -q -O - http://pkg.jenkins-ci.org/debian/jenkins-ci.org.key | sudo apt-key add -
sudo sh -c 'echo deb http://pkg.jenkins-ci.org/debian binary/ > /etc/apt/sources.list.d/jenkins.list'
sudo aptitude update
sudo aptitude install jenkins
This page explains the installation process in more detail. Now you should be able to access the Jenkins web interface by following this link: http://localhost:8080/. If this does not work, then maybe you need to start the service first:
sudo service jenkins start
As a running example, I will be using the Hime Parser Generator. As indicated by its name, Hime automatically generates C# parsers from context-free grammars (much like yacc). But really, the purpose of Hime is irrelevant to the discussion of setting up a job in Jenkins. Hime is written in C#. This will not cause any problem: Jenkins is language agnostic, even though many of its plugins target the Java world.

A first build
Let us create our first job. To do so, simply click on New Job from the Jenkins entrance page. Name the job Hime.Parser. The job data is going to be stored on disk in directory:
/var/lib/jenkins/jobs/Hime.Parser
From experience, it is preferable to avoid spaces in job names; otherwise, some Jenkins plugins may display awkward behaviours. Then select the 'Build a free-style software project' option and click OK. You will be brought to the configuration page of the newly created job. Let us ignore this page for the time being and simply scroll down to click Save. You are now at the main control page of the job. For now, the options listed on the left side are fairly limited: basically, we can either Delete Project or Build Now. Let us launch our first build. The first build, build #1, should soon appear with its status (a blue ball) and build date in the 'Build History' box.

Checking out the code
For the second build, we are going to fetch the project source code from its repository. Hime's code is hosted on CodePlex and tracked with Subversion (even though I would prefer Mercurial). Let us go back to our job's configuration page and choose Subversion in the 'Source Code Management' section. Then, we enter the following as the 'Repository URL':
https://himeparser.svn.codeplex.com/svn/Hime/trunk
Next, select 'Emulate clean checkout by first deleting unversioned/ignored files, then 'svn update'' as the 'Check-out Strategy'. This strategy is safer than a plain svn update, as it starts each build with a clean repository, yet much more efficient than always checking out a full-blown fresh copy. Then save and launch a new build. While it is running, you can click on the build and check the Console Output to follow what is going on. You should see something like:
Started by user anonymous
Checking out a fresh workspace because there's no workspace at /var/lib/jenkins/jobs/Hime.Parser/workspace
Cleaning workspace /var/lib/jenkins/jobs/Hime.Parser/workspace
Checking out https://himeparser.svn.codeplex.com/svn/Hime/trunk
A Hime
A Hime/trunk
A Hime/trunk/Exe.HimeCC
...
A Hime/trunk/Tests/Tests.csproj
A Hime/trunk/TraceAndTestImpact.testsettings
A Hime/trunk/VersionInfo.cs
At revision 11108
Finished: SUCCESS
From the job main page, you can now browse the project source code by clicking on Workspace.

Tracking line count and installing plugins
Now we are going to use Jenkins to track a very simple yet important metric: line count. To do so, we will use the external program sloccount and the corresponding Jenkins plugin. Before going any further, let me first explain the Jenkins execution model. Each job execution is split into two major phases: 'Build' and 'Post-build'. During the 'Build' phase, external tools are called and may produce results as files (mostly .xml, but any format is possible). Once the 'Build' phase completes, the 'Post-build Actions' execute. Most of them read some result file produced during the previous phase and convert it into visual elements displayed in the web interface. Jenkins has a great variety of plugins, both for build actions (shell, python, ruby, ant scripts, you name it...) and for post-build actions.
Plugins are managed in the 'Update Center' which can be accessed from Jenkins entrance page by clicking on Manage Jenkins, then on Manage Plugins and at last on the 'Available' tab. For the time being, select Jenkins SLOCCount Plug-in from the list of all available plugins. (If there seems to be no plugin available, then you may need to check for updates manually. This can be done in the 'Advanced' tab). Install and restart Jenkins.
If you now go to the configuration page of the Hime.Parser job, you will notice that the option 'Publish SLOCCount analysis results' has appeared among the 'Post-build Actions'. Tick it. In addition, we must of course invoke sloccount. In the 'Build' section, add a build step of type 'Execute shell' with the following command:
sloccount --details --wide --duplicates VersionInfo.cs Lib.CentralDogma Exe.HimeDemo Lib.Redist Exe.HimeCC > sloccount.sc
As you may well know, sloccount counts .xml files too. So it is a good idea to run it as one of the first build steps: at this point, the workspace is still clear of any .xml result files that other build steps may produce.
Save the new configuration and launch a build. A new link, SLOCCount, should have appeared in the left menu. Clicking it brings up a sortable table which displays all file sizes. If you build a second time (and refresh the page), Jenkins plots the graph of total program size per build on the main page of the job. For now, it should naturally show a flat trend.

Compiling and building regularly
Back to the configuration page again, this time to set up the compilation of the whole project. We add a shell script in the 'Build' section which uses mdtool to compile all projects:
mdtool build -p:Lib.Redist -c:Release
mdtool build -p:Lib.CentralDogma -c:Release
mdtool build -p:Exe.HimeDemo -c:Release
mdtool build -p:Exe.HimeCC -c:Release
mdtool build -p:Tests -c:Release
Until now, we have launched all our builds manually. It would be more convenient to have Jenkins start a build every time the code changes. So we turn to the 'Build Triggers' section: we select 'Poll SCM' and have Jenkins poll the Subversion repository every quarter of an hour, following the cron syntax:
*/15 * * * *
If you want more details about the syntax, just click on the question mark next to the field: Jenkins offers very comprehensive inline help. Some people prefer to build at regular intervals, such as every night; this is done by opting for 'Build periodically'. Also notice the 'Build after other projects are built' option, which is useful when your software product is composed of several small projects that depend on each other.
Save and launch a build. If you like, you can follow the mdtool compiler's progress in the Console Output. This time, click on trend next to the 'Build History'. This page displays a 'Timeline' with all builds and plots each build's duration. As you can check, the last build lasted much longer than the previous ones.
At this point, the benefit of Jenkins is still modest. However, it will now warn you, by failing the build, every time a code modification breaks the compilation.

Executing non-regression tests
This is probably the most important part of continuous builds: running tests frequently. Remember how we just compiled all the projects in Hime. Among them was the project Tests, which contains all the NUnit test suites. We will run them with nunit-console and then let Jenkins display the results. First, install the Jenkins NUnit plugin from the Update Center. Then go back to the configuration page of the job and add a new shell step in the 'Build' section with the following content:
nunit-console -noshadow Tests/bin/Release/Tests.dll
Then tick 'Publish NUnit test result report' among the 'Post-build Actions' and fill it with:
TestResult.xml
Now build, twice. If you reload the page after the builds complete, a 'Test Result Trend' graph should appear. This graph plots the number of tests over time: successful tests in blue, ignored in yellow and failed in red. If you click on the graph, you can browse down the test hierarchy and see the duration and status of each test.

Displaying static analysis reports
Back to the Update Center, let us install plugins to display results of static analysis tools. In order to get the list of compiler warnings, the list of tasks (TODO, FIXME, XXX... comments present in code) and the results of Gendarme, we need to install plugins Warnings Plug-in, Task Scanner Plugin, Jenkins Violations plugin and Static Code Analysis Plug-ins.
Then we set up each step, as usual, from the configuration page of the Hime.Parser job.
For compiler warnings, select 'Scan for compiler warnings' among the 'Post-build Actions'. There is no ad hoc parser for mdtool output, so we choose 'MSBuild' to 'Scan console log'.
Then, for tasks, select 'Scan workspace for open tasks' and fill 'Files to scan' with a fileset representing all C# source files:
**/*.cs
Also, fill 'Tasks tags' with TODO as a 'Normal priority'.
As for Gendarme, it is first run in a shell script with the following command:
gendarme --config gendarmeConfig.xml --xml gendarmeResults.xml Exe.HimeCC/bin/Release/himecc.exe Exe.HimeDemo/bin/Release/HimeDemo.exe Lib.Redist/bin/Release/Hime.Redist.dll Tests/bin/Release/Tests.dll Lib.CentralDogma/bin/Release/Hime.CentralDogma.dll || exit 0
Note that we are using rule filters stored in the file gendarmeConfig.xml on the Hime repository. Also, the command needs to be combined with || exit 0, so that the build does not fail on the first Gendarme warning. In the 'Post-build Actions' section, tick 'Report Violations' and fill the gendarme 'XML filename pattern' with:
gendarmeResults.xml
We are now ready to save, build (twice again, to get trends) and check the result. We get three new trend graphs, all clickable. Clicking on 'Open Tasks Trend' brings up a list of all open tasks in the project, down to the source file. Similarly for 'Compiler Warnings Trend' and 'gendarme'. Note that the MSBuild parser sometimes fails to extract the correct file name or warning description from the compilation output. This is because of line breaks that mdtool inserts to keep column width under a certain limit. I would really like to know how to avoid this bug...

Packaging
The last action of this tutorial will be to add a minimal packaging process to our job. It will simply create a zip file with all the content necessary for end-product delivery. Back to adding a new shell script in the configuration page:
# remove leftovers from previous builds so the step can be re-run
rm -rf himecc_latest himecc_latest.zip
mkdir himecc_latest
cp Exe.HimeCC/bin/Release/* himecc_latest
cp Lib.Redist/bin/Release/Hime.Redist.xml himecc_latest
zip -r himecc_latest.zip himecc_latest
Then tell Jenkins to present the resulting archive on the main page of the job by ticking the post-build action 'Archive the artifacts'. Set himecc_latest.zip as 'Files to archive'. Let us build one last time and check the downloadable artifact.
  
Going further
By now, you must have gotten a fairly good grip on Jenkins. Going further is easy and can be done through gradual improvements. For instance, test coverage trends are pretty much necessary, and so is code duplication detection.
Here are some plug-ins which may prove useful:

Conclusion
I have found that the two most important trends to follow in any project are its line count and its test count. The test count needs to increase while the line count should decrease relentlessly. Keeping these two trends steady is the best recipe to attain (almost) flawless software. Doing so, you may realise how much trend is, in fact, destiny.
Visualising important code metrics in Jenkins creates a positive feedback loop: the impact of any code modification can be sensed almost immediately, and the project stays the course.

Here are some more metrics that cannot, to the best of my knowledge, be tracked in Jenkins and that would be great to have:
  • More information from the bug-tracker, such as the time taken to complete each issue, the number of implemented and remaining features...
  • The time elapsed between the discovery of two consecutive bugs,
  • The ability to sort tasks (TODO comments) according to their age (i.e. the number of commits for which they have existed).
Here is my set of questions for you today:
  • Do you use a continuous integration server?
  • Which one do you prefer?
  • What about sonar?
  • If you are using Jenkins, which plug-ins do you find helpful?
In a later post, I would like to talk about the following topics:
  • How to drive Jenkins from a script using its command-line interface,
  • How to write your own plug-in for Jenkins, in particular to follow test coverage statistics as computed by opencover


Some metrics for project Gaia. Trend is destiny?

Thursday, June 2, 2011

Testing internal classes: quality and minimal interface at the same time

Today's entry is brief. I just found a very satisfactory answer to one of my everlasting questions about non-regression testing. Namely, as stated in this previous post: how is it possible to test internal classes while keeping all NUnit test code in a separate project?
Well, the InternalsVisibleTo attribute from System.Runtime.CompilerServices lets you do just that. Let's say your test project is called TestSuite; then all you need to do is add the following line to AssemblyInfo.cs in the project under test:
[assembly: InternalsVisibleTo("TestSuite")]
Now all internal classes and methods are visible to the test methods in TestSuite. So you can do deep non-regression testing of an assembly and still present a minimal interface to your users. Pretty neat, isn't it?

If I am not mistaken, this problem does not arise in Java. In Java, the notion equivalent to assembly-internal is package-private. So even though the JUnit code lives in a distinct project, it suffices to declare it as part of the same package as the project under test to gain full access to all package-private classes and methods.
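The same-package trick can be sketched as follows. This is a hypothetical example (the Grammar and GrammarTest names are invented); both classes declare the same package (here the default package), which is exactly what gives the test direct access to package-private members:

```java
// Hypothetical sketch: the class under test and its test share a package,
// so the test can reach package-private members. In a real setup the test
// class lives in the JUnit project but declares the same package as the
// code under test.
class Grammar {                      // package-private production class
    int ruleCount() { return 3; }    // package-private method under test
}

public class GrammarTest {
    public static void main(String[] args) {
        Grammar g = new Grammar();
        // direct access compiles because both classes share the package
        System.out.println("ruleCount = " + g.ruleCount());
    }
}
```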

As often, stackoverflow has it all: this answer originates from there.

René Magritte, Decalcomania

Wednesday, March 9, 2011

Coding rules for clean and robust C# code

Introduction
This post presents my coding standards for writing clean and consequently robust C# code.
As I stated in a previous post, more time is spent understanding, debugging and reading code than actually writing it. So it is often worthwhile to spend some effort writing clear code upfront.

These standards have thus two main objectives, namely to:
  • set a limit on code complexity, be it in size or shape, so that all coding tasks always remain feasible,
  • establish a normalized way of writing code, so that anybody can quickly understand any part of the project.
In an ideal world, there would be only one best way of writing any piece of code, and the meaning of the code would be plain obvious at first read. Following the rules listed in this post is, in my opinion, a step towards this ideal.

I am not claiming this is the only valid set of coding rules, or even that it is complete. However, I have seen many ways to fail a software project and following these rules surely limits risks.
Even though some rules may seem arbitrary, they are all motivated by (sometimes unfortunate) past experiences. So, if you do not understand the rationale behind some rule, please let me know and I will try to write further explanatory posts.
Lastly, I tried to order the rules by importance. Very often, coding standards are so long that nobody reads them, let alone applies them. So if you feel this is too much, try to follow the first few rules only. And maybe, in the future, some tool will be able to check all these rules automatically.

Driving principles
First, here are a few general and driving principles:
  • Keep it simple: try to implement the simplest solution that solves the requirements.
  • Do not over-design: in other words, do not code for yet-inexistent but potentially future requirements.
  • Do not optimize early: performance problems are solved after they are spotted, not when they are imagined.
  • Remove known bugs before coding new functionalities.
  • Write at least one automatic non-regression test before fixing any bug. This is really the minimum. Writing tests before coding (full-fledged TDD) is best.
  • Refactor and clean up code often.
The coding priorities are:
  • robustness: the code does not unexpectedly stop,
  • correctness: the code does what it is expected to do,
  • performance: the code is sufficiently efficient both in time and space.
Bojagi, or the minimality of patterns

Size limitations

The cost of all software development tasks grows, non-linearly, with the size of the code. So the most important rule is to try to limit the line count. Here are some limits I like to apply:
  • 1 class per file,
  • less than 20,000 lines of code per solution (if not, split out libraries),
  • less than 1000 lines per class,
  • less than 100 lines per method,
  • less than 150 characters per line,
  • less than 10 fields per class,
  • less than 4 arguments per method.
Please keep in mind these are approximate values that give an idea of the optimal scale for each category.

Forbidden language constructs
Here are language constructs that are absolutely forbidden:
  • Static methods, except for Main.
  • Static fields (unless they are private readonly constants).
  • Nested classes. Why hide some code that could be reused? Also, it is better to have the whole complexity of your architecture apparent.
  • Goto statements.
  • Macros; in particular, regions (#region, #endregion) and conditional code (#if DEBUG).
  • The ref method parameter.
  • The new method modifier.
  • Operator overloading.
  • Dead code.
  • Obsolete code that is commented out. After all, you should be using a powerful source control system such as mercurial (read this and that, if you don't know what I am talking about).
  • The ternary conditional expression c ? e1 : e2.

Strongly discouraged language constructs

Here are language constructs which should be used sparingly:
  • The null value can almost always be avoided. In particular, you should never pass the null value as an argument; it is better to create another variant of the method. Returning null can be advantageously replaced by throwing an exception. In other cases, you can use the null object pattern. When null is used to encode a special case, document it explicitly.
  • Casts are forbidden in general. They may be necessary to implement powerful functionality, such as dynamic class loading. But then, they must be hidden in thoroughly tested libraries, and their use must be motivated and documented.
  • Generic classes should be rarely defined, even though their use is encouraged. Use generic classes for only one thing: when you need a collection of elements which all have the same type. This practice avoids unnecessary casts. Any other need should be carefully evaluated. In particular, powerful à la C++ tricks are banned.
  • Comments should be as few as possible. Special one-line comments starting with TODO:, to signal minor tasks or open questions, are allowed. Nothing is worse than paraphrasing comments such as:
    // Creates an instance of MyClass
    this.myField = new MyClass();
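The null object pattern mentioned above can be sketched as follows. This is a minimal illustration with an invented Logger example, written in Java (the C# version is nearly identical): instead of passing null when no logging is wanted, pass a do-nothing implementation, so callers never need a null check.

```java
// Hypothetical Logger example of the null object pattern.
interface Logger {
    void log(String message);
}

class ConsoleLogger implements Logger {
    public void log(String message) { System.out.println(message); }
}

class NullLogger implements Logger {
    // deliberately does nothing: stands in for "no logger" instead of null
    public void log(String message) { }
}

public class NullObjectDemo {
    static void process(Logger logger) {
        // no null check needed: logger is always a valid object
        logger.log("processing done");
    }

    public static void main(String[] args) {
        process(new ConsoleLogger());  // prints "processing done"
        process(new NullLogger());     // silently ignores the message
    }
}
```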

Reducing dependencies and making them explicit

  • Use the most restrictive visibility for classes: internal, then public.
  • Similarly, use the most restrictive visibility for fields, properties and methods: private, then internal, then protected, then protected internal, then public.
  • Minimize using directives. Put them all at the start of the file rather than fully qualifying class names in the code.
  • Access fields through this. rather than prefixing them with m_ or _.
  • Try to abide by the law of Demeter: avoid successive field accesses in the same statement.
  • Minimize and comment the use of delegates and events. They make the control flow harder to follow!
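The law of Demeter rule can be sketched as follows, using invented Customer and Address classes, in Java syntax. Rather than chaining accesses (order.getCustomer().getAddress().getCity()), the intermediate class delegates, so callers make one access per statement:

```java
// Hypothetical sketch of the law of Demeter.
class Address {
    private final String city;
    Address(String city) { this.city = city; }
    String getCity() { return city; }
}

class Customer {
    private final Address address;
    Customer(Address address) { this.address = address; }
    // delegation: users of Customer never reach into Address themselves
    String getCity() { return address.getCity(); }
}

public class DemeterDemo {
    public static void main(String[] args) {
        Customer customer = new Customer(new Address("Paris"));
        // a single field access per statement, as the rule suggests
        System.out.println(customer.getCity());
    }
}
```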

Structuring statements
  • Initialize fields in the constructors rather than next to their declaration.
  • Declare all objects that implement the IDisposable interface with a using statement. On rare occasions this may not be possible, in which case you should document how the object will be disposed. Streams and Windows Forms controls are examples of classes implementing IDisposable.
  • Boolean conditions should not have any side-effects.
  • Reduce the nesting level of the code. In particular, prefer short conditions and early exits, so that the following code:
    if (x != null && y != null)
    {
    x.DoSomething(y.GetSomething);
    }
    is rewritten into:
    if (x == null) return;
    if (y == null) return;
    x.DoSomething(y.GetSomething);
  • Constants must be declared const whenever possible, static readonly otherwise.
  • Do not use error codes; always prefer exceptions.
  • Ensure that all exceptions get caught at some point. In the worst case, set a general exception handling mechanism at the root of the program. Note this may not be that straightforward because of threads and the message loop of graphical applications.
  • Adding a logger is a good idea before shipping code.
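The general exception handling mechanism at the root of the program can be sketched as follows. This is a Java analogue with invented names: a last-resort try/catch in main, plus a default handler for exceptions escaping from other threads (one of the complications the rule above mentions):

```java
// Hypothetical sketch of a last-resort exception handler at program root.
public class RootHandlerDemo {
    public static void main(String[] args) {
        // catches anything escaping from a thread without its own handler
        Thread.setDefaultUncaughtExceptionHandler((thread, error) ->
            System.err.println("unhandled in " + thread.getName() + ": " + error));
        try {
            run();
        } catch (Exception e) {
            // last chance to log before exiting cleanly
            System.out.println("caught at root: " + e.getMessage());
        }
    }

    static void run() throws Exception {
        throw new Exception("boom");
    }
}
```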

Process rules
Bug tracker and project management
  • A bug-tracker such as Redmine should be used.
  • Tickets must be solved in order of priority.
  • Once a ticket is solved, it should be first sent back to its author, who is the only one authorized to validate and close it.
Rules for commit messages
  • Use a version control system such as Mercurial.
  • All commit messages are prefixed by one of the following:
    [add]: adding of a new feature,
    [fix]: fixing a bug (optimizations fall under this category),
    [clean]: cleaning up code,
    [doc]: for any comment,
    [spec]: adding a new test
    [merge].
  • Whenever possible, it is also advised to indicate a ticket number in the commit message.
Organisation of a solution
  • Every Mercurial repository stores one solution.
  • Every solution has a project that produces either a library (.dll) or an executable (.exe).
  • Every solution has a project PrePush that contains pre-push tests, and a project PostPush for lengthy tests.
  • The central repository compiles and executes all pre-push tests as a hook before accepting any push.
  • A continuous integration system such as Hudson executes both pre-push and post-push tests. It also triggers the check of downstream solutions.
  • Every solution has a version number, preferably generated from the source control system number.
  • For library projects, an automatically generated documentation, with a tool such as Doxygen, is advised.
Non-regression tests
  • NUnit non-regression tests are mandatory.
  • The minimum consists in writing a test before fixing any bug. That way at least the same mistake is not made twice.
  • Better yet, apply TDD. That is, write tests before even starting to write code.
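The write-a-test-before-fixing rule can be sketched as follows. This is a hypothetical example in plain Java standing in for NUnit (the countLines function and its bug are invented for illustration):

```java
// Sketch: before fixing a bug, first write the test that reproduces it,
// and keep the test so the same mistake is never made twice.
public class RegressionDemo {
    // the buggy version used text.split("\n").length, which miscounts when
    // trailing separators are involved; the fixed version passes -1 so that
    // no parts are dropped
    static int countLines(String text) {
        if (text.isEmpty()) return 0;
        return text.split("\n", -1).length;
    }

    public static void main(String[] args) {
        // the regression tests, run on every build by the CI server
        System.out.println(countLines("a\nb\nc"));  // prints 3
        System.out.println(countLines(""));         // prints 0
    }
}
```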

Syntactic conventions
A project should follow syntactic conventions, so that reading the code is easier for everybody. I believe that code should be light and succinct, and get to the point. So I tried to choose conventions that remove any unnecessary hindrance to one's understanding.
Note that limiting complexity, compartmentalizing code and writing automatic tests are much more important than focusing on the small details of instruction syntax. So the rules stated in this section of the coding standards are really the least important. However, surprisingly, most coding standards seem to contain several (boring) pages of such rules. Note also that most of these rules can be automatically checked by tools of minimal intelligence.

Naming conventions
  • Use camel case; underscores ("_") are forbidden except in test suites.
  • Project, namespace, class, interface, struct, method and properties names start with a capital letter.
  • Fields and variables start with a lower-case letter.
  • Interfaces start with I.
    Correct:
    struct Data;
    int bufferSize;
    class HTMLDocument;
    String MimeType();
    Incorrect:
    struct data;
    int buffer_size;
    class HtmlDocument;
    String mimeType();
  • Avoid abbreviations (except, maybe, for loop counters and X-Y coordinates).
    Correct:
    int characterSize;
    int length;
    int index;
    Incorrect:
    int charSize;
    int len;
    int idx;
  • Getters should have the same name as the field they access, except that their first letter is capitalized.
    Correct:
    int Count { get { return this.count; } }
    Incorrect:
    int getCount { get { return this.count; } }
  • Prefix boolean variables by is, has...
    Correct:
    bool isValid;
    Incorrect:
    bool valid;
  • Prefix event names by Event,
  • Prefix methods that handle an event with On,
  • Prefix delegate types by EventHandler,
  • Prefix methods and properties created for test purposes only with Test,
  • Prefix test methods that trigger an event with TestTrigger.
  • As much as possible, method names should contain a verb. Try using few different verbs. Avoid fancy verbs whose meaning is not clear to everybody. Try using the same verb for the same concept.
  • Prefix field accesses by this., rather than prefixing fields name with "_", or "m_".
Spacing
Care should be taken in presenting statements. A code with irregular layout looks messy. It is an indication of the poor care that was put into the code. And, according to the broken windows theory, it encourages further degradations from other coders.

Here are some of my preferred layout conventions:
  • Each instruction, even short ones, has its own line.
    Correct:
    x++;
    y++;
    Incorrect:
    x++; y++;
  • Braces have their own line. This rule applies regardless of the element that the braces enclose: namespace, class, method, property, conditionals... Is this convention better than the Java one, where the opening brace is on the same line as the keyword? Frankly, it does not matter much, as long as you are consistent. But if you really ask me, having the braces on their own lines makes it easier to visually identify blocks, plus an if statement can quickly be suppressed with a single one-line comment.
    Correct:
    void Execute()
    {
    DoIt();
    }
    if(condition)
    {
    DoTrue();
    }
    else
    {
    DoFalse();
    }
    Incorrect:
    void Execute() {
    DoIt();
    }
    if (condition) {
    DoTrue();
    }
    else {
    DoFalse();
    }
    if (condition)
    DoTrue();
    else
    DoFalse();
    if (condition) DoTrue(); else DoFalse();
    if (condition) DoTrue(); else {
    DoFalse();
    }
  • For very short bodies, such as those of getters, everything can be put on the same line. In this case, spaces around the braces are mandatory.
    Example:
    int GetInfo { get { return this.info; } }
  • For if statements, if the body completely holds on one line and there is no else, then the braces are omitted.
    Examples:
    if (condition) return;
    if (condition) break;
    if (condition) x = 0;
    if (condition) continue;
  • Put a space after keywords if, while, switch and foreach.
    Correct:
    if (condition)
    {
    DoIt();
    }
    Incorrect:
    if(condition) {
    DoIt();
    }
  • Boolean expressions that do not hold on one line should have their operators aligned to the left.
    Correct:
    if (b1
    || b2
    || b3)
    {
    }
    Incorrect:
    if (b1 ||
    b2 ||
    b3)
    {
    }
  • Avoid spaces around unary operators.
    Correct:
    i++;
    Incorrect:
    i ++;
  • Put spaces around binary operators.
    Correct:
    y = (m * x) + b;
    c = a | b;

    Incorrect:
    y=m*x+b;
    c = a|b;
  • Put no space before and one space after a comma, colon or semi-colon.
    Correct:
    DoSomething(a, b);
    class A: I
    {
    }
    for (int i = 0; i < 10; i++)
    {
    }
    Incorrect:
    DoSomething(a,b);
    class A:I
    {
    }
    for (int i = 0 ; i < 10 ; i++)
    {
    }
  • Avoid spaces between the name of a function and its opening parenthesis or between a parenthesis and an argument.
    Correct:
    DoIt(a, b);
    Incorrect:
    DoIt (a, b);
    DoIt( a, b );

Questions
Here are a few things I would like to ask you:
  • Does your project have coding standards?
  • Are they known and followed by the team members?
  • Do you think it is a good idea to require all non-negative integer values to have the type uint? In practice, I have noticed that the type uint seems to be very rarely used.
  • What about software cost estimation models such as COCOMO? Can this kind of model bring any value to the development process? How?