Wednesday, July 30, 2008

Update to DecentXML

I've updated my XML parser. The tests now cover 97.7% of the code (well, actually 100% of the code which can be executed; there are a couple of exceptions which will never be thrown but I still have to handle them) and there are classes to read XML from InputStream and Reader sources (including encoding detection).

The XMLInputStreamReader class can be used standalone, if you ever want to read an XML file with the correct encoding.
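
For readers curious what encoding detection involves, here is a minimal sketch in plain Java. It is not the code of XMLInputStreamReader; the class name XmlEncodingSniffer and its methods are made up for illustration. It checks for a byte order mark, then falls back to the encoding given in the XML declaration, and finally assumes UTF-8. UTF-32 and UTF-16 without a byte order mark are deliberately left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of DecentXML.
final class XmlEncodingSniffer {
    // Matches encoding="..." or encoding='...' inside the XML declaration.
    private static final Pattern ENCODING =
        Pattern.compile("encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Guesses the charset from the first bytes of an XML document. */
    static Charset detect(byte[] head) {
        // Step 1: a byte order mark settles the question.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Step 2: ISO-8859-1 maps every byte to one char, so the ASCII-only
        // declaration can be read safely before the real encoding is known.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (text.startsWith("<?xml")) {
            int end = text.indexOf("?>");
            if (end > 0) {
                Matcher m = ENCODING.matcher(text.substring(0, end));
                if (m.find()) return Charset.forName(m.group(1));
            }
        }

        // Step 3: without BOM or declaration, assume UTF-8.
        return StandardCharsets.UTF_8;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller would read the first few hundred bytes of the stream, pass them to detect() and then wrap the stream in an InputStreamReader with the returned charset.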

You can download the sources and report issues in the new Google Code project I've created.

Information Management With Zotero

I've long been looking for a nice tool to manage my vast extra-brain information collection, i.e. the stuff that I don't want to save in my long-term memory. Web snippets, notes, that kind of stuff. None of the usual solutions appealed to me. Either I was locked to Windows or to a single computer or the UI was bad or the feature list lacked some important points.

Zotero to the rescue. This beast is advertised as "Zotero [zoh-TAIR-oh] is a free, easy-to-use Firefox extension to help you collect, manage, and cite your research sources. It lives right where you do your work — in the web browser itself."

Which makes sense. I view most of my information in my web browser, so why not collect it in there, too? The UI is nice, I'm just missing a few features. Also, being able to sync with my own server would be nice. But I'm sure that will be fixed soon. In the meantime, I can at least tag and order my snippets.

Tuesday, July 29, 2008

A Decent XML Parser

Since there isn't one, I've started writing one myself. Main features:

  • Allows 100% round-tripping, even for weird whitespace between attributes in elements
  • Suitable for building editors and filters which want to preserve the original file layout
  • Error messages have line and column information
  • Easy to reuse
  • XML 1.0 compatible

You can download the latest sources here as a Maven 2 project.

Monday, July 28, 2008

DSLs: Introducing Slang

Did you ever ask for a more compact way to express something in your favorite programming language? Say hello to DSL (Domain Specific Language). A DSL is a slang, a compact way to say what you want. When two astronauts talk, they use slang. They need to get information across, and pronto. "Over" instead of "I'll now clear the frequency so you can start talking." And when these guys do it, there's no reason for us not to.

Here is an article on Java World which gives some nice examples of how to create a slang in Java and in Groovy. Pizza-lovers of the world, eat your heart out.

FREE! Really.

I just found a nice comment under my blog. It offered a free service. One sentence was: "REGISTRATION IS ABSOLUTELY FREE!" When you see that, you know you're being ripped off. I'm not mentioning the name of the guys who tried that stunt in order to give them no additional advertisement. 'Nuff said.

Tip: If you want me to join your planet or RSS mega feed or whatever, it's not smart to post a comment in my blog. This is my blog, my reputation, my honor. I decide who gets free advertisement here.

Saturday, July 26, 2008

Testing With Databases

When it comes to testing, databases are kind of a sore spot. People like to think that "you can't test when you need a database" or "it's too complicated" or "it's not worth it." I'd like to give you some ideas about what you can do when you need to test code that depends on a database. This list is sorted in the order in which I try to tackle the problem:

  1. Use POJOs to store the data from the database in the real code and for the tests, create some dummy objects with test data and use them.
  2. Make the database layer a plug-in of your application and replace it with a mockup for testing that doesn't need the database and which returns test objects instead.
  3. Instead of connecting to the real database, get HSQLDB or Derby and use an embedded or at least local database. I prefer HSQLDB because it's smaller and starts faster (and tests should always be fast) but Derby has more features.
  4. Create a second instance of the production database system on a different machine, preferably your own computer.
  5. Create another instance of the real database with test data on the same machine as the real database.
  6. Use database schemas to create a logical database in the real database, for example if all tables are in the schema APP, create APP_TEST and in your code, add a way to replace the schema name in the SQL statements. If you wrote the DB layer yourself, use a system property which isn't set in production. If you're using Hibernate, walk the mapping objects which are created and replace the table names after loading the production configuration. Field.setAccessible(true) is your friend.
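
The first two ideas can be sketched in a few lines of Java. All names here (Customer, CustomerStore, InMemoryCustomerStore, GreetingService) are made up for the example. The real application would plug in an implementation of CustomerStore that talks to the database; the test plugs in one that returns dummy POJOs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Idea 1: a plain POJO carries the data, no matter where it came from.
final class Customer {
    final long id;
    final String name;

    Customer(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Idea 2: the database layer hides behind an interface (the "plug-in").
interface CustomerStore {
    List<Customer> findByNamePrefix(String prefix);
}

// Test replacement: no database, just a handful of dummy objects.
final class InMemoryCustomerStore implements CustomerStore {
    private final List<Customer> customers;

    InMemoryCustomerStore(Customer... customers) {
        this.customers = Arrays.asList(customers);
    }

    @Override
    public List<Customer> findByNamePrefix(String prefix) {
        List<Customer> result = new ArrayList<>();
        for (Customer c : customers) {
            if (c.name.startsWith(prefix)) {
                result.add(c);
            }
        }
        return result;
    }
}

// The code under test only ever sees the interface.
final class GreetingService {
    private final CustomerStore store;

    GreetingService(CustomerStore store) {
        this.store = store;
    }

    List<String> greetingsFor(String prefix) {
        List<String> greetings = new ArrayList<>();
        for (Customer c : store.findByNamePrefix(prefix)) {
            greetings.add("Hello, " + c.name + "!");
        }
        return greetings;
    }
}
```

A test creates a GreetingService with an InMemoryCustomerStore holding two or three customers and checks the returned greetings. No connection, no setup, and it runs in milliseconds.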

If you can't decide, here are a few hints:

Creating two databases using schemas in the same instance can get you into serious trouble without you noticing. For example, the tests should be able to rebuild the test database from scratch at the press of a button so you can be sure in which state the database really is. If you make a mistake with the schema name during that setup, you'll destroy the real database. You might not notice you did, because the flawed statement is usually hidden under a few hundred others.
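
One possible safety net, sketched under assumptions: the property name "app.schema", the "_TEST" suffix rule and the CUSTOMER table are all invented for this example. The point is that the setup code refuses to produce a single destructive statement unless the schema name looks like a test schema.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical guard around the "rebuild from scratch" step.
final class TestSchemaGuard {
    static final String SCHEMA_PROPERTY = "app.schema"; // assumed name, never set in production
    static final String PRODUCTION_SCHEMA = "APP";

    /** Schema to use: the system property if set, otherwise production. */
    static String currentSchema() {
        return System.getProperty(SCHEMA_PROPERTY, PRODUCTION_SCHEMA);
    }

    /** Throws unless the schema name ends with _TEST. */
    static String requireTestSchema(String schema) {
        if (!schema.endsWith("_TEST")) {
            throw new IllegalStateException(
                "Refusing to rebuild non-test schema: " + schema);
        }
        return schema;
    }

    /** Builds the rebuild script; fails before any SQL is generated. */
    static List<String> rebuildStatements(String schema) {
        String s = requireTestSchema(schema);
        return Arrays.asList(
            "DROP TABLE " + s + ".CUSTOMER",
            "CREATE TABLE " + s + ".CUSTOMER (ID INTEGER PRIMARY KEY, NAME VARCHAR(100))");
    }
}
```

This doesn't remove the risk of sharing an instance with production, it only makes one kind of mistake loud instead of silent.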

Installing a second instance on a different machine might trigger license fees or your DB admin might not like it for some reason. Also, a test database should be very flexible because you'll need to be able to drop and recreate it a dozen times per hour if you need to. Your DB admins might not like to give you the necessary rights to do that. Lastly, this means only one developer can run all the tests at any given point in time because you're all going against the same database. This is bad, really bad. More often than not, you'll have spurious errors because of that.

If you can legally get a copy of the real database on your own machine, that's cool ... until you see the memory, CPU and hard disk requirements. Plus, a DB admin will probably hog your machine for a day or two to install it. Having to run two applications which need 1GB of RAM each (your IDE and the DB) on a machine that has only 1GB of RAM isn't going to be fun.

In many cases, using HSQLDB or Derby is a good compromise between all the forces that pull at you. While a database will still slow your tests down, they will often run much faster than against the real DB. You can install these as many times as you like without any license issues, fees or DB admins bothering you. They don't take much memory or hard disk space and they are under your total control.

Only, they are not the real DB. There might be slight differences, hidden performance issues and other stuff that you won't notice. Don't worry about that, though. If you can test your application, you'll find that you'll be able to fix any problems that come up when you run against the real database in little time. If you can't test your application, though, well, you're doomed.

I strongly recommend being able to set up the database from scratch automatically. With Derby, you can create a template database and clone that on the first connection. With HSQLDB, loading data is so fast that you can afford to rebuild it with INSERT statements every time you run the tests.

Still, test as much code as possible without a database connection. For one, any test without a DB will run 100-1000 times faster. Secondly, you're adding a couple more points of failure to your test which are really outside the scope of your test. Remember, every test in your suite should test exactly one thing. But if you can't get rid of the connection, you're testing the feature you want plus the DB layer plus the DB connection plus the DB itself. Plus you'll have the overhead of setup, etc. It will be hard to run a single test from your suite.

At the end of the day, testing needs to be fun. If you feel that the tests are the biggest obstacle in being productive, you wouldn't be the good developer you are if you didn't get rid of them.

One last thing: Do not load as much data as possible! It is a common mistake to think that your tests will be "better" if you have "as much data as possible". Instead load as little data as possible to make the tests work. When you find a bug, add as little data as possible to create a test for this bug. Otherwise, you'll hog your database with useless junk that a) costs time, b) no one can tell apart from the useful stuff and c) it will give you a false feeling of safety that isn't there.

If you don't know which data is useful and which isn't, then you don't know. Loading of huge amounts of junk into your database won't change that. In order to learn, you must start with what you know and work from there. Simply copying the whole production system will only slow you down and it will overwrite the carefully designed test cases you inserted yesterday.

Nexus, a Maven Repository Manager

If you're using Maven in a corporate environment, then you're struggling with the same problems all over again: How to make sure that the build builds?

While a simple task at first glance, there are a few hidden obstacles which boil down to two things: Downloads via the Internet and plugin or dependency version stability. Both can be solved by using a proxy or an in-house repository.

The guys from Sonatype have been busy in the last months and have released Nexus 1.0.0-beta-4.2 which gives you another option to choose from besides Archiva or DSMP (my own Maven 2 proxy). I tried Nexus yesterday and I have to say that I'm very pleased with the result. As usual for Open Source Software, the beta is more stable than some post-beta commercial products and it delivers with very little setup (follow the link to see the documentation).

Now, we have a second issue: version stability. Here is my recipe to achieve that. First of all, version anything in your POM. All dependencies, all plugins, everything. I'm using properties for that which I define in a common parent POM plus I'm using the dependency management. Maven 2.0.9 helps a lot here because it forces you to add version elements everywhere.

The next step is to make sure the Maven builds can find their stuff. To do that, I suggest setting up two Nexus repositories. The first one is the "build" repository, the second one is the "cache" repository. All developers should use the "build" repository; only the "cache" repository can actually download dependencies from the Internet.

The "build" repository, on the other hand, is just a local repository with no Internet connection. To avoid mistakes, I suggest installing the build repo with the default settings but with all remote repositories deleted or turned into local ones. The "cache" repository should run on an unusual port and with the remote repositories enabled as described in the installation documentation.

Next, you need to create a profile in your settings.xml which switches mirrors between the two. When you want to check out a new version of some plugin, switch to the cache repository and have it download all the new stuff. This will pollute your local copy of the Maven repository but only yours. After you have verified that the build completes (or fixed all the problems you've got), check the RSS feeds of Nexus for stuff it downloaded. Then all you have to do is copy those files to the "build" repository. After a refresh, all the other developers in your company can use the new, verified downloads.

Clean your local repository and build again to make sure that your colleagues won't have any problems after the change and you're set.

Wednesday, July 23, 2008

Management Is The Art of Choosing What Not To Do

From Rands in Repose: "... management is the art of choosing what not to do ..."

If you want to know more about management told in a way an engineer can understand, consider Rands' book "Managing Humans".

Tuesday, July 22, 2008

The Code Reuse Myth

The post "The Code Reuse Myth" by James Sugrue got me thinking.

The main problem with code reuse is that our programming languages don't support it. We sacrificed this to the gods of efficiency four decades ago and, while a few people dared to question the practice, all of them were struck down by unexpected lightning out of the blue, so far.

As James said, "context" is the keyword. What we'd need is a programming language where you can adapt concepts to a context that goes beyond "object instance" or "class" or "application". What we need is an efficient way to say "collect all objects and sort them by ID" where the context defines what "ID" is. What we need is an efficient way to describe a model (relations of basic data types) and then have some tool map that efficiently to reality so we can reuse parts or fragments of the model in different contexts.
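
To make "the context defines what ID is" a bit more concrete, here is how close today's Java gets. It's only an approximation: the context has to be passed in explicitly as a function, which is exactly the kind of ceremony I'd like the language to handle for me. The class ContextSort and its method are invented for this example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: the algorithm is fixed, the meaning of "ID" is not.
final class ContextSort {
    /**
     * Collects all items and sorts them by whatever the caller's
     * context considers to be the ID.
     */
    static <T, K extends Comparable<? super K>> List<T> sortedById(
            Collection<? extends T> items,
            Function<? super T, ? extends K> id) {
        List<T> copy = new ArrayList<>(items); // leave the input untouched
        copy.sort(Comparator.comparing(id));
        return copy;
    }
}
```

In one context, the call might be sortedById(words, String::length), in another sortedById(customers, c -> c.getNumber()), where getNumber() stands for any accessor your own class offers. The sorting code is reused; the "ID" comes from outside.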

OO can't do that because it's limited to factories and inheritance. Traditional preprocessors can't do it, because they can't see the AST. Having closures in a programming language is the first, tiny step in the direction to be able to push external context into existing code. This allows us to put the code to access the database into a library and influence what it does per row of the result with a closure.
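
Here is a minimal sketch of that idea using Java lambdas. The "library" part, forEachRow(), stands in for real database access code; to keep the example self-contained, the rows are plain maps instead of a JDBC result set, and all names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class RowDemo {
    // "Library" code: knows how to walk the rows, not what to do with them.
    static void forEachRow(Iterable<Map<String, Object>> rows,
                           Consumer<Map<String, Object>> perRow) {
        for (Map<String, Object> row : rows) {
            perRow.accept(row);
        }
    }

    // "Application" code: the closure captures minAge and names from its
    // own context and pushes them into the library's loop.
    static List<String> namesOlderThan(List<Map<String, Object>> rows, int minAge) {
        List<String> names = new ArrayList<>();
        forEachRow(rows, row -> {
            if ((Integer) row.get("age") > minAge) {
                names.add((String) row.get("name"));
            }
        });
        return names;
    }
}
```

The loop, and in a real library the connection handling and cleanup around it, is written once. What happens per row is decided by the caller.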

But to be a real new paradigm, we need "closures" in data types as well. This means being able to reuse fragments of code and data structure definitions in a new context. These fragments need to pull along all the algorithms and structures they need without the developer having to pay close attention to what is going on until the point where the result needs peephole optimization because of performance issues.


fragment Named {
    String name

    String toString () {
        return "name=${name}"
    }
}

class File : Named {...}

class Directory : Named, File {...}

Looks simple but with OO, this will get you in a lot of trouble: Directory gets a name from the class File and from the fragment Named (this is an artificial example, bear with me). Which one should the compiler choose? In OO, I can't say "I don't care" or "Merge them".

With real fragments, you could say "Directory is a File in the sense that it supports a lot of the operations of a file (like rename, delete, get last modification time) but not all (you can't execute a directory or open it for reading)." So the example would look like this:

class File : Named {
    void delete () {...}
    Reader openForReading () {...}
}

class Directory : Named, File.delete {...}

Now, a Directory has a name and it "copies" the method "delete" from File along with anything this method would need. Internally, the compiler could create an invisible File delegate or it could clone the source or byte code of the File class, or whatever. Or, even better, we could say "give me a copy of File.delete() but replace all references to the field File.path with Directory.path."
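
For comparison, this is roughly what the "invisible delegate" variant looks like when you write it by hand in Java today. FileNode and DirectoryNode are made-up names, and delete() only flips a flag so the example doesn't touch a real file system.

```java
// Stand-in for the File class from the example above.
final class FileNode {
    private final String path;
    private boolean deleted;

    FileNode(String path) {
        this.path = path;
    }

    void delete() {
        deleted = true; // a real implementation would remove the file
    }

    boolean isDeleted() {
        return deleted;
    }

    String openForReading() {
        return "contents of " + path;
    }
}

// A directory "copies" delete() by forwarding to a hidden FileNode.
// openForReading() is simply not exposed.
final class DirectoryNode {
    private final FileNode delegate;

    DirectoryNode(String path) {
        this.delegate = new FileNode(path);
    }

    void delete() {
        delegate.delete();
    }

    boolean isDeleted() {
        return delegate.isDeleted();
    }
}
```

Every forwarded method has to be typed out and kept in sync manually. That's the boilerplate which a language with fragments could generate and check for me.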

The main goal would be to let me use the compiler as a cut'n'paste tool which checks the syntax and allows me to say "copy that method over there and replace xxx with yyy". Because that's why we think that code reuse could work: We see the same code over and over and over again and each time, the difference is just a tiny little bit of code which we can't "patch" because the language just doesn't allow it.

Lovin' Linux? Dig This!

Want to make Linux better? Ask the Linux Hater. If in doubt: He wouldn't write 15 articles per month telling where Linux sucks if he didn't care.

Monday, July 21, 2008

How to Cure a Fanatic

Like many people, I've always been wondering how the Jews, barely escaped from being extinct, can behave like they do in Israel and Palestine today. It seems, some of them wonder as well. One of them is Amos Oz who has written a wonderful book about fanaticism: How to Cure a Fanatic.

If you don't understand that I'm arguing against violence here, get the book and read it.

According to the book, a fanatic is a person who cares so much about you that he'd rather kill you than let you be miserable.

Oddly, this makes sense. Fanatics want to make the world a better place -- at all cost. In the second chapter of the book, Oz tells a short story why this doesn't work. He does that in a way that even a fanatic might understand (translated into English by me; all mistakes are mine).

A friend of Amos Oz, the Israeli novelist Sami Michael, once made a long trip in a car. During the ride, the driver gave him the usual lecture on how important it was for the Jews to kill all Arabs.

Instead of harassing this guy with "What a horrible man you are! Are you a Nazi? A Fascist?", Sami listened. He had decided to try a new approach and he asked the driver: "And who, in your opinion, should actually kill all the Arabs?"

The driver replied: "What are you talking about? We! The Israeli Jews! We have to! We have no choice, just look at what they do to us every day!"

"But who exactly should do the job in your opinion? The police or maybe the army or the fire brigade or a team of doctors? Who should do the work?"

The driver scratched his head: "I think it should be spread among us. Everyone should kill a few."

Sami went along with the game. "O.K., I assume you will pick an apartment building in the capital of Haifa, you ring the doorbell or you knock on every door and you say: 'Excuse me, dear Sir or Madam, are you an Arab by any chance?' And if he or she should reply with 'yes', you will shoot them. Then, you just finished your block and want to go home, just then, you hear a baby cry somewhere on the third floor. Would you go back and shoot the baby? Yes or no?"

There was a moment of silence, then the driver said to Sami: "You know, you are a very cruel person."

Now, if you feel anger or disgust, you didn't understand the point of the story, so get the book and read it. For everyone else, think about it. You'll be surprised how many levels of understanding this simple story has and how well it explains the reasons and the fundamental flaw of a fanatic.

Disclaimer: No humans and no animals were harmed, tortured or killed for this blog entry. Only my cat is now mad at me because I dared not to devote her my full attention while I wrote this.

Toying With Swing

I've been toying with Java Swing (the UI which comes with Java in case you're wondering) a bit lately to determine which UI to use for my ePen project. I'll post a longer article about my findings in the next few days but for now, just a few links I've collected:

From LegHumped:

Then, there is the JFC Swing FAQ and of course the Java Swing Trail by Sun.

I've been looking for a good source on editor components. Swing Hacks looked promising but it seems to only scratch the surface like the rest.

Wednesday, July 16, 2008

Docs? Ask The Sphinx

If you need to generate docs for your Python projects, try Sphinx.

Tuesday, July 15, 2008

There Are Two Kinds of People ...

"... Those who separate people in two kinds and those who don't." But I digress.

Ever wondered about the wars in IT? VI vs. Emacs? KDE vs. Gnome? IntelliJ IDEA vs. Eclipse? PC vs. Mac? Why can't people pull in the same direction for once?

Well, because they can't. Duh. We all have differences and we find that these make our life richer or simpler. Can't have a discussion with a guy who always agrees with you, can you? Or just imagine your better half doing everything the way you do ... you couldn't even out your advantages and disadvantages! During work, we accept that people are different and we find that useful because it means that work can be spread and people can do what they're good at (instead of what they suck at).

Sometimes, this difference goes deep. Way deep. It's so fundamental to our personality that we don't even question it. That's the fundamental schism which fuels the wars in IT. There are "vi" people and there are "emacs" people. Each member of both groups thinks the others are imbeciles who just won't see the light, no matter how often someone tries to beat some sense into them.

The "vi" people want to get things done and they don't want the tools to get in the way. A tool should be like a hammer: Simple, to the point, easy to understand and use. If it comes with a manual, it's not a tool, it's a distraction.

The "emacs" people, on the other hand, like to have the most powerful tool they can find at their fingertips. They want to abstract, hide, build tool layer upon tool layer until the task at hand literally happens at the press of a button. If the tool can't be customized, it's not a tool, it's a nuisance.

No matter how much you wish for it, these two kinds of people will never use the others' tool. If they have to, they will be constantly irritated. Take IntelliJ IDEA, for example. I'm a "vi" guy and this IDE just freaks me out. It's always doing something with my source that I never told it to do, always reformatting, always adding and removing whitespace, always getting in my way. I hate it.

Eclipse, on the other hand, comes with a rich tool set. I can have my source formatted any way I like, but it only does so when I tell it to. The default is to leave my artwork alone. Eclipse doesn't try to be smarter than me. Eclipse gets my jobs done when I want them to be done and it doesn't get in my way.

Don't get me wrong. I'm not telling you why Eclipse is better for you than IntelliJ, I'm saying it's better for me. I'm a "vi" guy.

Now, you may argue that I could probably hack IntelliJ into doing what I want. That's my point exactly: If I have to turn IntelliJ into an Eclipse clone to be able to use it, why not use the tool which fits my hand to begin with? And let's face it: No matter how customizable a tool is, after you've turned it into a clone, there will still be a lot of corner cases.

These come from the core of the design of these tools and that's what makes them as fundamentally different as two humans and no argument in the world will change that.

So, next time someone comes up and says "This or that would be better", answer: "It is better for you. How about me?"

Friday, July 11, 2008

Fastest Way to Collecting Objects in a String

The fastest way to collect a list of objects in a String in Java:

StringBuilder buffer = new StringBuilder ();
String delim = "";
for (Object o: list) {
    buffer.append (delim);
    delim = ", "; // Avoid if(); assignment is very fast!
    buffer.append (o);
}
String result = buffer.toString ();
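
Wrapped into a method, the same loop can be reused and tested on its own. The class name JoinDemo is just a placeholder for wherever you keep your utilities.

```java
import java.util.Arrays;
import java.util.Collections;

final class JoinDemo {
    /** Joins the string forms of all items, separated by ", ". */
    static String join(Iterable<?> items) {
        StringBuilder buffer = new StringBuilder();
        String delim = ""; // empty before the first item, ", " afterwards
        for (Object o : items) {
            buffer.append(delim);
            delim = ", ";
            buffer.append(o);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("a", "b", "c"))); // a, b, c
        System.out.println(join(Collections.emptyList()));      // empty line
    }
}
```

Since Java 8, String.join(", ", strings) does the same for lists of strings (more precisely, anything that is a CharSequence). For arbitrary objects, the loop above still does the job without any conversion step.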

Wednesday, July 09, 2008

Are Bad Tests Worse Than No Tests At All?

In his article "Are Bad Tests Worse Than No Tests At All?", olivstor writes:

Are the drawbacks to bad tests worse than having no coverage at all? I think the answer is that in the short term, even bad tests are useful. Trying to squeeze a extra life out of them beyond that, however, pays diminishing returns. Just like other software, your tests should be built for maintenance, but in a crunch, you can punch something in that works. It's better to have bad tests than to have untested code.

Tests are like any other code: They can go bad.

In my career, I've found that it's surprisingly hard to write good tests if you have no experience in doing so. People starting to write tests make them too complex, too long, let them have too many dependencies and they take too long to run.

If you're in such a situation, you have to face the fact that you just programmed yourself into a corner and you must spend the effort to get out of there. Tests are no magic silver bullet. They are code and follow the usual rules of coding: When it hurts, something is broken and it won't stop hurting unless you fix it.

So in this sense, I agree that bad tests are better than no tests because they tell you early that you need to fix something. That's what their core purpose is.

Management might argue that you're spending too much time on testing. I've never had a problem selling my approach to them. I usually figure that I spend 50% of my time (or more!) writing tests and 50% actual coding - and I'm still much faster than those who write code 80% of the time or more. What's more: when my code goes into production, it is rock solid or at least easy to fix when something comes up. In 99% of the cases, the things I need to fix were those which I didn't test.

This is a positive reinforcement loop which drives me to test more and more because it stops the hurting. If your tests cost more than they seem to return, you need to fix them until you get the same positive feeling when you think about them.

Tuesday, July 08, 2008


While doing some work with MQSeries, I got an error "MQJE001: Completion Code 2, Reason 2045" in MQQueueManager.accessQueue() which translates to "MQRC_OPTION_NOT_VALID_FOR_TYPE". Hm. Hey, IBM, how about adding real error messages to your products instead of having people look up odd codes in tables?

Anyway, the error means that I'm trying to open a queue for output which doesn't support this. For example, remote queues can't be opened with the option MQC.MQOO_OUTPUT. Other queues don't allow reading from them, i.e. you have to get rid of MQC.MQOO_INPUT_AS_Q_DEF in the openOptions.

TurboGears 2.0 Is On Track

There are three things which hooked me to TurboGears:

  1. Every day stuff is simple, complex stuff is possible
  2. Automatic reload after code change (no need to restart)
  3. It's in Python

What I didn't like is that TG 2.0 has been so quiet for so long. I'm on the Planet Turbogears RSS feed and I wasn't sure whether 2.0 was alive or dead or whatever.

Well, it seems to be more alive than I expected and hopefully, we'll see a 2.0 soon. In "Doing the right thing should be easy" by Mark Ramm, you can find more details.

Starting Your Own OSS Project

If you're planning to roll your own little OSS toy project, you should read the article "Party of one: Surviving the solo open source project" by Kirill Grouchnikov. Very good points on what to do and what to avoid and why.

Genes in Wikipedia

So if you ever wanted to know how that stuff works that you're made of, scientists have started to put their knowledge about genes on Wikipedia. Beware, though, it's heavy stuff.

It's interesting to see how you can use an automated process to merge complex information from one system (the scientist's databases) into another. Now, I'm waiting for the "translate this goo into language xyz" bot :)

Sunday, July 06, 2008


Strange movie. Here in Switzerland, it's sold as a "comedy" but it's not, and people will be disappointed. Also, I'm unhappy about the amount of futile violence and gore in the movie. There are a couple of scenes where you'll sit in your chair and think "What the f***!?". This is bad. While in the theater, you should never realize that you're watching a movie.

All in all, I think that the movie failed to deliver because it couldn't explain enough. Maybe it was too short or maybe the wrong scenes were chosen, I don't know. I left the cinema with a strange feeling of confusion; things just didn't add up. Unlike with other movies, I'm not able to say what they could have done differently. After the big surprise in the second part of the movie, the behavior of the characters is suddenly consistent and you know why Hancock is such a bastard. Only, I don't know, it's as if something is lacking.

Hancock is shallow and that fits for a comic character but he's more a tragic character and this doesn't add up for me. So in the end, even when he finds his only love and gets killed for her, I don't really care anymore (as much as you care when Garfield gets flattened by a door).

See what others have to say.