Friday, 10 April 2009

Choosing between PDF Clown and iText

I was looking for a Java library to manipulate files in PDF (Portable Document Format). The two obvious possibilities seemed to be PDF Clown and iText.

BTW, this isn't a detailed comparison but a general impression of the two products (a gut feeling)... Hmm, decisions, decisions...

I downloaded both JARs and checked out the documentation. I soon realised that there was a big difference in the quantity of documentation available: iText has a lot more. This makes sense, since Bruno Lowagie (iText's creator) has been working on the library since about 1999, while PDF Clown is relatively new.

I started out by installing the JAR files in Eclipse and trying out a simple example for each library. Unfortunately, I hit a small problem with the PDF Clown example (SerializationModeEnum.Compact, mentioned in the UserGuide, didn't seem to exist), while the iText example worked fine.

I then also read the following note about PDF Clown: "This project is young, so that it's to be considered UNSTABLE. Feel free to experiment with it, but DO NOT use it in a production environment (so: beware!)" at: http://www.stefanochizzolini.it/en/projects/clown/downloads.html#License Needless to say, it didn't really fill me with confidence in PDF Clown ;-)

On the other hand, iText has been used successfully by numerous commercial and open source applications: Macromedia ColdFusion (now owned by Adobe), JasperReports, Eclipse/BIRT, Google Calendar, etc.

For these reasons I decided to go with iText.

The main drawback of iText is that it doesn't properly implement the MVC (Model-View-Controller) pattern, meaning that it doesn't make it possible to cleanly separate content from presentation.
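
To make the content/presentation point a bit more concrete, here is a minimal plain-Java sketch of the kind of separation I have in mind. It does not use iText or PDF Clown at all: Block, Renderer, PlainTextRenderer and HtmlRenderer are hypothetical names invented purely for this illustration, and a real PDF renderer would obviously be far more involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ContentPresentationSketch {

    /** Content only: what the document says, with no layout decisions. (Hypothetical type.) */
    static final class Block {
        final String text;
        final boolean heading;

        Block(String text, boolean heading) {
            this.text = text;
            this.heading = heading;
        }
    }

    /** Presentation only: how a list of blocks is turned into output. (Hypothetical type.) */
    interface Renderer {
        String render(List<Block> content);
    }

    /** Renders headings in upper case, one block per line. */
    static final class PlainTextRenderer implements Renderer {
        @Override
        public String render(List<Block> content) {
            StringBuilder out = new StringBuilder();
            for (Block b : content) {
                out.append(b.heading ? b.text.toUpperCase(Locale.ROOT) : b.text).append('\n');
            }
            return out.toString();
        }
    }

    /** Renders the same content as very naive HTML (no escaping; illustration only). */
    static final class HtmlRenderer implements Renderer {
        @Override
        public String render(List<Block> content) {
            StringBuilder out = new StringBuilder();
            for (Block b : content) {
                String tag = b.heading ? "h1" : "p";
                out.append('<').append(tag).append('>')
                   .append(b.text)
                   .append("</").append(tag).append(">\n");
            }
            return out.toString();
        }
    }

    public static void main(String[] args) {
        // The content is built once and knows nothing about how it will be shown.
        List<Block> doc = Arrays.asList(
                new Block("Report", true),
                new Block("All good.", false));

        // Swapping the renderer changes the presentation without touching the content.
        System.out.print(new PlainTextRenderer().render(doc));
        System.out.print(new HtmlRenderer().render(doc));
    }
}
```

The point of the sketch is only that the same content object can be handed to different renderers; whether a given PDF library lets you work this way is exactly the design question raised above.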

The fact that I chose iText does not mean that PDF Clown is without merit: I like the object-oriented approach Stefano Chizzolini has taken. I simply feel that the library needs a bit more time before truly becoming mature.

If, after reading this article, people are interested in using iText, I would advise getting Bruno Lowagie's book "iText in Action": http://www.1t3xt.com/docs/book.php which is pretty much the definitive guide.



9 comments:

  1. Hi Martin

    I'm PDF Clown's lead developer.

    Your review seems quite objective (although, as you fairly explained, it's not an in-depth analysis).

    I'd just like to complete it with some considerations.

    1) Instability
    I have to admit that my disclaimer sounds more worrisome than it should. To interpret the term 'UNSTABLE' properly, you have to consider what it really means in the open-source community: it does not necessarily mean that there are problems, but rather that enhancements or changes have been made to the software that have not undergone rigorous testing, and that more changes are expected to be imminent (see [1]).
    I use 'UNSTABLE' to inform users that the library is in the development stage: its API could change as the library evolves, so I cannot guarantee that they will not need to update their own code to keep up with future releases. Anyway, PDF Clown's evolution is typically incremental (NOT disruptive), so no one has to worry about major changes that would force them to throw their code away! :-)
    Furthermore, I want to stress that I always take meticulous care to release only well-tested versions, free of any common-case bugs. Actually, reported bugs have typically been about fringe cases, i.e. particular uses that touched the implementation frontier (the 'hic sunt leones' of the library domain, the boundary between what has already been implemented and what hasn't yet).

    2) Documentation
    As you pointed out, PDF Clown's documentation is still partial: I appreciate any suggestions about the most important topics to treat from a user's point of view.

    3) Examples
    Whenever you stumble upon a problem, please let me know through the public forums (see [3]) so that I can correct it or suggest the right approach. I take every user request into consideration.

    4) Comparison
    PDF Clown is much younger than iText, so it's simply a fact that it's not as mature; on the other hand, its newness gives it a vantage point from which to avoid some design and implementation pitfalls that earlier projects may have encountered.

    Despite both dealing with the Portable Document Format, PDF Clown and iText originate from different philosophies and approaches.
    iText was initially conceived as a multi-format (PDF along with HTML, RTF etc.) generator, analogous to some other efforts like standard XSL-FO engines (see Apache FOP [2] as a popular implementation); later it was retrofitted with editing capabilities, such as encryption, annotations and a whole bunch of nice things.
    PDF Clown has been designed from scratch to smoothly combine generation, reading and editing capabilities within a cohesive, robust and flexible model. To mention just the latest developments, I'm currently working on the text extraction capability (you cannot find it in iText), which will allow users to retrieve page text along with rich information about its graphic location (coordinates) and style (font, font size, font color and so on). This isn't a retrofit: it's a coherent result of what I envisioned from the beginning, as it builds upon a common, versatile set of layers that serves disparate functionalities.
    Flexibility and simplicity also reveal themselves when you need to extend the library; I can cite, for example, the case of an IT Senior Consultant who needed to extract some file attachments from a PDF document but found that a particular codec filter was missing from PDF Clown's current implementation. Well, it took him just a few hours to figure out how to implement a new codec filter and send me his contribution (by the way, it will be part of the next release, 0.0.8).

    This is obviously not the appropriate context to offer a neutral critique of the respective merits of PDF Clown and iText; I just want to suggest that users compare the consistency of their object models, their flexibility and the cleanliness of their designs, looking from both a black-box perspective (the API usability) and a white-box perspective (the code implementation behind the API).

    In conclusion, due to its maturity, iText is the undisputed winner in terms of the richness of its feature set (for example, you cannot find encryption or digital signature support in PDF Clown); nonetheless, if you are also concerned with the elegance of the solution (for example: fully object-oriented traversal of PDF document contents and metacontents to perform advanced reading/editing operations without awkward tricks), I suggest you give PDF Clown a try.

    Thank you
    Stefano

    [1] http://en.wikipedia.org/wiki/Software_release_life_cycle#Stable_or_unstable
    [2] http://xmlgraphics.apache.org/fop/
    [3] http://sourceforge.net/forum/forum.php?forum_id=607163

    1. Hi Stefano,

      Your reply is informative.

      I would like to take this opportunity to ask for a suggestion/help on one of the issues I am facing with PDF highlighting.

      I am trying to highlight a phrase in a PDF file. I have been trying hard, for example by building a regular expression from the phrase and searching for it. Any suggestion would be very much appreciated.

      Thanks.

  2. Correction: iText has recently added a text extraction feature.

    Stefano

  3. Hi Stefano

    It is a great honour to get a comment from one of the two original authors :-)

    My comparison was based on the present state of the two libraries: for example, the fact that PDF Clown has not yet reached version 1.0. With time, I am sure that choosing between the two will become more and more difficult.

    I have got to say that in both cases I was touched by the generosity of the approaches and believe that both libraries are very useful to the Java community.

  4. I am used to iText and feel that it serves the purpose very well. I have never used PDF Clown, but your blog gives me a good impression of it. I too feel that the drawback of iText is that it doesn't properly implement the MVC (Model-View-Controller) pattern.

  5. Hi, the problem with iText is also its license. It now uses the AGPL license, and the commercial license is a little bit expensive. Many open source tools like JasperReports have to use an old version of iText (2.1.7), since it was the last release under the more permissive MPL/LGPL licensing. So that older version of iText is effectively "unmaintained".

    PDF Clown can be an iText replacement for those who do not want to use the AGPL version of iText.

  6. Wow, I can't believe Stefano commented on your post! Do you even realize how cool that is? He's like one of the leads in converting xml to pdf. I wish I could invite him to my birthday party.

  7. This discussion was started in 2009. It is now 2016, so I thought I might add something.

    PdfClown 0.1.2 is great. It is better than iText# 5.5.6 with respect to text extraction, which is the feature I used most.

    You can convert a PDF to HTML as an almost exact replica.
