Sunday, February 1, 2009

Using the Best Tools in Programming: Not Really Doable

There's something that bothers me when it comes to starting a new project. You can't really use the best tool for a certain job, if that tool is not integrated with the rest of your platform. Let me explain.

At our startup we pride ourselves on our pragmatism. We are true polyglots :) capable of diving into any project, no matter the language it was written in. This also gives us the power to make educated choices about the technologies we're going to use for our own gigs.

Our programming language of choice is Perl, because of its flexibility and because usually there's no need to reinvent the wheel since you can find a CPAN module for almost anything.

But recently I began experimenting with data-mining techniques, flirting with various NLP libraries. You can find almost anything in CPAN's AI:: namespace. But I also knew about NLTK, a Python collection of libraries with excellent documentation, and I also found OpenNLP, MontyLingua, ConceptNet, link-grammar and various Ruby modules.

And all of a sudden I got cold feet. Java packages in OpenNLP may have the advantage of speed (just a guess and it doesn't matter for the purpose of this discussion). NLTK has pedigree and great documentation, not to mention that many books related to NLP, AI and data mining have Python samples (for example I own Programming Collective Intelligence and AIMA). Usually the solution is straightforward: you test all the options, and choose the best one.

But what if you want to combine them?

Well, then you're shit out of luck. Sure, you can do that with inter-process communication, but then you'll have to write glue code and pay the price in extra latency, bandwidth and memory. When you're parsing millions of documents and moving results between processes, that's not really practical. Perl does have Inline::Java, but I would only use it in extreme situations.
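To give an idea of the glue code involved, here is a minimal sketch in Python. The worker script is a hypothetical stand-in for a tool written in another language, and the names WORKER_SOURCE and tokenize_out_of_process are made up for illustration. Documents go out as JSON lines over stdin and results come back over stdout, so every document gets serialized and parsed twice.

```python
import json
import subprocess
import sys

# Hypothetical worker: stands in for a tool written in another language.
# It reads one JSON document per line and writes one JSON result per line.
WORKER_SOURCE = r"""
import json, sys
for line in sys.stdin:
    doc = json.loads(line)
    result = {"id": doc["id"], "tokens": doc["text"].split()}
    sys.stdout.write(json.dumps(result) + "\n")
"""


def tokenize_out_of_process(documents):
    """Send documents to a child process and collect its results."""
    # Serialize every document: this is part of the price of IPC.
    payload = "".join(json.dumps(doc) + "\n" for doc in documents)
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    # Parse every result again on the way back.
    return [json.loads(line) for line in completed.stdout.splitlines()]


if __name__ == "__main__":
    docs = [{"id": 1, "text": "parsing millions of documents"}]
    print(tokenize_out_of_process(docs))
```

Even in this toy version, the process startup, the double serialization and the second copy of the data in memory are all visible.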

That's why there's so much wheel reinvention around. Unless a module is written in C, for which any language has an FFI, almost nobody wants to use a Java module from Ruby, or a Python module from Perl. That's why there's Lucene, and then there's Lucene.NET, CLucene, Ferret, Zend_Search_Lucene, Plucene and Lucene4c.
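To make the FFI point concrete, here is a rough sketch in Python that uses the standard ctypes module to call strlen from the C library. It assumes a POSIX-like system where the C library can be found. The wrapper name c_strlen is made up for illustration.

```python
import ctypes
import ctypes.util

# Locate and load the C library. On POSIX systems, if find_library
# returns None, CDLL(None) falls back to the symbols already loaded
# into the current process, which include the C library.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a byte string as computed by C's strlen."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"Lucene"))
```

No wrapper process and no serialization: the call happens in-process, which is why C libraries get reused while everything else gets ported.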

What is really needed is a universal virtual machine with a flexible MOP, allowing seamless communication between languages. I'm happy there are a couple of efforts in this space, including Parrot and the DLR. Also, the biggest obstacles for alternative implementations are the modules written in C. Fortunately, JRuby/Rubinius have a brand new implementation-independent FFI, and Ironclad will allow IronPython users to use CPython extensions (number one on their list being numpy).

These developments make me happy :)

11 comments:

Martin C. Martin said...

The big advantage of Groovy is that it can seamlessly use any Java class. No need to write a wrapper to translate between two different object systems with different semantics. So you get all Java code for free.

Of course, it still doesn't give you Perl or C/C++ code. Ruby and Python code might be possible through JRuby and Jython.

So, you can consider OpenNLP, MontyLingua and the others to be both Java and Groovy libraries. It's quite a leg up.

Anonymous said...

Try jpype (http://jpype.sourceforge.net/). I've used it recently to call Java from CPython.

StoneCypher said...

Relatively easy to glue those together using Erlang, actually. Take a look.

Alex said...

Someone posted anonymously in a rude manner that I should use Inline::Java. Of course I deleted the comment since I'm usually trying to have grown-up conversations :)

I already mentioned Inline::Java, but I don't like its implementation. I would rather expose a RESTful web service from the foreign module I'm trying to use.
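A minimal sketch of what such a service could look like, using only Python's standard library. The tokenize function is a hypothetical stand-in for the foreign module, and the /tokenize path, the JSON shape and the helper names are assumptions made for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def tokenize(text):
    """Hypothetical stand-in for a call into the foreign module."""
    return text.split()


class TokenizeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and run the wrapped function.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"tokens": tokenize(payload["text"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server():
    """Start the service on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), TokenizeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_service(port, text):
    """Client side: what the calling language would do over HTTP."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/tokenize",
            body=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())["tokens"]
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_server()
    print(call_service(server.server_address[1], "grown-up conversations"))
    server.shutdown()
```

The client half is the only part the calling language needs, and any language with an HTTP library and a JSON parser can write it.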

Eugenio Ciurana said...

Great post!

In terms of integration: take a look at the Mule platform. You can use that to glue applications together with minimum effort since it supports all kinds of transports out of the box, and transformation tools. You can easily have a Perl module talk to a Java module over any combination of transports (raw TCP, HTTP, JMS, in-memory) and apply transformations "in flight" that simplify how much gluing you need to do.

This will let your development and testing effort focus on connecting and transforming things without having to regression-test every component in the system. It also lets you give a faster turn-around for code, and faster mashing up of legacy and new applications.

Cheers!

E
http://eugeneciurana.com
http://istheserverup.com

Anonymous said...

The problem with a "universal VM" is that at the language level, there's absolutely no agreement about what types (which would become the interfaces) are or mean. Do you have an explicit bool type, or do you use ints 0/1, or is the empty list false, or is any empty data structure false, or just a couple, and can the user define new false-predicates? (And that's just bools!) Then there's other built-in types, and user-defined types and classes (are they the same?) and methods.

What about a library that took advantage of syntactic abstraction? Could you use a Scheme macro from Perl? (I don't even know how to use a Scheme macro in Common Lisp!)

It always sounds cool to think about a "universal VM", but since programming languages are how programmers communicate, it doesn't seem much more feasible than trying to get everybody to speak Esperanto.

x86 is (perhaps unfortunately) pretty darned universal. Do we gain anything by using another VM?

But all this underscores your main point, which is that you can't get there from here.

Alex said...

@Anonymous, what matters is not how a boolean value is represented at the language level, but how it is represented at the VM level.

And you don't need to export everything.

If the language uses esoteric types, you only need to export a usable interface, with methods that receive and return standard types, recognized by the VM.

Also, people focus too much on the extremes, like Haskell, but let's face it, the mainstream languages are pretty similar ... I don't see many differences between Java, C#, Ruby, Python, Perl and Javascript for example.

And Lisp would actually be pretty easy to integrate in such an environment, since its basic data-structures are pretty simple ... lists, numbers, strings, and functions. And why would you use a Scheme macro from Perl?

x86 is too low-level to be of any use. A VM, for example, works with higher-level data structures and is capable of introspection.

Anonymous said...

VMS (the VAX OS) had a feature where library routines could be called from any supported language. Using the supplied compilers and a supported language, code written in C could call routines in a library compiled from BLISS or Fortran. Can GCC do this today?

A variety of languages run on the JVM and the CLR. Can two different languages running on the JVM call each other's routines through Java interfaces? That might be another way.

jaaron said...

Apache Thrift can help with cross-language process communication. It's basically a revision of Google's Protocol Buffers, reinvented and improved by Facebook. Check it out:

http://incubator.apache.org/thrift/

Anonymous said...

Why is Inline::Bad? You were sort of vague. Is this a technical issue?

Remush said...

"It always sounds cool to think about a "universal VM", but since programming languages are how programmers communicate, it doesn't seem much more feasible than trying to get everybody to speak Esperanto."
Well, now that you say that, I begin to think it would be easy after all. If a language like COBOL could generate Java code, anything should be possible. If English can be translated to Esperanto (and vice versa), why couldn't a stupid computer language be translatable?
Moreover, the first high-level languages did just that: convert the program into assembler.
The real problem is that the generated program must be able to take advantage of the different computer architectures. It would be stupid indeed if some program ran slower on a better-architected computer just because it was skilfully written for a dumber machine, one simulating system services that are provided by the hardware.
Fortunately, human brains are more flexible, and Esperanto is already adapted to our architecture.