Monday, July 7, 2008

A Unified Runtime for Dynamic Languages.

This seems like a great starting point for a unified runtime for a large set of dynamic languages. Having studied the Ruby runtime, I believe it could be implemented in a prototype-based language. Imagine Ruby, Javascript, Smalltalk, and Objective-C all living in the same runtime.


I've been chomping at the bit to do something with LLVM, and combining this runtime with JIT's for Ruby and Javascript seems like the Holy Grail to me.


Combined with all the optimizations that the Self guys pioneered, and the fact that Gnustep/Cocoa/NextStep has to be the least painful way I've found to develop GUI apps, this is something I plan on pursuing Real Soon Now.

Wednesday, June 25, 2008

VALUE: One thing to reference them all.

Common to the implementation of a lot of Lisps, Schemes, and other dynamic languages is VALUE. These usually hold a pointer to whatever is considered an object in the system. For performance reasons, some values are stored directly in the VALUE, commonly called 'immediate values'. Since pointers are 4-byte aligned on modern architectures, we know that any value with the final two bits set to 1, 2, or 3 isn't a pointer to some heap or stack allocated object, and we can use this to our advantage to store immediate values for things like integers that would be too costly to allocate. Ruby does this by left shifting integers on bit and adding one to them. Thus, value with the low bit set is an integer (Fixnum) and it's value can be found by right shifting one bit. Ruby uses a special macro for this to handle architectures that don't preserve the sign bit in right shifts. Though I don't know what architectures do this, I do know that LLVM has a instruction that specifically preserves this bit.

Also in Ruby, false, or Qfalse in the C code, is implemented as 0x00, Qtrue is 0x02, and Qnil is 0x04.

There are other ways to do this. Ages ago I worked on a system called LeLisp, which used separate memory zones for each type of object. Thus no tagging was necessary, you just rounded the address and it gave you a zone type address.

I was reading a paper the other day on implementing Scheme for microcontrollers, and it gives a good overview of the various methods of implementing VALUE.

Thursday, June 12, 2008

Lately I've been working on a new hash table implementation, combining ideas from several papers, and benchmarking every little idiom. As impressive as modern compiler optimization has become, it's still interesting to find amazing speedups you can achieve by avoiding conditionals. For example, I have a Bucket class that holds up to four items. Now I could store the count of items in the bucket, or I could count them on demand, an issue of speed versus memory. Should I choose to count them, then how I choose to count them becomes important. The naive way might look like:


size_t size(char* keys)
{
int i;
int count = 0;
for (i = 0; i < 4; ++i)
if (keys[i] > 0)
count++;
return count;
}


I won't bore you with the body of main that I used to test this with, but basically calling this MAX_INT times, (and doing something with the return so the compiler won't optimize it all away), I get these times on a 1.8 mhz core duo:


real 0m12.796s
user 0m12.785s
sys 0m0.000s


So lets look at an unconventional way to compute size:

int size(char* keys)
{
return (keys[0] != 0) +
(keys[1] != 0) +
(keys[2] != 0) +
(keys[3] != 0);
}

This version, in the same loop as the previous example, turns in the following times:

real 0m0.004s
user 0m0.004s
sys 0m0.000s

Oh, yeah. Considering this function can get called on every insertion in my modified Cuckoo Hash, this is a big improvement. Obviously the new hand unrolled version isn't as flexible wrt bucket size, so let's see how a hybrid approach might perform:

int size(char* keys, size_t bsize)
{
unsigned int count = 0;
int i;

for (i = 0; i < bsize; ++i)
count += (keys[i] != 0);
return count;
}

Okay this still performs better than the original:

real 0m2.200s
user 0m2.196s
sys 0m0.004s

Not quite as good, but still better if you need the flexibility of specifying array size at runtime. I'm sure there's a way to describe a recursive C++ template to get the benefit of the unrolled version if you are content to specify array size at compile time.

Of course, the unrolled version could also use SIMD instructions for an even greater speedup.

Wednesday, June 4, 2008

I Hate Virtual Machines


When I first started working at Amazon, a change to one of their low-level libraries would trigger a rebuild that could take a week to recompile all affected packages. My focus was improving their build times. One build tool, used to convert IDL-like descriptions to code, and might be invoked a hundred times for a given library, was written in Java. For the hell of it, I wrote a program in C++ that would produce the same output for a given input. It could be invoked eighty times for each single invocation of the original Java program. So when some Java apologist suggests that "Java is as fast as C or C++", they always fail to mention the head start you need to give them (note the speed tests in the Computer Language Shootout are for Java Server, which means I have to have the damn virtual machine running and taking up memory even when I am not using it). 

Ultimately the solution to Amazon's build times involved dependency analysis and parallelization in a multi-machine build farm. But I stand by my hatred of all current virtual machine implementations. Anyone who writes a command line tool in Java, Smalltalk, Lisp, C#, or any other language implementation that takes longer to start up than actually compute the solution should be castrated so that their defective genes are not passed on to future generations.

How about a virtual machine that did some partial evaluation to reduce the solution to something that didn't require the entire 'X' class library? In other words, a compiler that created a new smaller virtual machine based on the input text. Of course, in dynamic languages, if the evil "eval" is used, then you need to include everything. But how many real programs need this?



 


Beyond Scripting Language OO: Minimal Object Models


One of the frustrations of any object model is living with the implementation. How can I change the rules of multiple inheritance (never mind the total insanity of wanting multiple inheritance)?

If you've ever used something like CLOS, you have never realized that you are a slave to the object model of your language. Can you change the inheritance rules? Can you change the method lookup algorithm? Can you change the object structure?

In a "programmable programming language" you can change the rules of the object system, defined by a MOP.

To me the holy grail is making as many parts of the implementation "first class" as possible. Can I tweak the virtual method tables? Can I reference closures, scopes,  or stack frames as some data type that I can manipulate and change the behavior of the language.

So what sort of minimal, least restrictive implementation could we do for the platform for dynamic languages. I don't know the answer, but Ian Piumarta has written some interesting papers on the subject. Be sure to check out id-object, which combined with LLVM seems like ripe ground for experimentation.

Dynamic Language Optimizations: Polymorphic Inline Caches

The biggest performance problem with dynamic languages is that almost everything reduces to a lookup of a method call:

x = a + b
This actually works out to:
  • send the '+' message to the object referenced by 'a' with an argument of 'b'.
  • if the class of object 'a' doesn't respond to '+', consult the parent class of 'a'.
  • if the parent class doesn't respond to '+' continue up the class hierarchy until you find a class that does.
  • if no ancestor class responds to '+' see if the object referenced implements 'method_missing'.
  • if not do the same search up the ancestor tree for this method.
  • once you have found an implementation of '+' or 'method_missing', call it with the argument 'b'.
  • take the result of this method call and use it as an argument and send the '=' message to 'x' .
  • Rinse and repeat.
So the first optimization at a given call site is to cache the result of the last lookup. So next time through if 'a' is the same class as before, we already know the method pointer, assuming a long laundry list of items ('a' hasn't implemented an instance method of the same name, no one has modified the definition of this method in the parent classes, etc).

The next optimization is polymorphic inline caches. Of course, every paper by the Self guys should be studied. 


This software project goes to eleven

Here's a list of the first eleven things you do for a new software project. Note that this is "not advice, merely custom" (with apologies to Mark Twain):

  1. Come up with a cool name that hopefully you won't be sued for.
  2. Have t-shirts made for all the developers with the project name. This is especially cool since most good developers' wardrobe consists entirely of swag.
  3. Write a string class in C++, since all the other string libs suck.
  4. Decide that rather than base the internal representation on one good standard (UTF8, UTF16, ShiftJIS, etc), that you'll support all representations and fix it with implicit casting operations.
  5. Write a smart pointer and decide to retrofit the growing code base with reference counting.
  6. Decide to switch from C++ to Java, because C++ sucks.
  7. Rewrite all the code using Swing for the GUI.
  8. Decide to switch back to C++ because Java is "Write-Once Debug Everywhere."
  9. Decide that desktop apps are a thing of the past and rewrite in PHP.
  10. Realize that your PHP code is unmaintainable, and discover what PHP is really an acronym for.
  11. Rewrite your web app in Ruby on Rails and find it has one tenth the lines of code and one tenth the performance.

Actually I like Ruby and Rails, this is probably more a commentary on the past twenty years of my career than any one technology.


Imagine what success Billy Gibbons would have if he invented a programming language.
I've spent the last month of my daily bus rides exploring LLVM. It's a really great framework, and now I've gone past the toy languages in the tutorials to work on wrapping the API in Ruby. I spent some time trying to wrap the C++ API using Rice and/or rb++, the latter is not quite complete enough yet, and the former reminds me of why I enjoy Ruby so much over my days of hacking in C++, namely the horror of template error reporting in most C++ compilers:


g++ -I. -I. -I/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/universal-darwin9.0 -I. `llvm-config --cxxflags core` -I/usr/local/include -fno-common -arch i386 -Os -pipe -fno-common -Wall -g -c llvm.cpp
/usr/local/include/rice/detail/../detail/to_ruby.ipp: In static member function ‘static Rice::Object Rice::detail::to_ruby_::convert(const T&) [with T = llvm::Constant*]’:
/usr/local/include/rice/to_from_ruby_defn.hpp:67: instantiated from ‘Rice::Object to_ruby(const T&) [with T = llvm::Constant*]’
/usr/local/include/rice/detail/../detail/Auto_Member_Function_Wrapper.ipp:1188: instantiated from ‘static VALUE Rice::detail::Auto_Member_Function_Wrapper::call(VALUE, VALUE, VALUE) [with Func_T = llvm::Constant* (llvm::Module::*)(const std::string&, const llvm::FunctionType*), Ret_T = llvm::Constant*, Self_T = llvm::Module, Arg1_T = const std::string&, Arg2_T = const llvm::FunctionType*]’
/usr/local/include/rice/detail/../detail/Auto_Member_Function_Wrapper.ipp:1171: instantiated from ‘Rice::detail::Auto_Member_Function_Wrapper::Auto_Member_Function_Wrapper(Func_T, const Rice::detail::Exception_Handler*) [with Func_T = llvm::Constant* (llvm::Module::*)(const std::string&, const llvm::FunctionType*), Ret_T = llvm::Constant*, Self_T = llvm::Module, Arg1_T = const std::string&, Arg2_T = const llvm::FunctionType*]’
/usr/local/include/rice/detail/../detail/wrap_function.ipp:89: instantiated from ‘Rice::detail::Wrapped_Function* Rice::detail::wrap_function(Ret_T (Self_T::*)(Arg1_T, Arg2_T), const Rice::detail::Exception_Handler*) [with Ret_T = llvm::Constant*, Self_T = llvm::Module, Arg1_T = const std::string&, Arg2_T = const llvm::FunctionType*]’
/usr/local/include/rice/detail/../detail/define_method_and_auto_wrap.ipp:16: instantiated from ‘void Rice::detail::define_method_and_auto_wrap(VALUE, const char*, Fun_T, const Rice::detail::Exception_Handler*) [with Fun_T = llvm::Constant* (llvm::Module::*)(const std::string&, const llvm::FunctionType*)]’
/usr/local/include/rice/detail/../Module_impl.ipp:66: instantiated from ‘Derived_T& Rice::Module_impl::define_method(const char*, Func_T) [with Func_T = llvm::Constant* (llvm::Module::*)(const std::string&, const llvm::FunctionType*), Base_T = Rice::Data_Type_Base, Derived_T = Rice::Data_Type]’
llvm.cpp:51: instantiated from here
/usr/local/include/rice/detail/../detail/to_ruby.ipp:7: error: creating array with size zero (‘no_to_ruby_conversion_defined’)
make: *** [llvm.o] Error 1

Clear as mud, no? 

I also gave swig a try, but it choked on the C++ headers. I'll probably give it a try using the C API, or just create some Ruby to parse the c headers and generate an interface.



Saturday, May 31, 2008

Dynamic Language Optimizations: Memoization

The real raison d'ĂȘtre for this blog is to simply record my progress and notes in my current exploration of dynamic language implementation and optimization. First up I'd like to talk about memoization, which is basically trading space for time. If we have a function or method that is not affected by external variables, i.e. it references no global values or instance data,, then we can cache the result of each call to the function, and on subsequent calls with the same input, return the cached value instead of redoing the computation.

For example, here's a naive recursive implementation of the Fibonacci series in Ruby and in C, along with execution times:

def fib(n)
if n == 0 || n == 1
n
else
fib(n-1) + fib(n-2)
end
end

36.times do |i|
puts "n=#{ i} => #{fib(i)}"
end

real 0m54.742s
user 0m54.613s
sys 0m0.077s

#include "stdio.h"

int fib(int n)
{
if (n == 0 || n == 1)
return n;
else
return fib(n-1) + fib(n-2);
}
main()
{
int i;
for (i = 0; i &lt; 36; ++i)
printf("%d =&gt; %d\n",i,fib(i));
}

real 0m0.666s
user 0m0.663s
sys 0m0.002s


"OMG! Ruby is eighty times slower than C!", I hear you say. But I would add, for certain values of X, specifically those values that are in the set of stupid benchmarks.

So let's apply memoization to this Ruby script, a whopping two additional code lines and time it again:


def fib(n)
@cache ||= Hash.new
return @cache[n] if @cache.has_key? n
if n == 0 || n == 1
@cache[n] = n
else
@cache[n] = fib(n-1) + fib(n-2)
end
end

36.times do |i|
puts "n=#{ i} => #{fib(i)}"
end

real 0m0.009s
user 0m0.005s
sys 0m0.004s


Hmmm. With a little less naivete, our intrepid Ruby programmer has made his version run over seventy times faster than the naive C version. I suppose we could improve the C version:

#include "stdio.h"
#include "stdlib.h"

int fib(int n)
{
static int length = 0;
static int* cache = 0;

if (length &lt; n) {
cache = malloc(sizeof(int)*(n+1)*2);
length = (n + 1) * 2;
}
if (n == 0 || n == 1)
return n;
else if (cache[n] == 0)
cache[n] = fib(n-1) + fib(n-2);

return cache[n];
}
main()
{
int i;

for (i = 0; i &lt; 36; ++i)
printf("%d => %d\n",i,fib(i));
}
real 0m0.004s
user 0m0.001s
sys 0m0.003s


Looks like it can help C also. But the improvement to Ruby is more drastic. In fact, Ruby has a gem called memoize that allows you to wrap methods you wish to cache the result of.

Wouldn't it be nice if our compiler or interpreter could recognize when it could cache method returns automatically? In fact there has been a bit or more
of research on this.

Frist Psot!

Just what Teh Internets needs, another blog (well and maybe some of this).