May 01, 2009

The future of Rails is Ruby 1.9 - real performance of 1.8, JRuby and 1.9 compared

This is the second post of my pre-RailsConf series of blog posts on Rails application performance optimization. I'll be presenting at RailsConf about this and many other aspects of performance optimization on May 6th, so you're welcome to join me at RailsConf and please watch for more performance-related articles in this blog.

See also the previous article in this series on improving Date class performance.


We saw lots of synthetic benchmarks proving that alternative ruby implementations (namely the only two finished ones - JRuby and Ruby 1.9) are faster than good old 1.8 aka MRI. But nobody so far ported the production application to prove they are faster in the real world conditions. We did.Turtle

We ported Acunote - our online enterprise project management and Scrum software to both JRuby and Ruby 1.9 and ran our set of performance benchmarks. We have 30 benchmarks that check the server response time for most typical user requests. We use those benchmarks to check for performance regressions, and this time we used them to check the Acunote's performance with alternative rubys.

"Ruby 1.9 is 2x faster than 1.8. But the real world improvements are even better."

There're four common request types for Acunote, ones that primarily involve date/time operations, rendering, numerical calculations and ActiveRecord/database operations.

Here are the timings for four requests, one for each type:

Acunote Benchmark Request Time, sec Improvement
1.8.6 JRuby 1.2.0 1.9.1 JRuby 1.9.1
Date/Time Intensive 1.23 0.58 0.53 2.1x 2.3x
Rendering Intensive 0.61 0.44 0.30 1.4x 2.0x
Calculations Intensive 2.57 1.79 1.33 1.4x 1.9x
Database Intensive 5.58 4.63 3.29 1.2x 1.7x

Here are those timings expanded with rendering and database time:

Acunote Benchmark Request Time / Rendering Time / DB Time, msec
1.8.6 JRuby 1.2.0 1.9.1
Date/Time Intensive 1230 / 0 / 65 580 / 1 / 45 530 / 1 / 37
Rendering Intensive 610 / 328 / 33 440 / 273 / 45 300 / 197 / 37
Calculations Intensive 2570 / 380 / 38 1790 / 215 / 54 1330 / 204 / 39
Database Intensive 5580 / 529 / 700 4630 / 209 / 1321 3290 / 221 / 685

As you can see from the table, Ruby 1.9 is 2x faster than 1.8. But the real world improvements are even better. Ruby 1.9 eliminates performance bottlenecks that were present in 1.8, e.g. Date class, numerical and string computations, template rendering. It lets us write more complex systems in Ruby than were ever possible before. In the immortal words of Avi Bryant, it lets us add a whole 'nother level of turtles.

The performance improvement from JRuby is not as substantial, but it's still 1.5x faster. One of the reasons for JRuby being slower than 1.9 is PostgreSQL JDBC driver and not-working "fast" mode.

Want to know why 1.9 and JRuby are so much faster? Are there any other improvements and regressions with alternative Ruby's? What happened to the garbage collection performance? To learn the answers, join me at my RailsConf session or watch for the more detailed analysis in this blog.

July 14, 2008

Why You Should Use 'Cache-Control: public' for Caching to Work with SSL

We've been using long Expires and Cache-Control: max-age headers to mark our javascripts and css as browser-cacheable for quite some time. This way the browser only does full GETs on these once, and avoids conditional GETs except on page refresh. This only requires a few lines of configuration in Rails (asset timestamps and asset packaging) and nginx and generally works well.

Except when it does not. We've had some bug reports that in some cases under SSL these files were not cached at all. Meaning, just going to a different page in our application would make browser do an unconditional GET to refetch css and javascript. Naturally, this would hurt the performance.

Now, SSL in itself makes things worse. Firefox only caches SSL content in memory not on disk, meaning if the user closes the browser the cache goes away. But the reported behavior was worse than that -- no caching at all.

It took us a while, but we did manage to chase it down. It seems that Firefox divides SSL content into page marked with Cache-Control: public, which have priority for caching, and the rest. When you open enough tabs, say 50+, with enough rich pages in Firefox, its in-memory cache fills up and refuses to cache non-public SSL content. We haven't had time to fully chase it down in the FF codebase, but what's likely to happen is that FF actually does put in into cache, then realizes that cache is full, looks for pages to evict, and right away evicts the newly added content. Since SSL content is not cached on hard disk, once the memory cache is full, the non-pubic SSL files are not cached at all. (Note that's just a hypothesis, we'd be most curious to know what actually happens and why.)

The server-side workaround is to add Cache-Control: public to all your SSL content you want cached. We do:

# nginx configuration for static content  
location /static {
    expires 1y;
    add_header Cache-Control public;
}

This bug is present in Firefox 2 and 3, but as a user you can configure them to work around the problem by setting browser.cache.memory.capacity to increase the size of memory cache or using browser.cache.disk_cache_ ssl to enable disk cache for SSL pages. The default 24M memory cache size for systems with 2G of RAM is really small, especially with FF3 being able to handle hundreds of tabs well. Note that the disk_cache option works only on Firefox 3 because of some bugs in earlier browser versions.

Small remark for Acunote users - you don't have to do anything about this. We enabled Cache-Control: public and your Firefox will cache javascript and css files no matter how many tabs you have.

February 13, 2008

Make Rails Associations Faster by Optimizing Named Blocks and String Callbacks

In our previous articles we described how Rails spends much of its time garbage collecting, and that significant speedup can be achieved by memory profiling and fixing memory allocation hotspots. In this article, we'll describe couple more such hotposts dealing with named block parameters and associations, and provide the patches.

Named Block Parameters Considered Harmful (for Performance)

We already wrote that passing a block to a method of ActiveRecord::Associations::HasManyAssociation instance and its friends chews up the memory. For example, a single call to association.select { |record| record.new_record? } can allocate up to 10K of memory depending on the association size. A brief look at associations source reveals that Rails itself has similar code in many places.

Each association is a proxy to the actual array of associated object(s). It seems like method_missing is a good way to implement proxy pattern in Ruby and indeed that's what Rails does. The proxy contains an array of associated objects and sends all missing methods in the proxy into that array. If we simplify the Rails code, we'll see something like this:

class Association
    def method_missing(method, *args, &block)
        @array.send(method, *args, &block)
    end
end

At first, we couldn't understand why this would be slow, but after some digging we got it. Each named &block parameter requires extra processing. Ruby creates a Proc object that represents the block passed and adds a Binding object with the local execution context to that Proc. In an empty Ruby script without any variables defined binding will be around 400 bytes. In actual Rails application bindings may grow up to 10K in size. Now imagine you're doing something with AR object and its association in a loop 100 times. Bah! 1 megabyte of memory is gone.

Each Ruby block is a closure, and it captures its complete environment at the time of creation. Ola Bini has a great article on this. So is all hope lost? No -- turns out that MRI has different implementations for named and anonymous block parameters. When calling a function which takes anonymous block, it simply stores a reference to the caller's stack frame. It's OK to do that since the callee is guaranteed to exit before caller's stack frame is popped. When calling a function that takes a named block MRI assumes that this block may be long-lived and clones the environment right there. So anonymous block parameters are much more efficient than named block parameters. Also see related discussion on Ruby Forum.

The optimization to Rails Association is simple - just pass a new block and yield the old one inside:

class Association
    def method_missing(method, *args)
        @array.send(method, *args) { |*block_args| yield(*block_args) if block_given? }
    end
end

This not only saves memory, but runs faster. I've benchmarked that on Acunote copying 120 objects (each with 6 associations) using ActiveRecord.

With named block parameters:

Benchmark Copy 120
memory: 97527K total in 1698240 allocations, GC calls: 13, GC time: 977 msec
time: 3.25 ± 0.05

With yields:

Benchmark Copy 120
memory: 92670K total in 1636677 allocations, GC calls: 12, GC time: 901 msec
time: 3.15 ± 0.05

As the result, 5 megabytes of memory and 100msec saved for good.

That's Cool! Where's The Patch?

String Callbacks Considered Harmful (for Performance)

This one is even more interesting. Rails allows for string callbacks in before_save, after_save, before_destroy and so on in ActiveRecord models. Each such callback is a string that is evaluated in the context of AR object. Let me cite Rails callbacks.rb source here:

...
def callback(method)
    notify(method)

    callbacks_for(method).each do |callback|
        result = case callback
            when Symbol
                self.send(callback)
            when String
                eval(callback, binding)
            when Proc, Method
                callback.call(self)
            else
            ...

You see, to evaluate the string we need to get the binding. And as we all remember from our named block parameter discussion, the binding takes memory. Even when you don't use string callbacks yourself, Rails associations automatically create them for you.

For example, has_many will define 4 string callbacks. You'll get before_save, after_create and after_update to assure that new associated records are saved when its parent record is saved; and also you'll get one for before_destroy that destroys dependent objects or nullifies their foreign keys.

Rewriting string callbacks into symbol callbacks gives a tangible performance boost. I did that change and benchmarked Acunote again.

With string callbacks in associations:

    Benchmark Copy 120
    memory: 92670K total in 1636677 allocations, GC calls: 12, GC time: 901 msec
    time: 3.15 ± 0.05

With symbol callbacks in associations:

    Benchmark Copy 120
    memory: 39108K total in 944764 allocations, GC calls: 6, GC time: 479 msec
    time: 2.45 ± 0.05

Whoa! Rewriting string callbacks to symbol callbacks saved 52 megabytes and gave 0.7 sec speedup. Nice!

That's Cool! Where The Patch?

February 06, 2008

Make Your Ruby/Rails App Fast: Performance And Memory Profiling Using ruby-prof and KCachegrind

Why Profiling and What to Profile?

Last week we showed how we sped up Acunote (our agile project management application) using memory profiling. That was the fish -- you can apply patches from that article and likely get nice boost for your app (we got 5x). Today we give you the net and show how to identify performance and memory usage bottlenecks in your Ruby/Rails application.

Profiling performance and memory separately dramatically simplifies optimizing ruby code, and turns it into a deterministic profile/fix/go-faster/repeat loop, rather than (profile or guess)/fix/(get disappointed)/(get discouraged) loop it was before. Read on to learn how.

Preparations And Tools

First, get Ruby 1.8.6 sources, patch them, compile and install. Ruby 1.8.5 will work the same patch too. Vanilla Ruby GC doesn't give us enough information, hence the need for a patch (composed from Sylvain Joyeux's memory allocations patch, Stefan Kaes' Railsbench GC patch, and my own memory counter; the patch applies to both Ruby 1.8.5 and 1.8.6). It's a good idea to install patched Ruby into its own prefix (say, configure --prefix=~/bin/ruby) and to adjust the $PATH every time you need to profile.

Second, install the recently released ruby-prof 0.6.0. gem install ruby-prof or gem update ruby-prof should do the job. ruby-prof 0.5.2 does not have the memory profiler we'll need, but you can patch it if you really need to. It is important to have and use patched ruby before you start installing ruby-prof gem, otherwise installed gem will not have support for memory profiling compiled in.

Measure/Benchmark

Before profiling you need to measure the current state. Forget about Rails logger and request time recorded there. You need information about:

  • request time;
  • size of memory allocated during the request;
  • number of objects allocated during the request;
  • number of GC collections;
  • time spent in GC.

To get request time we'll use Benchmark.realtime {} The get memory use and GC stats we'll use this helper:

#Executes block and collects GC statistics during the block execution.
#Collected stats are printed to stdout (or to the file set in $RUBY_GC_DATA_FILE env var):
# - allocated memory size (in KB) during block execution
# - number of memory allocations during block execution
# - number of GC collections during block execution
# - time (in milliseconds ) spent in GC
#
#Description string appears in stdout before statistics
#Options are
# - :disable_gc => true    - disables GC during execution
# - :show_gc_dump => true  - shows GC heap dump after statistics
def gc_statistics(description = "", options = {})
    #do nothing if we don't have patched Ruby GC
    yield and return unless GC.respond_to? :enable_stats

    GC.enable_stats || GC.clear_stats
    GC.disable if options[:disable_gc]

    yield

    stat_string = description + ": "
    stat_string += "allocated: #{GC.allocated_size/1024}K total in #{GC.num_allocations} allocations, "
    stat_string += "GC calls: #{GC.collections}, "
    stat_string += "GC time: #{GC.time / 1000} msec"

    GC.log stat_string
    GC.dump if options[:show_gc_dump]

    GC.enable if options[:disable_gc]
    GC.disable_stats
end

I personally find it Zen to create a special kind of integration tests for benchmarks (let's call them "performance tests").

Using gc_statistics helper a performance test can look like this:

class PostsPerformanceTest < ActionController::IntegrationTest
    def test_index
        puts "posts/index:"
        benchmark_time :get, 'posts/index'
        benchmark_memory :get, 'posts/index'
    end

    def benchmark_time(method, url)
        measured_times = []
        10.times { measured_times << Benchmark.realtime { send(method, url) } }
        puts "time: #{mean(measured_times).to_02f} ± #{deviation(measured_times).to_02f}\n"
    end

    def benchmark_memory(method, url)
        gc_statistics("memory: ") { send(:method, url) }
    end

    def mean(values)
        values.sum / values.length
    end
    def deviation(values)
        m = mean(values)
        Math.sqrt(values.inject(0){|sum, a| sum + (a - m)**2} / values.length)
    end
end

Sample output from such test will look like:

posts/index:
time: 3.50 ± 0.01
memory: allocated: 150246K total in 1650700 allocations, GC calls: 19, GC time: 1500 msec

The results as in example output above suggest optimizing memory first because it's possible to save more than a second by producing less garbage and doing less garbage collections (19 is too much). My experience in optimizing Rails apps tells that normally you should aim at 1-2 GC calls per request and ideally you can get no GC collections at all. Of course, some long requests may inevitably use more memory but the rule of thumb is: less GC collections is better.

How to Profile

Profiling with ruby-prof is very similar to measurements we've seen above. Integration tests again become the Zen way to profile (let's call them "profiling tests").

Profiling test should:

  • run ruby-prof to generate processing time profile;
  • run ruby-prof to generate memory profile;
  • output profiling results in a suitable format (preferably, calltree).

For example, profiling test can look like this:

require 'ruby-prof'

class PostsController
    alias :index_orig, :index
    def index
        RubyProf.start
        index_orig
        result = RubyProf.stop

        measure_names = { RubyProf::MEMORY => 'memory', RubyProf::PROCESS_TIME => 'time' }

        printer = RubyProf::CallTreePrinter.new(result)
        printer.print(File.open("callgrind.out.posts_index_#{measure_names[RubyProf::measure_mode]}", 'w'))
    end
end

class PostsProfilingTest < ActionController::IntegrationTest
    def setup
        GC.disable
    end
    def teardown
        GC.enable
    end

    def test_index
        profile :get, 'posts/index'
    end

    def profile(method, url)
        RubyProf.measurement_mode = RubyProf::PROCESS_TIME
        send(method, url)
        RubyProf.measurement_mode = RubyProf::MEMORY
        send(method, url)
    end
end

Note that we do 2 separate profiler runs, one for PROCESS_TIME and one for MEMORY. That's because the profiler only supports measuring one thing at a time.

There are a few more things to note. The code disables GC for the time of request. Otherwise ruby-prof attributes full time cost of garbage collection to the method where GC gets triggered, not the methods that allocate the memory. For profiling we need deterministic behavior without GC. See original article for more info on this.

Second, the test records profiler information only for controller action. There's no point in profiling Rails integration testing framework and GCI dispatcher when we need to profile only our application.

Finally, and most importantly, the code outputs results in the calltree format for use with excellent KCachegrind profiler data visualization tool by Josef Weidendorfer.

Using KCachegrind

With KCachegrind you get all the goodies:

  • the list of methods sortable by both their total time and self time;
  • call graph visualization;
  • easy access to callers and callee's of any given function;
  • cycle detection in the code;
  • profiler-annotated source code listings;
  • more stuff like search, relative and percentage cost calculations, etc.

In short, with KCachegrind it is much easier to understand profiler output compared with using plain text and HTML formatted ruby-prof results. I guess the example screenshot below will be the best proof of that.
Kcachegrind

KCachegrind is a part of KDE kdesdk distribution. Linux'ers should not have any troubles installing it. On MacOS X you can install KCachegrind from either MacPorts with port install kcachegrind or from Fink. In both cases Apple X11 is necessary. Those who still are on Windows may try installing KCachegrind from KDE-Cygwin or resort to WinCacheGrind or even to HTML output.

By this point you should get all the data and only a few simple things left to do. KCachegrind's function profile sidebar will show you which methods took long to execute and which methods ate the memory. Call stack sidebar and call tree will show you where those methods are called from. Source listing will help you to traverse the call tree and locate hotspots. This is all you need to understand where you app is slow and where it eats lots of memory.

Next step is to fix these problems.

How to Optimize

Unlike profiling, optimization is not deterministic and there's no universal recipe on how to fix the slowdowns. There're some common tricks and although they are off topic here, I'll give you some most important ones:

  • look for Kernel#clone in profiler and fix your code - never clone objects;
  • do not use ActiveRecord::Base::attributes method, it clones attributes;
  • don't use named block parameters when you can, replace def myfun(&block) declarations with def myfun;
  • avoid stuff with O(n2) performance like acts_as_list and be careful with acts_as_tree as well;
  • execute fewer queries, especially with ActiveRecord;
  • don't use link_to_function, text_field_tag, check_box_tag and image_tag if you can;
  • preload objects with :include => '' find parameter where possible;
  • don't render too many partials.

Complete explanation of the hints is well worth the separate blog post, but that's for the future.

Optimize Today!

I hope you'll enjoy profiling and optimizing your application like I did. This is a greatly rewarding work, in some cases I got 5x improvements using the techniques described above. Some of the findings (including Ruby and Rails optimizations) are described in two previous blog posts:
Guerrilla's Guide to Optimizing Rails Applications
Garbage Collection is Why Ruby on Rails is Slow: Patches to Improve Performance 5x; Memory Profiling

In any case, if you have your own suggestions or improvements to the method, optimization tricks, please share them. I'll post them here.

January 31, 2008

Guerrilla's Guide to Optimizing Rails Applications

The Battle For Performance

Rails is slow. Your rails application which does fairy simple database operations will be born slow and will eventually become slower and slower while still not doing anything fancy. That's like getting a new powerful locomotive (on Rails) together with a free bonus of overladen freight cars.

Having said that, I can virtually hear the herd of voices screaming "dude, optimize your own application and don't blame Ruby and Rails for everything". Sure thing, our application isn't perfect and I do believe it can be optimized but you can't win the performance battle just by that.

The winning tactics for this battle is guerrilla tactics. It turns out you can guerrilla-patch (aka monkey-patch) Ruby and Rails and easily get up to 3x speedup in some cases. We started doing that to our own application - Acunote and got amazing results so far. Read on to find out how.

Ruby-Prof: The Guerrilla's Weapon

Acunote, our pet rails application, often need to copy a bunch of ActiveRecord (AR) objects. As any other project management tools, our application works with tasks organized into iterations (sprints). Sometimes you want to copy tasks from current iteration/sprint to the next iteration/sprint. And boy, that copy can become really slow. As an example, let's look at the performance and profiling data of one request to copy 120 tasks.

Let's measure 3 things:

  • the real time request takes in production environment;
  • processing time profile information using the excellent ruby-prof profiler by Charlie Savage and Shugo Maeda;
  • memory profile information gathered with patched ruby-prof (I took Sylvain Joyeux's memory allocations measurement mode patches and hacked together my own ruby-prof memory size measurement mode).

So, for "copy 120" request we get:

Request time (without profiler): 13.6 sec.

Time:
 Total: 16.78
 %self     total     self     wait    child    calls  name
 18.36      3.34     3.08     0.00     0.26    20766  Kernel#clone
  ...

Allocated Memory Size (in KB):
 Total: 1 001 730.00
 %self     total       self     wait    child    calls  name
 61.87  646795.00 619745.00     0.00 27049.00    20766  Kernel#clone
  ...

"WTF" politely describes my initial reaction when I first saw that. How on earth we can spend 17 seconds and consume 1 Gigabyte of memory by just copying 120 ActiveRecord objects? Ok, ok, we not only copy them but store some more information to the database like task tags, copy events and so on. Therefore we end up saving actually 360 AR objects instead of 120. Also the memory profiler is inaccurate and usually shows a bit more memory than the process actually takes.

Still, what is going on during those 17 seconds? Why do we need a gig of RAM? Gee, on the same machine I can compile the whole ruby interpreter in 57 seconds and the compiler will never take more than 130M...

Heh, let's calm down and see what profiler says. The obvious suspect is Kernel#clone method. The (stripped) call tree for it as reported by ruby-prof is

Task#save
...
ActiveRecord::Base#create_without_callbacks
ActiveRecord::Base#attributes_with_quotes
ActiveRecord::Base#attributes
ActiveRecord::Base#clone_attributes
ActiveRecord::Base#clone_attribute_value
Kernel#clone

From Charlie's blog we know that AR::Base::attributes method is evil. It clones attributes before returning them and Charlie gave a wise advice to not use it. And indeed in your application is a good idea to call, for example, task['foo'] instead of task.foo or task.attributes['foo']. But here AR::Base::create itself does the evil thing.

Let me cite Rails code (from rails/activerecord/lib/activerecord/base.rb):

def update
    connection.update(
        "UPDATE #{self.class.table_name} " +
        "SET #{quoted_comma_pair_list(connection, attributes_with_quotes(false))} " +
        "WHERE #{self.class.primary_key} = #{quote_value(id)}",
        "#{self.class.name} Update"
    )
end

def create
    if self.id.nil? && connection.prefetch_primary_key?(self.class.table_name)
        self.id = connection.next_sequence_value(self.class.sequence_name)
    end
    self.id = connection.insert(
        "INSERT INTO #{self.class.table_name} " +
        "(#{quoted_column_names.join(', ')}) " +
        "VALUES(#{attributes_with_quotes.values.join(', ')})",
        "#{self.class.name} Create",
        self.class.primary_key, self.id, self.class.sequence_name
    )
    @new_record = false
    id
end

def quoted_column_names(attributes = attributes_with_quotes)
    attributes.keys.collect do |column_name|
        self.class.connection.quote_column_name(column_name)
    end
end

def attributes_with_quotes(include_primary_key = true)
    attributes.inject({}) do |quoted, (name, value)|
        if column = column_for_attribute(name)
            quoted[name] = quote_value(value, column) unless !include_primary_key && column.primary
        end
        quoted
    end
end

Great! AR::create clones all attributes twice (first with quoted_column_names and then with attributes_with_quotes). AR::update is a nicer guy, it clones attributes only once. For each of those 120 copied tasks we use AR::create twice and AR::update once. Therefore we call clone_attributes not less than 120*2*2+120 = 600 times.

Profiler says we lose 3 seconds and 650M memory because of that:

ActiveRecord::Base#clone_attributes:
Time:          3.21 sec
Memory:  650 301.00 K

Ok. Time to employ our promised guerrilla tactics, but first let's talk about memory.

Lyrical Digression: Why Do We Care About Memory?

While guerrilla warriors take a nap before the battle, let's digress and look at why we measure the size of allocated memory. Sure thing, allocating literally tons of memory is bad. But we're using Ruby and Ruby has its own guerrilla warrior which strikes our application when it innocently eats its tasty memory.

The name of our enemy is "Ruby Garbage Collector" and the cunning enemy it is. As ruby's gc.c source says and Why the Lucky Stiff explains, ruby GC kicks in when you allocate more memory than is available on the heap and heap boundary is 8M. When we sequentially allocate 650 megabytes there's a chance that GC will be called 650/8=81 times. Each GC call accounts for approximately 70-80ms which in our case should add roundabout 5 seconds to the running time.

Let's do some math... Kernel#clone itself takes 3 seconds and we expect garbage collection to take 5 more seconds. In total, 8 seconds are wasted because of cloning. See the trend? The more memory we allocate the worse our performance is.

Here's the morale: "watch your memory". Large memory consumption costs you more than you'd otherwise think because of garbage collection.

Guerrilla patching AR::create and AR::update

Now it's time to employ our guerrilla tactics and fix ActiveRecord. It's easy to rewrite attributes_with_quotes and avoid cloning the attributes:

module ActiveRecord
    class Base
    private
        def attributes_with_quotes(include_primary_key = true)
            result = {}
            @attributes.each_key do |name|
                if column = column_for_attribute(name)
                    result[name] = quote_value(read_attribute(name), column) unless !include_primary_key && column.primary
                end
            end
            result
        end
    end
end

Now let's put that to environment.rb and see what we get:

Request time (without profiler): 6.7 sec.

Time:
 Total: 9.5
 %self     total     self     wait    child    calls  name
  4.84      0.69     0.46     0.00     0.23    24023  ActiveRecord::ConnectionAdapters::Quoting#quote
  ...
  2.22      0.52     0.18     0.00     0.34     2779  <Module::Benchmark>#realtime
  ...
  0.00      0.00     0.00     0.00     0.00        6  Kernel#clone
  ...

Allocated Memory Size (in KB):
 Total: 350 966.00
 %self     total       self     wait    child    calls  name
 34.31  125655.00 120420.00     0.00  5235.00     2779  <Module::Benchmark>#realtime
  ...
  0.00       0.00      0.00     0.00     0.00        6  Kernel#clone
  ...

Ok, much better: 650 megabytes of memory saved with that little monkey patch! We estimated that will save 8 seconds. Let's look:

Original request time:          13.6 sec
Request time after AR patch:     6.7 sec
Time saved:                      6.9 sec

It turns out our estimation was almost correct. We indeed saved not only the time taken by Kernel#clone but also the time spent by garbage collector and gained 2x speedup for free. Nice! Our guerrilla tactics clearly works and we can for sure do better than 2x.

Guerrilla-patching Benchmark.realtime

Next suspect, as seen from memory profile above, is Benchmark#realtime. From time profiler the method doesn't look bad. We spend only 180 milliseconds in it but look at memory profiler! It 180ms our Benchmark#realtime friend allocates 120 megabytes of memory. Guess what? Ruby garbage collector will be happy to kick our ass again at least 15 times and that estimates to about 1 second loss.

Let's see what's wrong in benchmark.rb:

def measure(label = "")
    t0, r0 = Benchmark.times, Time.now
    yield
    t1, r1 = Benchmark.times, Time.now
    Benchmark::Tms.new(t1.utime  - t0.utime, t1.stime  - t0.stime,
        t1.cutime - t0.cutime, t1.cstime - t0.cstime, r1.to_f - r0.to_f, label)
end

def realtime(&blk)
    Benchmark::measure(&blk).real
end

Great, we calculate lots of useless information (like cpu time, user time, etc) in Benchmark#measure and then just throw it away. Let's do better, simpler and more memory efficient:

module Benchmark
    def realtime
        r0 = Time.now
        yield
        r1 = Time.now
        r1.to_f - r0.to_f
    end
    module_function :realtime
end

That's it. No extra object creation. And here's our income:

Request time (without profiler): 5.8 sec.

Time:
 Total: 8.63
 %self     total     self     wait    child    calls  name
  4.29      0.69     0.46     0.00     0.23    24023  ActiveRecord::ConnectionAdapters::Quoting#quote
  ...
  0.35      0.19     0.03     0.00     0.16     2779  <Module::Benchmark>#realtime
  ...

Allocated Memory Size (in KB):
 Total: 225 668.00
 %self     total       self     wait    child    calls  name
 43.78   98805.00  98805.00     0.00     0.00     5280  Kernel#binding
  0.00    1918.00      2.00     0.00  1916.00     2779  <Module::Benchmark>#realtime
  ...

Whoa! 120 megabytes and 1 second were saved with 9-liner guerrilla patch. Nice, isn't it? Instead of original 13.6 sec our request now runs in 5.8 sec. That's more than 2x improvement already and we can do even better, but I'll leave that for the next blog.

When Guerrilla Tactics Help

Nothing with Ruby and Rails is wrong until you do things in the loop. For example, Benchmark#realtime doesn't have any impact if all you do is save one AR object. You just allocate 25 extra objects that take 45 extra kilobytes. But you're doomed once you do that in a loop. Do the benchmark 200 times and you'll guaranteed the garbage collector will happily shoot you in the back.

Also bear in mind that garbage collector should not be running while you profile your code. With GC you'll never get repeatable results because it will kick in at different times randomly increasing the time and memory consumption of your methods.

Make Your Application Fast: Optimize MEMORY

Let me reiterate, by optimizing memory usage instead of (or in parallel with) processing time you can get significant performance improvement, and guerrilla patching memory-inefficient bits is the right way to make Ruby and Rails faster I think we're on the right track here. Memory is what we need to optimize. Memory is what wasn't optimized before. And you've just seen what we can get by optimizing memory.

We (Acunote devs) will continue the memory profiling and will blog as soon as we get new results. Our previous blog post briefly summarize our current results, be sure to read it if you haven't done that yet. Meanwhile, don't wait for us, grab ruby-prof, my memory profiler patches and dig by yourself ;) I bet you'll find more big memory eaters and when you do so, please drop us a note. Let's collect our guerrilla patches and then file them upstream.

Garbage Collection is Why Ruby on Rails is Slow: Patches to Improve Performance 5x; Memory Profiling

  • The News: Ruby on Rails performance is dominated by garbage collection. We present a set of patches to greatly improve Rails performance and show how to profile memory usage to get further performance gains.

  • What's at Stake: Rails is slow for many uses and did not lend itself well to optimization. Significant performance gains could only be achieved at application level at large development cost.

  • The Upside:

    • 5x potential performance gains;
    • easy way to identify whether GC is a bottleneck;
    • deterministic process to fix memory bottlenecks;
    • set of canned patches to solve the biggest problems;
    • you can help

I don't have time, make my app 5 times faster!

  • Apply the monkey patches below. Enjoy.
  • Time your app with garbage collection disabled to figure out how much more performance you are leaving on the table. Enjoy.
  • Come back for more patches. Enjoy.

What's going on here?

Ruby runtime uses garbage collection. Your code allocates memory as it runs. From time to time garbage collector kicks in, halts the whole system, and cleans up no-longer-referenced memory using mark-and-sweep algorithm.

GC gets triggered by multiple conditions. The one that matters the most is 8M trigger. Every time you allocate 8M of memory GC runs. Complex Rails request can allocate hundreds or even thousands of megabytes of memory, making GC runs dozens of times. Each GC pass takes 50-150ms. You do the math.

It's possible to patch ruby interpreter, increase the triggers and reduce frequency of garbage collection at the cost of additional memory use. This helps, but does not resolve the problem. If there is more garbage, each GC run takes longer. We'll devote a separate post to this later.

But, I am running ruby-prof already, you ignorant fool!?

And so are we. It does not really help:

  • ruby-prof attributes full cost of garbage collection to the method where GC gets triggered, not the methods that allocate the memory. That's like blaming the last straw for breaking the camel's back.
  • ruby-prof does not show that garbage collection took place and how much time it took. All you can tell is that some method took a long time.

If you go by the these result, and remove the straw that's breaking the camel's back his load does not get much lighter.

Statistics being on your side, things are not entirely hopeless. GC gets triggered within malloc, so methods doing a lot allocations are also likely to trigger GC. Further, simply allocating memory is relatively expensive, and that will also show up in profiler output.

You might get lucky, optimize ruby-prof hotspot and get better performance, but more often than not you will fail. The correlation is just too weak. We believe this is exactly why Rails resisted performance optimization. Up till now.

Profiling memory allocations

Ruby-prof is almost the right tool for memory profiling. "Almost" is because the free cheese is only found in the mousetrap. You'll have to do some work first.

You need to build patched ruby (patch here) and latest ruby-prof (patch here). With both tools patched, set RubyProf.measure_mode = RubyProf::MEMORY before running the profiler.

Make sure to turn off GC when profiling for pure performance. You'll get get much more representative results.

The detailed HOWTO and description on the process is in our next blog post.

Will it help me?

In general your code will benefit most if you:

  • Create or update a lot of rows using ActiveRecord
  • Run large number of SQL statements of any kind using ActiveRecord, e.g. load many AR object through separate #find calls
  • Use ActiveView helpers, especially in a loop, to generate complex HTML
  • Use has_many AR association

It's easy to tell for sure if you have performance tests. All that "performance test" means is that you can run a single request and time how long it takes.

Just run the same test as usual, and then with GC disabled and compare the results. Most code will run much faster with GC disabled. E.g. for Acunote we test how long it takes to copy 120 tasks from one sprint from another.

Before optimizations with GC:   14 sec
Before optimizations w/out GC:   6 sec

You can't run ruby without GC in production. It will rapidly run out of memory and crash. You can, however, apply optimization, produce less garbage and get your code be even faster than original benchmark with GC turned off!

After optimizations with GC:     3 sec

The code that allocates little memory, but is algorithmically complex and is slow because of that, will see little improvement. For example Acunote's burndown and sprint completion date calculations fall squarely into this category and did not benefit. These aside, Acunote performance was dominated by the cost of memory allocation.

Give me the patches!

Problem Patch
ActiveRecord::Base#create calls attributes_with_quotes twice, ActiveRecord::Base#update once. #attributes_with_quotes calls #attributes, which clones all attribute values. None of these clones are necessary.

Impact:
When copying 120 tasks in Acunote this costs 650M. Patch improves performance from 14s to 6s.
Notes:
Helps when you do a lot of creates and updates through ActiveRecord.
Rails 1.2

module ActiveRecord
    class Base
    private
        def attributes_with_quotes(include_primary_key = true)
            result = {}
            @attributes.each_key do |name|
                if column = column_for_attribute(name)
                    result[name] = quote_value(read_attribute(name), column) unless !include_primary_key && column.primary
                end
            end
            result
        end
    end
end
Rails 2.0

module ActiveRecord
    class Base
    private
        def attributes_with_quotes(include_primary_key = true, include_readonly_attributes = true)
            quoted = {}
            @attributes.each_pair do |name, value|
                if column = column_for_attribute(name)
                    quoted[name] = quote_value(read_attribute(name), column) unless !include_primary_key && column.primary
                end
            end
            include_readonly_attributes ? quoted : remove_readonly_attributes(quoted)
        end
    end
end
Rails surrounds every SQL call and ActionView::Base#render with Benchmark#realtime. Benchmark#realtime allocates unnecessary 45k per call.

Impact:
When copying 120 tasks in Acunote this costs 120M. Patch improves performance from 6s to 5s.
Notes:
Helps when you do read, create, update lots of AR records, or execute any other SQL, or call #render(:partial) in a loop.

module Benchmark
    def realtime
        r0 = Time.now
        yield
        r1 = Time.now
        r1.to_f - r0.to_f
    end
    module_function :realtime
end
Rails 1.2 tries to pretty-print and log every SQL statement even when logger is in Logger::INFO mode (production environment) and prints nothing.

Impact:
When copying 120 tasks in Acunote this costs 17M. Patch improves performance by 1s.
Notes:
Helps when you do read, create, update lots of AR records, or execute any other SQL in a loop.
Patch only applicable to Rails 1.2. Rails 2.0 has the correct log_info implementation.
Rails 1.2

module ActiveRecord
    module ConnectionAdapters
        class AbstractAdapter
        protected
            def log_info_with_level_check(sql, name, runtime)
                return unless @logger and @logger.level == Logger::DEBUG
                log_info_without_level_check(sql, name, runtime)
            end
            alias_method_chain :log_info, :level_check
        end
    end
end
Rails 2.0
No patch necessary.
Comparison of BigDecimal's (used by Rails for Numeric data types) to booleans is slow because it requires unnecessary method_missing call and exception catch.

Impact:
When rendering 120 tasks in Acunote this costs 4M. Patch improves performance by 100-120ms.
Notes:
Helps when you have the imprudence to accidentally compare BigDecimal's with true or false.

class BigDecimal
    alias_method :eq_without_boolean_comparison, :==
    def eq_with_boolean_comparison(other)
        return false if [FalseClass, TrueClass].include? other.class
        eq_without_boolean_comparison(other)
    end
    alias_method :==, :eq_with_boolean_comparison
end
ActionView::Helpers::JavaScriptHelper#link_to_function allocates lots of memory when run. It runs every time view is rendered.

Impact:
When rendering 120 tasks in Acunote this costs 5M. This improves performance by 100-110ms.
Notes:
Helps when you have rich AJAX page with lots of dynamic elements. It's worth elimitinating helpers in hostposts identified by memory profiler. Cumulatively this has huge impact on render-bound requests.
Hand-replace link_to_function with <a href="#" onclick="..."/>.
ActionView::Helpers::FormTagHelper#text_field_tag allocates lots of memory when run. It runs every time view is rendered.

Impact:
When rendering 120 tasks in Acunote this costs 2M. This improves performance by 40-50ms.
Notes:
Helps when you have page with lots of input boxes. It's worth elimitinating helpers in hostposts identified by memory profiler. Cumulatively this has huge impact on render-bound requests.
Hand-replace text_field_tag with <input type="text"/>.
ActionView::Helpers::AssetTagHelper#image_tag allocates lots of memory when run. It runs every time view is rendered.

Impact:
When rendering 120 tasks in Acunote this costs 8M. This improves performance by 110-150ms.
Notes:
Helps when you have page with lots of images. It's worth elimitinating helpers in hostposts identified by memory profiler. Cumulatively this has huge impact on render-bound requests. image_tag itself can be optimized. Help wanted.
If you don't use asset hosts, hand-replace image_tag with <img src="..."/>.
If you use asset hosts, there's no patch.
Passing a block to a method on instance of ActiveRecord::Associations::{HasAndBelongsToManyAssociation /etc.} chews up memory.
An example (from activerecord/lib/activerecord/associations.rb) is association.select { |record| record.new_record? } line inside #add_multiple_associated_save_callbacks method.

Impact:
This is used all over ActiveRecord code, and often by application code.
Notes:
We have been unable to figure out the root cause. Help wanted. Separate article to come.
None known.
Sometimes it helps to convert association to array before using, for example association.to_a.select { |record| record.new_record? }

How can I help?

  • Profile memory allocations for your Rails application. Figure out and fix hotspots. Post or link to patches and notes in the comments section. We'll be updating this post, as you and we figure out more.
  • If you are writing a large commercial Rails application, have one of you engineers do some memory profiling. Fast is a feature, and it's a nice way to contribute to the community.
  • Help us figure out the some of the issues we found, but haven't been able to figure out yet.

Submitted Patches

Thanks To

  • Alexander Dymo for actually doing all the work
  • Alexander Goldstein for directing us to look at GC in the first place
  • DHH and rails-core for giving us something to optimize, and for making Rails code so easy to understand and modify
  • Charlie Savage and Shugo Maeda for ruby-prof
  • Sylvain Joyeux's for original memory allocations number measurement patch
  • Stefan Kaes for all his performance work, railsbench and especially GC patch

Sightings

Updates

  • Update 1: noted that patches are against 1.2.4 Requested patches against 2.0
  • Update 2: added patches against Rails 2.0
  • Update 3: added "submitted patches" section with upstream patches
  • Update 4: #realtime and #attributes_with_quotes patches accepted upstream
  • Update 5: added Sightings section

Buzzletter

  • Hear about our new Ruby on Rails performance improvements, hacks, recipes, plugins & more. Enter your email below