The ActiveRecord Memory Mystery
Problem
In our Rails application, we had a process that imported a CSV using ActiveRecord. The process worked just fine for CSVs with fewer than a thousand records. But given a larger CSV, say a hundred thousand records, memory would skyrocket and the app would crash. We wanted to optimize the process so it could handle an arbitrarily large CSV.
Most discussions I could find online about memory issues and CSVs centered around CSV.parse vs CSV.foreach. The data we were trying to transform came from a CSV, but in this case the source is irrelevant. The following discussion centers on how ActiveRecord objects are collected.
Potential Solutions
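- Resetting the collection: call reset on the association inside the loop so its target array is emptied and the created objects can be garbage collected.
- Breaking the reference: create the records without going through the association so they are never added to its target array.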
Things I Learned Along the Way
- Ruby does not let us manually remove an object from memory.
- ObjectSpace can be used to see what objects are in memory.
- GC.start will force garbage collection, but only for objects that were already eligible to be collected.
- Objects become eligible for garbage collection only once they leave scope, meaning nothing still references them.
- An object cannot leave scope until every object that references it has also left scope.
- Objects built or created through an ActiveRecord reference are added to the target array.
- find_each holds the entire batch in memory. When it fetches a new batch, the old batch becomes eligible for garbage collection. It is not collected immediately, so after a few batches of large associations it will usually hold around 3k records.
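The points about references and GC.start can be checked in plain Ruby, without Rails. This is a minimal sketch: Row and live_count are hypothetical names used only for illustration. The second count is deliberately described loosely, since Ruby makes no promise about exactly when an unreferenced object is reclaimed.

```ruby
# Plain-Ruby sketch: an object stays in memory while something references it.
# `Row` and `live_count` are hypothetical names used only for illustration.
class Row
  attr_reader :id

  def initialize(id)
    @id = id
  end
end

# Number of instances of klass that ObjectSpace can currently see.
def live_count(klass)
  ObjectSpace.each_object(klass).count
end

rows = Array.new(1_000) { |i| Row.new(i) } # every Row is referenced by `rows`

GC.start # a forced collection cannot remove objects that are still referenced
puts "while referenced: #{live_count(Row)}" # at least 1000

rows = nil # drop the only reference to the rows
GC.start   # the rows are now eligible, so this run is free to reclaim them
puts "after dropping the reference: #{live_count(Row)}" # typically much lower
```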
Discussion
Measuring Tools
It is hard to measure something we can’t see! Memory is one of those things. It is a dark hole of mystery in Ruby and Rails but that is often a huge benefit; we don’t have to spend time thinking about it because it just works. Until it doesn’t.
Lucky for us, Ruby gives us ObjectSpace which, among lots of other things, can tell us what kinds of objects are in memory.
This next code snippet was the real hero.
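A minimal sketch of this kind of counting snippet, written in plain Ruby so it runs outside Rails. StandInRecord, StandInCompany, StandInUser, and live_record_counts are all hypothetical names; in a Rails app the base class passed to the helper would presumably be ApplicationRecord.

```ruby
# Hypothetical stand-ins so the sketch runs without Rails. In a Rails app
# the base class passed to the helper would be ApplicationRecord.
class StandInRecord; end
class StandInCompany < StandInRecord; end
class StandInUser < StandInRecord; end

# Returns a hash of class name => number of live instances for every
# object in memory that descends from base_class.
def live_record_counts(base_class)
  ObjectSpace.each_object(base_class).each_with_object(Hash.new(0)) do |record, counts|
    counts[record.class.name] += 1
  end
end

company = StandInCompany.new
users = Array.new(3) { StandInUser.new }

# A call like this can be dropped into a loop to watch the counts change.
live_record_counts(StandInRecord).sort.each do |name, count|
  puts "#{name}: #{count}"
end
```

Calling the helper on every pass of a loop shows which classes keep accumulating instances and which stay flat.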
My first assumption was that any memory issues would most likely arise in long-running loops. So I put that snippet in every loop, one by one, to get a count of every ApplicationRecord object in memory. The output in the terminal looked like this:
We can solve a problem that we can see! Everything is static but the User objects are growing!
Rails Associations
Objects built or created through a Rails association are added to the target array of the reference object. Because the target array holds a reference to each created or built object, those objects cannot be garbage collected as long as the reference object that owns the target array is in scope. Since that might not make too much sense, the following code should help us understand.
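A minimal plain-Ruby sketch of that behavior. It is not ActiveRecord: Company, User, and UsersAssociation are simplified stand-ins, and the only behavior borrowed from Rails is that anything created through the association is appended to its target array.

```ruby
# Simplified stand-ins, not ActiveRecord classes.
class User
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

# Mimics one behavior of a has_many association: every object created
# through it is appended to `target`.
class UsersAssociation
  attr_reader :target

  def initialize
    @target = []
  end

  def create(name)
    user = User.new(name)
    @target << user # the association now references the new user
    user
  end
end

class Company
  attr_reader :users

  def initialize
    @users = UsersAssociation.new
  end
end

company = Company.new
3.times { |i| company.users.create("user-#{i}") }

# The loop kept no variable pointing at the users, yet all three are still
# reachable through company.users.target, so they cannot be collected.
puts company.users.target.size # prints 3
```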
Now multiply the code above by 100,000 and we have a LOT of users in memory. In some cases this is what we want. We just created a user and we are going to do something with that user so we do not want it garbage collected.
But what if we only wrote the loop to create 100,000 user records in our database and will not be referencing them again?
Break the Association
The first approach is to break the association on creation.
Instead of:
We could use:
Problem solved, right? Maybe!
If truly all we are doing is creating a record to be saved in a database, I would argue that breaking the association is the best approach. In this instance there is no need to keep every record in memory.
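To make the difference concrete, here is a plain-Ruby sketch with the same kind of simplified stand-ins (again, not ActiveRecord). Company#create_user plays the role of creating through the association, while the direct User.new call plays the role of creating the record with only a foreign key. In a Rails app the direct form might be something like User.create(company_id: company.id); that mapping is an assumption made for illustration.

```ruby
# Simplified stand-ins, not ActiveRecord classes.
class User
  attr_reader :name, :company_id

  def initialize(name, company_id)
    @name = name
    @company_id = company_id
  end
end

class Company
  attr_reader :id, :users_target

  def initialize(id)
    @id = id
    @users_target = [] # plays the role of the association's target array
  end

  # Creating through the company keeps a reference to the new user.
  def create_user(name)
    user = User.new(name, id)
    @users_target << user
    user
  end
end

through_association = Company.new(1)
1_000.times { |i| through_association.create_user("user-#{i}") }
puts through_association.users_target.size # prints 1000

direct = Company.new(2)
1_000.times { |i| User.new("user-#{i}", direct.id) } # only the id links them
puts direct.users_target.size # prints 0
```

In the second loop nothing holds on to each user once the iteration ends, so every one of them is eligible for garbage collection.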
BUT what if, every time a user is created, the company object needs to be changed in some way? This is a contrived example, but what if the company keeps track of how many users it is associated with as an attribute on its model?
Basically, if we are in a loop and the reference object needs to be checked or modified every time something is created or updated, then it is valuable to have it as a reference object instead of having to reach into the database every time.
Throw Away the Reference Array (It Rhymes!)
So we have a scenario where we need to keep the reference object around. What if we could empty the target array? We can, with reset!
If we don’t reset the target array, the array will grow to include every object built or created.
But if we reset the target array inside the loop, the reference is thrown away after it is used and the User object can be garbage collected.
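The effect of resetting inside the loop can be sketched in plain Ruby. As before, the classes are simplified stand-ins rather than ActiveRecord, the reset method here only imitates the "throw away the target array" behavior described above, and users_count is a hypothetical attribute standing in for the contrived counter example.

```ruby
# Simplified stand-ins, not ActiveRecord classes.
class User
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

# Mimics two behaviors of a has_many association: `create` appends to
# `target`, and `reset` throws the target array away.
class UsersAssociation
  attr_reader :target

  def initialize
    @target = []
  end

  def create(name)
    user = User.new(name)
    @target << user
    user
  end

  def reset
    @target = [] # the old array and the users in it become unreachable
  end
end

class Company
  attr_reader :users
  attr_accessor :users_count # hypothetical counter kept on the company

  def initialize
    @users = UsersAssociation.new
    @users_count = 0
  end
end

company = Company.new
1_000.times do |i|
  company.users.create("user-#{i}")
  company.users_count += 1 # the company object is still used on every pass
  company.users.reset      # drop the reference to the user just created
end

puts company.users_count       # prints 1000
puts company.users.target.size # prints 0
```

The company stays in scope for the whole loop, but it never holds more than one created user at a time.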
Conclusion
I did not come away from this research with a clear winning solution. I can see use cases for both and I have used both methods to solve different problems. I think, for me, the clear winner was a much better understanding of how ActiveRecord manages objects in memory and how to use that knowledge to build more scalable tools.