Problem

In our Rails application, we had a process that imported a CSV using ActiveRecord. The process worked just fine for CSVs with fewer than a thousand records. But given a larger CSV, say a hundred thousand records, memory usage would skyrocket and the app would crash. We wanted to optimize the process to handle an arbitrarily large CSV.

Most discussions I could find online about memory issues and CSVs centered on CSV.parse vs CSV.foreach. The data we were transforming happened to come from a CSV, but in this case the source is irrelevant. The following discussion centers on how ActiveRecord collects objects in memory.

Potential Solutions

Resetting the collection:

require "csv"

company = Company.first

CSV.foreach("users.csv", headers: true) do |row| # path is a placeholder
  company.users.create(row.to_h)
  company.users.reset # empty the association's target array
end

Breaking the reference:

company = Company.first

CSV.foreach("users.csv", headers: true) do |row| # path is a placeholder
  User.create(row.to_h.merge(company: company))
end

Things I Learned Along the Way

  • Ruby does not let us manually remove an object from memory.
  • ObjectSpace can be used to see what objects are in memory.
  • GC.start forces a garbage collection run, but it only collects objects that were already eligible for collection.
  • Objects are garbage collected only once nothing in scope references them.
  • An object cannot be collected while another in-scope object still holds a reference to it.
  • Objects built or created through an ActiveRecord association are added to the association's target array.
  • find_each holds the entire current batch in memory, and when it fetches a new batch, the old one is only marked as eligible for garbage collection rather than collected immediately, so after a few batches of a large association it will usually be holding around 3,000 records (see the sketch after this list).
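As a rough sketch of that last point, assuming the models from this post (the batch_size shown is the Rails default, and the touch is a stand-in for real per-record work):

company = Company.first

# find_each loads User records 1,000 at a time (the Rails default).
# While the current batch is being processed, earlier batches are merely
# *eligible* for garbage collection, so a few thousand User objects can
# be live in memory at any given moment.
company.users.find_each(batch_size: 1000) do |user|
  user.touch # stand-in for real per-record work
end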

Discussion

Measuring Tools

It is hard to measure something we can’t see! Memory is one of those things. It is a dark hole of mystery in Ruby and Rails but that is often a huge benefit; we don’t have to spend time thinking about it because it just works. Until it doesn’t.

Lucky for us, Ruby gives us ObjectSpace which, among many other things, can tell us what kinds of objects are in memory.

This next code snippet was the real hero.

ObjectSpace
  .each_object(ApplicationRecord)
  .group_by(&:class)
  .transform_values(&:size)
  .sort_by { |klass, _| klass.to_s.downcase }
  .each { |klass, count| puts "#{klass} => #{count}" }
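One refinement worth considering: ObjectSpace also sees dead objects that have not been swept yet, so calling GC.start first (which, as noted above, only collects objects that are already unreachable) makes the counts reflect what is genuinely being retained:

GC.start # sweep anything already unreachable
live = ObjectSpace.each_object(ApplicationRecord).count
puts "Live ApplicationRecord objects: #{live}"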

My first assumption was that any memory issues would likely arise from long-running loops. So I dropped that snippet into each loop, one at a time, to get a count of every ApplicationRecord object in memory. The output in the terminal looked like this:

Company => 1
CSV => 1
CSVRows => 1000
User => 1

Company => 1
CSV => 1
CSVRows => 1000
User => 2

# Several iterations later

Company => 1
CSV => 1
CSVRows => 1000
User => 457

Company => 1
CSV => 1
CSVRows => 1000
User => 458

We can solve a problem that we can see! Everything else stays static, but the User objects keep growing!

Rails Associations

Objects built or created through a Rails association are added to the target array of the object that owns the association. Because the target array holds a reference to every built or created object, those objects cannot be garbage collected as long as the reference object that owns the target array is in scope. If that is hard to picture, the following code should help.

class Company < ApplicationRecord
  has_many :users
end

company = Company.first # One `Company` object in memory

user1 = company.users.build(name: "Fancy User") # One `User` object in the target array
# company.users.target == [user1]

user2 = company.users.build(name: "Extra Fancy User") # Two `User` objects in the target array
# company.users.target == [user1, user2]

Now multiply the code above by 100,000 and we have a LOT of users in memory. In some cases this is what we want: we just created a user and are about to do something with it, so we do not want it garbage collected.

But what if we only wrote the loop to create 100,000 user records in our database and will not be referencing them again?

Break the Association

The first approach is to break the association on creation.

Instead of:

company = Company.first

user1 = company.users.build(...)
user2 = company.users.build(...)
# company.users.target == [user1, user2]
# Two `User` objects in the target array

We could use:

company = Company.first

user1 = User.new(company: company, ...)
user2 = User.new(company: company, ...)
# company.users.target == []
# Zero `User` objects in the target array

Problem solved, right? Maybe!

If all we are truly doing is creating records to be saved in the database, I would argue that breaking the association is the best approach. In that case there is no need to keep every record in memory.

BUT what if, every time a user is created, the company object needs to change in some way? A contrived example: what if the company keeps track of how many users it is associated with as an attribute on its own model?

Basically, if we are in a loop and the reference object needs to be checked or modified every time something is created or updated, then it is valuable to keep it in memory as a reference object instead of reaching into the database on every iteration.
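A minimal sketch, assuming a hypothetical users_count column on Company (the CSV path is a placeholder as well). Note that this loop still grows the target array; the next section deals with that:

company = Company.first

CSV.foreach("users.csv", headers: true) do |row|
  company.users.create(row.to_h)
  # The in-memory `company` reference lets us bump the counter without
  # re-fetching the Company row on every iteration. `users_count` is a
  # hypothetical column for this example.
  company.increment!(:users_count)
end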

Throw Away the Reference Array (It Rhymes!)

So we have a scenario where we need to keep the reference object around. What if we could empty the target array? We can, with reset!

If we don’t reset the target array, the array will grow to include every object built or created.

company = Company.first

["Foo", "Bar", "Baz", ...].each do |name|
  company.users.create(name: name)
end
# company.users.target == [user, user, user, ...]

But if we reset the target array inside the loop, the references are thrown away after each use and the User objects can be garbage collected.

company = Company.first

["Foo", "Bar", "Baz", ...].each do |name|
  company.users.create(name: name)
  company.users.reset
end
# company.users.target == []
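One trade-off to be aware of: reset also marks the association as not loaded, so the next read of company.users goes back to the database. In this scenario that is fine, since a fresh query beats holding 100,000 records in memory:

company.users.reset
company.users.loaded? # => false
company.users.to_a    # issues a fresh query and repopulates the target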

Conclusion

I did not come away from this research with a clear winning solution. I can see use cases for both approaches and have used both to solve different problems. The clear winner, for me, was a much better understanding of how ActiveRecord manages objects in memory and how to use that knowledge to build more scalable tools.