-
Importing and Exporting Very Large Data Sets in Rails
23 September 2008A recent project has been me pushing beyond the regularly used ActiveRecord methods in Rails. The goal was to import and export very large CSV files with Rails. The example data set I was given was over 850,000 rows. Uncompressed it came out to 46MBs. So let’s explore the process to do this.
So let’s say for example I want to import a set of users. I expect the import CSV to have the following format:
email,first_name,last_name
The TempUser model will have a validation boolean flag column (is_valid) that defaults to ‘true’. If the record fails validation, we set the flag to false and force the record to save. This way all the data can be dumped into out database and we can go back later and sort out which records are valid, and invalid then promote the records to our actual User table.
Originally, I wrote two simple methods like this:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25def import
import_file = params[:import_file]
FasterCSV.parse(import_file.read).each do |row|
temp_user = TempUser.new(:email => row[0], :first_name => row[1], :last_name => row[2])
unless temp_user.save
temp_user.is_valid = false
temp_user.save(false)
end
end
end
# borrowed from Scott Becker http://synthesis.sbecker.net/articles/2007/06/07/how-to-generate-csv-files-in-rails
def export
temp_user = TempUser.find(:all)
csv_string = FasterCSV.generate do |csv|
temp_users.each do |user|
csv << [user.email, user.first_name, user.last_name]
end
end
send_data csv_string,
:type => 'text/csv; charset=iso-8859-1; header=present',
:disposition => "attachment; filename=users.csv"
end
And this works just fine for small amounts of data. However, benchmarks showed that both import and export were taking over 20 minutes. (I killed the processes at 20) Long after the user-agent timeout. So, how do we solve this….Importing
There are two culprits here. ActiveRecord and MySQL InnoDB tables. InnoDB is slow, and we’re looking for speed. The tradeoff is that MyISAM is not a transactional database. (this could cause issues with your tests) The decision to convert to InnoDB has to be weighed carefully. However, here we’re going to use it. So first, create a migration to alter the table to the MyISAM table:1
2
3
4
5
6
7
8
9class TempUserTableAlteration < ActiveRecord::Migration
def self.up
execute 'ALTER TABLE temp_users ENGINE = MyISAM'
end
def self.down
execute 'ALTER TABLE temp_users ENGINE = InnoDB'
end
end
Now for Active Record. For my example data set of over 850,000 rows it is doing 850,000 individual inserts. Not very fast. After consulting with our DBA I was pointed in the direction of the MySql command LOAD DATA INFILE So what this does is allow MySql to import formatted data into the database. It needs access to a file on the system to do this. We can write raw SQL through ActiveRecord::Base.connection.execute(sql) The problem now is how do we validate the record? Well, LOAD DATA INFILE allows us to dynamically set each column value for each record. The only validation I’m doing in the model is to see if the record has a valid email address. We can simply do a regular expression match in MySQL. If it fails, set the is_valid flag to ‘false’. Let’s see how the new import method looks:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20def import
import_file = params[:import_file]
temp_path = "#{Rails.root}/tmp/#{import_file.original_filename}"
File.open(temp_path "w") { |file| file.write import_file.read }
ActiveRecord::Base.connection.execute(load_data_infile(temp_path))
system("rm", temp_path) # remove the temp file
end
private
def load_data_infile(temp_path)
<<-EOF
LOAD DATA INFILE "#{temp_path}"
INTO TABLE temp_users
FIELDS TERMINATED BY ','
(email, first_name, last_name)
SET is_valid = IF(email REGEXP '^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$', TRUE, FALSE),
created_at = '#{Time.current.to_s(:db)}',
updated_at = '#{Time.current.to_s(:db)}';
EOF
end
For my example dataset the entire process now takes about 75 - 80 seconds on my development machine. That is a very nice improvement. The user should no longer time out. Let’s look at the exporting…Exporting
My first thought was if LOAD DATA INFILE worked so well for the import isn’t there an equivalent for exporting? There is… SELECT INTO OUTFILE I’m not going to go into the syntax of the command or show an example because for my needs it didn’t work. SELECT INTO OUTFILE writes to the local machine that the database is on. Our setup does not allow for any NFS or FTP on the machine so this wasn’t going to work for us. Time to hack into ActiveRecord…
We can use the same ActiveRecord::Base.connection.execute(sql) command from before to create a result set for the CSV:ActiveRecord::Base.connection.execute("SELECT t.email, t.first_name, t.last_name FROM temp_users t WHERE t.is_valid = true;")
This will return a MySQL::Result object. Unfortunately there is no way to access the results as a whole, this is something hardcoded into the MySQL Ruby C library. (Sqlite will allow immediate access to a 2-dimensional array) We can only access the results as a queue with the MySQL::Results#each_hash method.
Another issue is that MySQL will return each record as a hash. This makes it easy for developers to access columns but for us we want as little data returned as possible, iterate over it quickly and spit out CSV. We would normally have to map the hash to an array, join the proper columns in a CSV format and append onto an output file. Or, we could let MySQL do that work:1
2
3def build_sql_for_csv_output
"SELECT CONCAT(t.email,',',t.first_name,',',t.last_name,'\\n') as csv FROM temp_users t WHERE t.is_valid = TRUE;"
end
So now the export process:1
2
3
4
5
6
7
8
9def export
temp_path = "#{Rails.root}/tmp/Output-#{Time.now.to_s}.csv"
result = ActiveRecord::Base.connection.execute(build_sql_for_csv_output)
File.open(temp_path, "w") { |file| result.each_hash { |r| file.write r["csv"] } }
send_file temp_path, :type => 'text/csv; charset=iso-8859-1; header=present'
system("rm", temp_path) # remove the temp file
end
This new process takes about 75 seconds. Again, a very nice improvement from before. I hope this helps somebody! Feel free to leave any questions/comments. (I’m sure some syntax errors sneaked in there somewhere)