1. Importing and Exporting Very Large Data Sets in Rails

    23 September 2008

    A recent project has been me pushing beyond the regularly used ActiveRecord methods in Rails. The goal was to import and export very large CSV files with Rails. The example data set I was given was over 850,000 rows. Uncompressed it came out to 46MBs. So let’s explore the process to do this.

    So let’s say for example I want to import a set of users. I expect the import CSV to have the following format:


    email,first_name,last_name

    The TempUser model will have a validation boolean flag column (is_valid) that defaults to ‘true’. If the record fails validation, we set the flag to false and force the record to save. This way all the data can be dumped into out database and we can go back later and sort out which records are valid, and invalid then promote the records to our actual User table.

    Originally, I wrote two simple methods like this:




    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    def import
    import_file = params[:import_file]
    FasterCSV.parse(import_file.read).each do |row|
    temp_user = TempUser.new(:email => row[0], :first_name => row[1], :last_name => row[2])
    unless temp_user.save
    temp_user.is_valid = false
    temp_user.save(false)
    end
    end
    end

    # borrowed from Scott Becker http://synthesis.sbecker.net/articles/2007/06/07/how-to-generate-csv-files-in-rails
    def export
    temp_user = TempUser.find(:all)

    csv_string = FasterCSV.generate do |csv|
    temp_users.each do |user|
    csv << [user.email, user.first_name, user.last_name]
    end
    end

    send_data csv_string,
    :type => 'text/csv; charset=iso-8859-1; header=present',
    :disposition => "attachment; filename=users.csv"
    end


    And this works just fine for small amounts of data. However, benchmarks showed that both import and export were taking over 20 minutes. (I killed the processes at 20) Long after the user-agent timeout. So, how do we solve this….

    Importing


    There are two culprits here. ActiveRecord and MySQL InnoDB tables. InnoDB is slow, and we’re looking for speed. The tradeoff is that MyISAM is not a transactional database. (this could cause issues with your tests) The decision to convert to InnoDB has to be weighed carefully. However, here we’re going to use it. So first, create a migration to alter the table to the MyISAM table:



    1
    2
    3
    4
    5
    6
    7
    8
    9
    class TempUserTableAlteration < ActiveRecord::Migration
    def self.up
    execute 'ALTER TABLE temp_users ENGINE = MyISAM'
    end

    def self.down
    execute 'ALTER TABLE temp_users ENGINE = InnoDB'
    end
    end

    Now for Active Record. For my example data set of over 850,000 rows it is doing 850,000 individual inserts. Not very fast. After consulting with our DBA I was pointed in the direction of the MySql command LOAD DATA INFILE So what this does is allow MySql to import formatted data into the database. It needs access to a file on the system to do this. We can write raw SQL through ActiveRecord::Base.connection.execute(sql) The problem now is how do we validate the record? Well, LOAD DATA INFILE allows us to dynamically set each column value for each record. The only validation I’m doing in the model is to see if the record has a valid email address. We can simply do a regular expression match in MySQL. If it fails, set the is_valid flag to ‘false’. Let’s see how the new import method looks:



    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    def import
    import_file = params[:import_file]
    temp_path = "#{Rails.root}/tmp/#{import_file.original_filename}"
    File.open(temp_path "w") { |file| file.write import_file.read }
    ActiveRecord::Base.connection.execute(load_data_infile(temp_path))
    system("rm", temp_path) # remove the temp file
    end

    private
    def load_data_infile(temp_path)
    <<-EOF
    LOAD DATA INFILE "
    #{temp_path}"
    INTO TABLE temp_users
    FIELDS TERMINATED BY ','
    (email, first_name, last_name)
    SET is_valid = IF(email REGEXP '^[A-Z0-9._%+-]+@[A-Z0-9.-]+
    \.[A-Z]{2,4}$', TRUE, FALSE),
    created_at = '
    #{Time.current.to_s(:db)}',
    updated_at = '
    #{Time.current.to_s(:db)}';
    EOF

    end


    For my example dataset the entire process now takes about 75 - 80 seconds on my development machine. That is a very nice improvement. The user should no longer time out. Let’s look at the exporting…

    Exporting


    My first thought was if LOAD DATA INFILE worked so well for the import isn’t there an equivalent for exporting? There is… SELECT INTO OUTFILE I’m not going to go into the syntax of the command or show an example because for my needs it didn’t work. SELECT INTO OUTFILE writes to the local machine that the database is on. Our setup does not allow for any NFS or FTP on the machine so this wasn’t going to work for us. Time to hack into ActiveRecord…

    We can use the same ActiveRecord::Base.connection.execute(sql) command from before to create a result set for the CSV:




    ActiveRecord::Base.connection.execute("SELECT t.email, t.first_name, t.last_name FROM temp_users t WHERE t.is_valid = true;")

    This will return a MySQL::Result object. Unfortunately there is no way to access the results as a whole, this is something hardcoded into the MySQL Ruby C library. (Sqlite will allow immediate access to a 2-dimensional array) We can only access the results as a queue with the MySQL::Results#each_hash method.

    Another issue is that MySQL will return each record as a hash. This makes it easy for developers to access columns but for us we want as little data returned as possible, iterate over it quickly and spit out CSV. We would normally have to map the hash to an array, join the proper columns in a CSV format and append onto an output file. Or, we could let MySQL do that work:



    1
    2
    3
    def build_sql_for_csv_output
    "SELECT CONCAT(t.email,',',t.first_name,',',t.last_name,'\\n') as csv FROM temp_users t WHERE t.is_valid = TRUE;"
    end

    So now the export process:



    1
    2
    3
    4
    5
    6
    7
    8
    9
      def export
    temp_path = "#{Rails.root}/tmp/Output-#{Time.now.to_s}.csv"

    result = ActiveRecord::Base.connection.execute(build_sql_for_csv_output)
    File.open(temp_path, "w") { |file| result.each_hash { |r| file.write r["csv"] } }

    send_file temp_path, :type => 'text/csv; charset=iso-8859-1; header=present'
    system("rm", temp_path) # remove the temp file
    end

    This new process takes about 75 seconds. Again, a very nice improvement from before. I hope this helps somebody! Feel free to leave any questions/comments. (I’m sure some syntax errors sneaked in there somewhere)