Sometimes we have a really big .csv file (~100 MB) that we need to import into our Ruby on Rails application's database. We can try to use the built-in CSV class:
```ruby
require 'csv'

def import(file)
  CSV.foreach(file.path) do |row|
    product = Product.from_csv_row(row)
    product.save!
  end
end
```
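For illustration, here is one way `from_csv_row` might look. The column layout (name, price, row id) is an assumption, and the class is shown as a plain Ruby object rather than an ActiveRecord model so the mapping is visible on its own:

```ruby
require 'csv'

# Plain-Ruby stand-in for the ActiveRecord model, just to show the mapping.
class Product
  attr_accessor :name, :price, :csv_row_id

  # Build an (unsaved) Product from one parsed CSV row.
  # Assumed column order: name, price, csv_row_id.
  def self.from_csv_row(row)
    product = new
    product.name       = row[0]
    product.price      = row[1].to_f
    product.csv_row_id = row[2].to_i
    product
  end
end

row = CSV.parse_line('Widget,9.99,42')
product = Product.from_csv_row(row)
puts product.name  # Widget
```

In the real application `Product` would inherit from `ActiveRecord::Base` and the attributes would be backed by columns.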
However, importing a large .csv file this way takes too long. What is more, if some rows are broken you will have to import the whole file again.
So how can we solve this problem?
If your models are not linked, you can simply use the activerecord-import gem, since mass insertion is much faster in this case.
```ruby
require 'csv'

def import(file)
  products = []
  CSV.foreach(file.path) do |row|
    products << Product.from_csv_row(row)
  end
  Product.import products
end
```
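Note that this version holds every record in memory before the single `Product.import` call. For very large files you can import slice by slice instead. A minimal sketch of the slicing pattern, in plain Ruby with an inline string standing in for the file:

```ruby
require 'csv'

BATCH_SIZE = 2  # tiny for illustration; something like 10_000 in a real import

csv_data = "a,1\nb,2\nc,3\nd,4\ne,5\n"
batches  = []

CSV.parse(csv_data).each_slice(BATCH_SIZE) do |rows|
  # In the real import this line would be:
  #   Product.import rows.map { |row| Product.from_csv_row(row) }
  batches << rows.size
end

puts batches.inspect  # [2, 2, 1]
```

Each slice becomes one bulk insert, so memory stays bounded by the batch size.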
If the models are linked, let's try some tricks.
First, I would advise splitting the full import process into chunks. This gives four advantages:
1. You don’t have to wait forever until the .csv is opened.
2. You don’t have to start from the beginning if something goes wrong.
3. You can monitor process status.
4. It is easier to debug.
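Advantage 2 in particular is worth sketching: if you remember the id of the last row you imported, a re-run can skip straight past it. A minimal plain-Ruby illustration with hypothetical data (not the real import code):

```ruby
require 'csv'

csv_data = "1,apple\n2,banana\n3,cherry\n4,plum\n"
last_imported_id = 2  # pretend rows 1 and 2 made it in before a crash

resumed = []
CSV.parse(csv_data).each do |row|
  next if row[0].to_i <= last_imported_id  # skip rows already imported
  resumed << row[1]
end

puts resumed.inspect  # ["cherry", "plum"]
```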
Second, it’s better to keep the last imported id from the .csv (or just the row number) in your database.
```ruby
class AddCsvRowId < ActiveRecord::Migration
  def change
    change_table :products do |t|
      t.integer :csv_row_id
    end
  end
end
```
I have used the SmarterCSV gem, which lets you read .csv files in chunks. Therefore we get:
```ruby
def import(file)
  total_chunks = SmarterCSV.process(file.path, { chunk_size: 10_000 }) do |chunk_products|
    # Skip rows that were already imported during a previous (interrupted) run.
    if Product.last.present?
      chunk_products = chunk_products.select do |product|
        product[:id].to_i > Product.last.csv_row_id
      end
    end

    chunk_products.each do |row|
      next if Product.exists?(csv_row_id: row[:id].to_i)

      product = Product.from_csv_row(row)
      next unless product.valid?

      product.csv_row_id = row[:id].to_i # remember where we are, so we can resume
      product.images << Image.from_csv_row(row)
      product.tags << Tag.from_csv_row(row)
      product.save!
    end
  end

  puts total_chunks.to_s
end
```
And finally, it is better to move the import to a background job.
```ruby
task import_products: :environment do
  ImportProductsJob.perform_later('/path/to/file.csv')
end

class ImportProductsJob < ActiveJob::Base
  rescue_from(StandardError) do
    retry_job wait: 5.minutes, queue: :low_priority
  end

  def perform(*args)
    # File.extname returns the extension with a leading dot.
    import(args[0]) if File.exist?(args[0]) && File.extname(args[0]) == '.csv'
  end

  def import(file)
    # process the file
  end
end
```
Conclusion
So .csv file import is not that difficult if you have some basic tricks up your sleeve and use these links:
1. activerecord-import
2. SmarterCSV
3. ActiveJob