Yury Velikanau's Blog

Adventures with Ruby and Rails

Normalizing Paperclip’s Filenames

If you use Paperclip, you might have noticed that it doesn’t change your file names by default. Some of its interpolations are useful when you don’t need human-readable file names.

If you do want human-readable file names, you can simply use the :basename.:extension or :filename interpolations.

However, there is one thing you should keep in mind. Let’s say someone uploads “foo bar.jpg”. Yes, with a space in its name. Later on, when your application builds a URL for that file, the space will be encoded as %20. When the user’s browser tries to fetch that file, the request could fail because your application or your CDN provider doesn’t handle such names.
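To see the encoding in isolation, here is a minimal sketch using Ruby's standard library. It only illustrates percent-encoding of a file name; it is not how Paperclip itself builds URLs.

```ruby
require "erb"

# Percent-encode a file name the way it would appear inside a URL path.
encoded = ERB::Util.url_encode("foo bar.jpg")

puts encoded
```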

But we should care, because showing images is important for business and we don’t want to build walls of rules around our users.

One possible solution is to normalize file names and store them without any special symbols.

The simplest way is to add your own interpolation using Paperclip’s API. This still keeps the original file name in the database but changes the real file name to whatever you want, running that interpolation every time you build a URL for a file.

We decided that it’s more useful to have already normalized file names in the database, so the way to achieve that is a little bit different.

I started by designing the class that takes care of file name normalization, together with its spec.

require "spec_helper"

describe Assets::Filename do
  it "returns normalized filename" do
    Assets::Filename.normalize("test%file 1.JPG").should eql("test-file-1.jpg")
  end
end
lib/assets/filename.rb
module Assets
  # This class normalizes any string passed into `normalize`.
  # With the help of ActiveSupport's `parameterize`, all special characters
  # that don't conform to the URL standard are replaced by dashes.
  # The string is also downcased.
  #
  # === Example
  #
  #   Filename.normalize("Qwe%%ty 1.jPg")
  #   # => "qwe-ty-1.jpg"
  #
  class Filename
    def self.normalize(name)
      self.new(name).normalize
    end

    def initialize(name)
      @name = name
    end

    def normalize
      "#{file_name}#{ext_name}"
    end

    private

    def file_name
      File.basename(@name, File.extname(@name)).parameterize
    end

    def ext_name
      File.extname(@name).downcase
    end
  end
end

As you can see, it’s simple and uses one of ActiveSupport’s helper methods: parameterize does exactly what I need.
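If you want to try the idea without loading ActiveSupport, the normalization can be approximated in plain Ruby. This is only a rough sketch: simple_parameterize and normalize_filename are hypothetical helpers, and unlike the real parameterize they do no transliteration of accented characters.

```ruby
# Hypothetical stand-in for ActiveSupport's String#parameterize (ASCII only).
def simple_parameterize(string)
  string.downcase
        .gsub(/[^a-z0-9\-_]+/, "-") # replace runs of unsafe characters with a dash
        .gsub(/-{2,}/, "-")         # collapse repeated dashes
        .gsub(/\A-|-\z/, "")        # trim leading and trailing dashes
end

# Normalize the base name and downcase the extension separately.
def normalize_filename(name)
  extension = File.extname(name)
  simple_parameterize(File.basename(name, extension)) + extension.downcase
end

puts normalize_filename("test%file 1.JPG")
```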

Then I needed a way to force file name normalization whenever you save a record that has a Paperclip attachment.

This article and Paperclip’s source code gave me some insight.

require "spec_helper"

describe Assets::Normalizer do
  include ActionController::TestProcess

  class FakeImage < Asset
    has_attached_file :attachment
  end

  it "normalizes filename" do
    FakeImage.any_instance.stubs(:save_attached_files).returns(true)
    Paperclip::Attachment.any_instance.stubs(:post_process).returns(true)

    FakeImage.create(
      attachment: fixture_file_upload('/test%image 1.jpg')
    ).attachment_file_name.should eql("test-image-1.jpg")
  end
end

fixture_file_upload is a Rails helper method that eases testing of file uploads.

Because I don’t want to do any real file upload or processing, I stub out the save_attached_files and post_process methods.

lib/assets/normalizer.rb
module Assets
  module Normalizer
    def self.included(base)
      base.send :before_save, :normalize_filename
    end

    private

    def normalize_filename
      each_attachment do |name, attachment|
        attachment.instance_write(
          :file_name,
          Assets::Filename.normalize(attachment.instance_read(:file_name))
        )
      end
    end
  end
end

Normalizer is a simple module with a bit of Ruby magic. The self.included(base) method lets you configure the behaviour of a class that includes this module. In this particular case I just create a before_save callback that runs the normalization.
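The included hook can be tried out without Rails. In the sketch below, FakeRecord is a hypothetical stand-in for ActiveRecord with a tiny before_save registry, and Shouting and Upload are made-up names; only the self.included mechanism is the point.

```ruby
# Hypothetical base class: a minimal before_save registry, not ActiveRecord.
class FakeRecord
  def self.before_save(method_name)
    callbacks << method_name
  end

  def self.callbacks
    @callbacks ||= []
  end

  # Run every registered callback, then pretend the record was saved.
  def save
    self.class.callbacks.each { |callback| send(callback) }
    true
  end
end

module Shouting
  # Ruby calls this hook with the class that includes the module.
  def self.included(base)
    base.before_save :shout_name
  end

  attr_accessor :name

  private

  def shout_name
    self.name = name.upcase
  end
end

class Upload < FakeRecord
  include Shouting
end

upload = Upload.new
upload.name = "photo"
upload.save
puts upload.name
```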

In the normalize_filename method, each_attachment is a Paperclip method that lets you iterate over all has_attached_file definitions. Using the class above and Paperclip’s API, I change the name of the file so that it gets saved in its normalized version.

All I have to do now is include this module in the Asset class:

class Asset
  include Assets::Normalizer
end

If you use STI as we do, you don’t need to do anything else, because the normalize_filename method and the callback are both inherited. If your attachments live in unrelated classes, include this module in each of them.

Your interpolations remain untouched.

This approach lets you do anything you want with file names, for example randomize them as in the article mentioned above.

Changing Paperclip’s Directory Structure

Paperclip is a well-known gem that adds image upload to your application. Many applications use it, and so do we.

In fact, once you’ve got it working according to your business rules, you can forget about it. So did we, for two years. Our image upload volume used to be low, but it has increased dramatically recently.

We were using a pretty standard way of storing images, like this:

"/assets/images/:id/:style/:basename.:extension"

However, this folder structure leads to an issue that is hard to notice at the beginning: some filesystems limit the number of subdirectories a single directory can hold. So we decided to change the structure in advance, before running into any real problems.

Paperclip actually has a good interpolation for that, however it’s not used by default.

:id_partition is the important piece that keeps your image directories from reaching any limits. Given image ID = 25500, this interpolation creates three nested directories, 000/025/500, for every image, so any one directory holds at most 1000 subdirectories.
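The partitioning itself is easy to reproduce. The following is a minimal sketch of the behaviour described above for integer IDs; the id_partition helper here is my own illustration and is not copied from Paperclip's source.

```ruby
# Zero-pad the ID to nine digits and split it into three-digit directory names.
def id_partition(id)
  ("%09d" % id).scan(/\d{3}/).join("/")
end

puts id_partition(25500)
```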

So I came up with a new directory structure like this:

"/assets/:class/:id_partition/:basename-:style.:extension"

Then I needed to figure out how to migrate tons of existing images to the new directory structure. If you simply change your interpolations, Paperclip starts building image paths according to the new rules, but the existing images stay in the old directory structure; only new uploads go into the correct directories.

You could write something that moves files into the correct places. However, it’s easier and more flexible to use Paperclip itself in your script.

I came up with the script below. It’s huge, but look through it and I’ll explain some of its parts later on.

require "logger"

Thread.abort_on_exception = true

namespace :anything_you_want do
  namespace :assets do
    desc "Migrate assets to new folder structure"
    task :migrate, :workers, :needs => :preload do |t, args|
      args.with_defaults(:workers => 4)

      logger = Logger.new(Rails.root.join('log', 'migration.log'))
      logger.formatter = Logger::Formatter.new

      migration_folder = Rails.root.join('public', 'assets_migration')
      queue = Queue.new

      Dir.entries(migration_folder).each do |entry|
        # Folder is an Asset ID.
        asset_id = File.basename(entry)

        # We're not interested in these folders.
        next if %w(. ..).include?(asset_id)

        unless File.directory?(File.join(migration_folder, entry))
          logger.info("SKIP #{asset_id}: is not a directory.")
          next
        end

        queue.push(asset_id)
      end

      workers = []
      args[:workers].to_i.times do |i|
        workers[i] = Thread.new do
          while !queue.empty? && asset_id = queue.pop
            begin
              asset = Asset.find(asset_id)
            rescue ActiveRecord::RecordNotFound
              logger.warn("SKIP #{asset_id}: couldn't find record!")
              next
            rescue ActiveRecord::SubclassNotFound
              logger.info("CLEANUP #{asset_id}: asset without subclass is an orphaned record.")
              Asset.delete(asset_id)
              next
            rescue => e
              logger.error("FAILED QUEUEING #{asset_id}: unknown error.\nException: #{e.inspect}\nBacktrace: #{e.backtrace}")
              next
            end

            begin
              File.open(image_path(asset), "rb") do |image|
                logger.info("PROCESSING #{asset_id}")
                asset.attachment = image
                asset.save(false)
              end
            rescue Errno::ENOENT
              logger.warn("SKIP #{asset_id}: no such file!")
              next
            rescue => e
              logger.error("FAILED PROCESSING #{asset_id}: unknown error.\nException: #{e.inspect}\nBacktrace: #{e.backtrace}")
              next
            end
          end
        end
      end

      workers.each{ |w| w.join }
      logger.close
    end

    task :preload => :environment do
      $LOAD_PATH.unshift Rails.root.join('app', 'models')
      %w(asset and_other_sti_classes).each do |klass|
        require klass
      end
    end

    # Path to image with old style path.
    def image_path(asset)
      options = { path: ":rails_root/public/assets_migration/:id/:style/:basename.:extension" }
      Paperclip::Attachment.new(:attachment, asset, options).path(:original)
    end
  end
end

The first thing I’d like to point out is the logger library. I wanted to use a separate log file with timestamps.

logger = Logger.new(Rails.root.join('log', 'migration.log'))
logger.formatter = Logger::Formatter.new

You can adjust the time format by setting datetime_format on the Logger::Formatter instance.
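As a small illustration of adjusting the timestamp, the sketch below logs into a StringIO instead of a file so it can be run anywhere. The format string is just an example, and the exact layout of the rest of the line depends on your Ruby version.

```ruby
require "logger"
require "stringio"

output = StringIO.new # stands in for log/migration.log
logger = Logger.new(output)

# Configure the formatter's timestamp before handing it to the logger.
formatter = Logger::Formatter.new
formatter.datetime_format = "%Y-%m-%d %H:%M:%S"
logger.formatter = formatter

logger.info("PROCESSING 42")
puts output.string
```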

The second thing is Thread.abort_on_exception = true. This is important for debugging; otherwise an exception in one thread goes unnoticed and your script fails only at the very end, after waiting for the other threads to finish.

The next thing is to collect the folder names, which are actually the IDs of your assets, into a queue. I use the Queue class here because it’s a safe way to synchronize a queue among threads. The code is very simple: it iterates over all directories and puts their names into the queue.

The next thing is threads. The first version of the script didn’t have threads; when I ran it and calculated the time to complete, I got 18 hours. This was too long, and CPU and memory utilization was very low. So I introduced threads to speed up the process. With four threads the estimated time was 8 hours. Not ideal, but something you can work with. Unfortunately, using more threads caused deadlocks.

Every thread takes an asset ID from the queue, gets the record from the database, opens the file from the directory where I put all existing assets, assigns it to the record and saves it. The rest of the job is done by Paperclip. With my configuration, Paperclip was saving files locally, but you can do something similar and save all files to S3 or whatever storage you use.
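One thing worth knowing about this kind of worker loop: checking queue.empty? and then calling a blocking pop are two separate steps, so another thread can take the last item in between and leave the popping thread waiting forever. Whether that was the cause of the deadlocks mentioned above is only a guess. A minimal sketch of a variant that uses a non-blocking pop instead (the numbers and the processed queue are placeholders for the real work):

```ruby
require "thread"

queue = Queue.new
(1..20).each { |asset_id| queue.push(asset_id) }

processed = Queue.new # thread-safe collector standing in for the real work

workers = Array.new(4) do
  Thread.new do
    loop do
      begin
        # Non-blocking pop raises ThreadError when the queue is empty.
        asset_id = queue.pop(true)
      rescue ThreadError
        break
      end
      processed.push(asset_id)
    end
  end
end

workers.each(&:join)
puts processed.size
```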

There are a few things you might be interested in. After moving things around many times we ended up with some inconsistencies between the database and the actual files, so those rescue clauses are a way to deal with them.

rescue ActiveRecord::RecordNotFound skips any files that don’t have records in the database.

rescue ActiveRecord::SubclassNotFound skips any records whose STI class is not defined and deletes them as orphans.

To get the correct file path with less effort, I build an attachment object in memory, passing different interpolations for the path, and use it to locate the old file. Assigning that file to the record’s attachment makes Paperclip happily process it and save everything to the new directory structure.

def image_path(asset)
  options = { path: ":rails_root/public/assets_migration/:id/:style/:basename.:extension" }
  Paperclip::Attachment.new(:attachment, asset, options).path(:original)
end

And the last piece is the weird preload task. For some reason the first thread couldn’t find any of the defined classes, so I preload them before running the migration.

It looks a bit ugly, but it’s good enough for a one-time migration from one directory structure to another. Moreover, by using Paperclip you stay flexible enough to upload files to S3, use a different processor, or… you name it.

Import Images From AWS S3 With Paperclip

I had a task to create a bulk import of images from AWS S3 and attach them to existing records. In our case Paperclip doesn’t use S3 storage, so we use S3 only to import images uploaded by photographers.

Doing any kind of manipulation with files and Paperclip is pretty easy. I use the official AWS SDK to get images from S3, read them into a temporary file and let Paperclip do the rest of the job for me.

This is a simple example of how to attach an S3 file to one of your existing records:

# variables
#   file   - S3 object
#   record - AR record with Paperclip's has_attached_file
file_name = File.basename(file.key)
# Split into [basename, extension]; the dot is escaped so that only a real
# extension is matched.
temp_file = Tempfile.new(file_name.split(/(\.\w+)$/))

begin
  temp_file.binmode
  temp_file.write file.read
  record.images.create(:attachment => temp_file)
rescue => e
  # do your own error handling here
ensure # ensure we don't keep dead links
  temp_file.close
  temp_file.unlink
end

With this we read the contents of a file from S3 into a temporary file on our server and then let Paperclip use that file to create images in all sizes and put them into the storage we use.

Easy! However, this code has one small issue that was very important for my task: the file name. We want to retain the original file name.

Let’s say I’m importing unicorn.jpg from S3 and expect the final file name to be unicorn.jpg. It isn’t, because of the way Tempfile works:

1.9.2 (main):0 > Tempfile.new(["unicorn", ".jpg"])
=> #<File:/var/folders/2w/9y_pr4vx17s5p4cjzkzs5pnw0000gn/T/unicorn20110921-99183-94s9ha.jpg>

As you can see, Tempfile changes the name of the file to keep it unique, and Paperclip of course uses this file name.

After a quick dig into Paperclip’s internals you may find that if the File object responds to the original_filename method, Paperclip uses the name provided by that method.

Let’s create that method:

class ImageTempfile < Tempfile
  attr_reader :original_filename

  def initialize(file_name)
    @original_filename = file_name
    # Escaped dot: split only on a real extension such as ".jpg".
    super(file_name.split(/(\.\w+)$/))
  end
end

and change the code that does the import to use our new class:

temp_file = ImageTempfile.new(file_name)

Paperclip now uses our pretty file names instead of those that Tempfile gives you. That’s it.
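The same idea can be checked without Paperclip or S3. The sketch below is a variation that uses File.extname instead of a regular expression; NamedTempfile is a hypothetical name, and the example only shows that the generated path differs from the remembered original name.

```ruby
require "tempfile"

# Hypothetical variant: remembers the original name and keeps the extension.
class NamedTempfile < Tempfile
  attr_reader :original_filename

  def initialize(file_name)
    @original_filename = file_name
    extension = File.extname(file_name)
    super([File.basename(file_name, extension), extension])
  end
end

temp_file = NamedTempfile.new("unicorn.jpg")
begin
  # Tempfile inserts a unique token between the base name and the extension.
  generated_name = File.basename(temp_file.path)
  puts generated_name
  puts temp_file.original_filename
ensure
  temp_file.close
  temp_file.unlink
end
```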

Rails 2, MySQL 2, and Stored Procedures

The Rails community doesn’t like moving business logic into the database, but in some cases stored procedures are very helpful and many people try to use them in Rails. However, it’s not as easy as you might imagine.

Running ActiveRecord::Base.connection.execute("CALL proc01") will give you a bunch of errors in different cases.

Let’s say your procedure returns a result set. Running that procedure will give you an exception:

ActiveRecord::StatementInvalid: Mysql2::Error: PROCEDURE can't return a result set in the given context

If your procedure doesn’t return a result set, running it twice will give you another exception:

ActiveRecord::StatementInvalid: Mysql2::Error: Commands out of sync; you can't run this command now

In other cases, when the stored procedure doesn’t return any result set at all, you’ll get a NoMethodError.

All these issues are well known; however, they haven’t been fixed yet, even in Rails 3.

Let’s look at the first issue. When MySQL runs a stored procedure, it has to know that the client can handle multiple result sets. By default MySQL assumes that the client cannot, unless you set the CLIENT_MULTI_RESULTS flag when establishing the connection to the MySQL server. It’s no surprise that neither Rails nor MySQL2 does this, because most projects don’t need multiple result sets. In the future we’ll probably have an option to configure this, but until then let’s create a workaround.

We use MySQL2. Its latest gem release, 0.2.6, is kind of outdated for Ruby 1.9.2, so we forked the edge version at a stable point. MySQL2 defines its own MySQL adapter for Rails in lib/active_record/connection_adapters/mysql2_adapter.rb. We’re interested in the method that creates the connection object:

lib/active_record/connection_adapters/mysql2_adapter.rb
def self.mysql2_connection(config)
  config[:username] = 'root' if config[:username].nil?

  if Mysql2::Client.const_defined? :FOUND_ROWS
    config[:flags] = Mysql2::Client::FOUND_ROWS
  end

  client = Mysql2::Client.new(config.symbolize_keys)
  options = [config[:host], config[:username], config[:password], config[:database], config[:port], config[:socket], 0]
  ConnectionAdapters::Mysql2Adapter.new(client, logger, options, config)
end

This looks like a good place to put our additional flag for MySQL, but wait! There are other flags here already, so let’s just re-use this and let the adapter pass ours along.

Create a file in config/initializers with the following content:

module ActiveRecord
  class Base
    # Overriding ActiveRecord::Base.mysql2_connection
    # method to allow passing options from database.yml
    #
    # Example of database.yml
    #
    #   login: &login
    #     socket: /tmp/mysql.sock
    #     adapter: mysql2
    #     host: localhost
    #     encoding: utf8
    #     flags: 131072
    #
    # @param [Hash] config hash that you define in your
    #   database.yml
    # @return [Mysql2Adapter] new MySQL adapter object
    #
    def self.mysql2_connection(config)
      config[:username] = 'root' if config[:username].nil?

      if Mysql2::Client.const_defined? :FOUND_ROWS
        config[:flags] = config[:flags] ? config[:flags] | Mysql2::Client::FOUND_ROWS : Mysql2::Client::FOUND_ROWS
      end

      client = Mysql2::Client.new(config.symbolize_keys)
      options = [config[:host], config[:username], config[:password], config[:database], config[:port], config[:socket], 0]
      ConnectionAdapters::Mysql2Adapter.new(client, logger, options, config)
    end
  end
end

So now you can pass any additional flags from your database.yml. See that 131072? This is the value of the CLIENT_MULTI_RESULTS constant. It’s not so clear, because you have to know those magic numbers, but it’s OK for a start.

If you want to pass more flags, remember that you must combine them with the bitwise OR operator, so in database.yml it will be:

database.yml
flags: <%= 65536 | 131072 %>

where 65536 is the value of the CLIENT_MULTI_STATEMENTS constant. BTW, enabling CLIENT_MULTI_STATEMENTS alone automatically enables CLIENT_MULTI_RESULTS.
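To make the bit arithmetic concrete, here is a tiny sketch. The constants are defined locally with the values quoted above purely for illustration; they are not read from the mysql2 gem.

```ruby
# Values as quoted in the text; defined here only for the example.
CLIENT_MULTI_STATEMENTS = 65_536   # 1 << 16
CLIENT_MULTI_RESULTS    = 131_072  # 1 << 17

# Bitwise OR sets both bits in a single integer.
flags = CLIENT_MULTI_STATEMENTS | CLIENT_MULTI_RESULTS

# Bitwise AND tests whether a particular flag is set.
multi_results_enabled = (flags & CLIENT_MULTI_RESULTS) != 0

puts flags
puts multi_results_enabled
```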

RSpec’s 1.3.1 Exit Codes

As you know, all CI tools rely on the command’s exit status: if it’s 0, your build is OK; if it’s anything greater than 0, your build has failed.

You can easily see this by running any command and, when it’s finished, running

echo $?

which tells you the last exit code.
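The same status is visible from Ruby, which is roughly what a CI tool looks at after running your build command. A minimal sketch, using a child Ruby process that exits with an arbitrary code of 3:

```ruby
require "rbconfig"

# Run a child Ruby process that exits with status 3.
system(RbConfig.ruby, "-e", "exit 3")

# $? holds the Process::Status of the last child process.
status = $?.exitstatus
puts status
```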

While configuring the CI server I found that all our builds were passing, regardless of the fact that we had some failing specs. Any combination of specs, whether passing or failing, returned exit code 0.

I don’t know how I hadn’t noticed this issue before. At first I blamed our own code, and with the help of a colleague we began digging into it.

There aren’t too many ways to override the exit code. As the Ruby documentation says, you can intercept the SystemExit exception or define your own finalizers with at_exit and ObjectSpace.define_finalizer.

However, we didn’t find anything like that in the project, so my next victim was RSpec, and I found that Runner calls at_exit:

def autorun # :nodoc:
  at_exit {exit run unless $!}
end
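For reference, this is how an at_exit handler is expected to influence the exit status on a Ruby without the bug. The sketch runs two child processes: one where the handler calls exit, and one where an exception is already pending so the handler stays out of the way. The codes 2 and "boom" are arbitrary.

```ruby
require "rbconfig"

# The handler calls exit, so its argument becomes the process status.
system(RbConfig.ruby, "-e", 'at_exit { exit 2 }')
handler_status = $?.exitstatus

# With a pending exception ($! is set) the handler does nothing,
# and Ruby reports the uncaught exception with status 1.
system(RbConfig.ruby, "-e", 'at_exit { exit 2 unless $! }; raise "boom"', err: File::NULL)
error_status = $?.exitstatus

puts handler_status
puts error_status
```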

Trying to find a way to fix this, I went to the RSpec issues just to see what’s new, and found a month-old issue describing the same problem along with a possible solution.

I applied that to RSpec, ran its specs and everything looks good so far. So if you’re struggling without proper exit codes, you can create a monkey patch and put it in your spec/support folder:

module Spec
  module Runner
    # This monkey patch applies a fix that forces RSpec to return
    # proper exit codes.
    # https://github.com/dchelimsky/rspec/issues#issue/12
    def self.autorun
      at_exit {
        next if $!
        at_exit {exit run}
      }
    end
  end
end

So now you’ll have normal exit codes!

UPDATE: This issue has been fixed in Ruby 1.9.2-p180.

Bundler in Subshells

A few weeks ago I was trying to make Integrity work with Ruby 1.9.2 and Bundler. It’s a well-known CI tool, but kind of abandoned. I thought making it work with Ruby 1.9.2 could be tough, but the real problem was Bundler.

Integrity uses Bundler to manage its dependencies. When Integrity runs a build, it opens a new subshell in which your project is built.

Here is how Integrity does it:

def run(command)
  cmd = normalize(command)

  @logger.debug(cmd)

  output = ""
  IO.popen(cmd, "r") { |io| output = io.read }

  Result.new($?.success?, output.chomp)
end

The issue arises when your project uses Bundler too. Whose doesn’t? In this case your project tries to use Integrity’s Gemfile, which is not what you want. Integrity should use its own Gemfile and your project should use its own.

This happens because Bundler changes your environment to do what it does. It sets the BUNDLE_GEMFILE variable, which points to Integrity’s Gemfile. Even when Integrity runs your project in a subshell this variable is still there, because a subshell inherits its parent’s environment.
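The inheritance is easy to observe. In the sketch below, DEMO_GEMFILE is a made-up variable used in place of BUNDLE_GEMFILE so that running it doesn't interfere with a real Bundler setup:

```ruby
require "rbconfig"

# Set a variable in the parent process (hypothetical name and path).
ENV["DEMO_GEMFILE"] = "/path/to/parent/Gemfile"

# The child process started by IO.popen sees the same variable.
child_value = IO.popen([RbConfig.ruby, "-e", "print ENV['DEMO_GEMFILE']"], &:read)

puts child_value
```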

Looking for a solution on the web, you can find recommendations to use the Bundler.with_clean_env method; I guess this was a working solution for old Bundler versions. With modern versions it doesn’t help, because the method doesn’t clean up the BUNDLE_GEMFILE variable. Moreover, Bundler sets and doesn’t clean up two more variables: RUBYOPT and BUNDLE_BIN_PATH. As long as these variables are present in the subshell, you’ll keep using Integrity’s gems.

To avoid this I went almost the same way as with_clean_env does: remove those three variables from the current environment, run the command, and restore the original environment when the command has finished.

BUNDLER_VARS = %w(BUNDLE_GEMFILE RUBYOPT BUNDLE_BIN_PATH)

def with_clean_env
  bundled_env = ENV.to_hash
  BUNDLER_VARS.each{ |var| ENV.delete(var) }
  yield
ensure
  ENV.replace(bundled_env.to_hash)
end

So now the run method should look like this:

def run(command)
  cmd = normalize(command)

  @logger.debug(cmd)

  output = ""
  with_clean_env do
    IO.popen(cmd, "r") { |io| output = io.read }
  end

  Result.new($?.success?, output.chomp)
end

and your build will be using its own Gemfile.
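An alternative worth considering, which is not what Integrity does, is to leave the parent's ENV alone and pass an environment hash to IO.popen; a nil value unsets that variable in the child only. A minimal sketch with a made-up DEMO_BUNDLE_GEMFILE variable standing in for the Bundler ones:

```ruby
require "rbconfig"

# Hypothetical variable standing in for BUNDLE_GEMFILE and friends.
ENV["DEMO_BUNDLE_GEMFILE"] = "/ci/Gemfile"

# A nil value tells IO.popen to unset the variable for the child process.
clean_env = { "DEMO_BUNDLE_GEMFILE" => nil }
script = "print ENV.key?('DEMO_BUNDLE_GEMFILE')"

child_sees_var = IO.popen([clean_env, RbConfig.ruby, "-e", script], &:read)

puts child_sees_var
puts ENV["DEMO_BUNDLE_GEMFILE"]
```

Because the parent's environment is never modified, there is nothing to restore afterwards, which also avoids surprises if several threads run commands at once.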

This made Integrity work with Bundler. If you face something similar in your own project that uses subshells, you can apply the same trick.

The good news is that with Bundler 1.1 we’ll probably have a new method that really cleans the environment of Bundler’s variables.

As for Integrity, there are still some issues related to Bundler and RVM, but it works in simple cases.

If you want to help revive Integrity, please don’t hesitate to test it and report bugs, or help with development :)

My fork of Integrity lives here.

Git GUI for MacOS

Do you use any Git GUI tool? I do. Although the git command line has everything I need for comfortable work, I still prefer GUI tools. Linux users have Gitk, a pretty ugly but powerful tool; luckily, there is a clone of this tool for MacOS: Gitx.

Gitx was released in 2008 as a fairly simple but promising clone of Gitk; however, a year later its development stopped. That’s not bad news, though, because it has a network of more than 100 forks, and one of them is a far more advanced (experimental) version of Gitx.

Here are just some of its features:

  • fetch
  • pull
  • push
  • add remote
  • merge
  • cherry-pick
  • rebase
  • clone

And screenshot:

Get the latest stable version here: http://github.com/brotherbard/gitx/downloads or compile your own edge version by cloning the repo and building it in Xcode.

Don’t Generate Rdoc and Ri Documentation for Gems Using RVM

Personally, I don’t like generating rdoc and ri documentation when using RVM, because it slows down gem installation and I don’t need documentation for the same gem under different versions of Ruby.

I can always disable this manually (which is not convenient)

gem install some_gem --no-rdoc --no-ri

or do the same by adding

:gem: --no-ri --no-rdoc

into my ~/.gemrc. However, that doesn’t help when I work under RVM. Luckily, the gem command checks for global settings in the /etc/gemrc file, so adding

gem: --no-ri --no-rdoc

into /etc/gemrc helps to solve this problem. Now every gem installation under any Ruby version will use --no-rdoc --no-ri options.

Rails Edge: ActiveSupport::Concern

Watching Rails 3 Edge commits, I noticed an addition to ActiveSupport: ActiveSupport::Concern, made by Neeraj Singh. In fact, his commit is just a piece of documentation; ActiveSupport::Concern itself was added by Joshua Peek about a year ago. Shame on me for not knowing this.

What does it do? It’s a nice extension of Module that lets you add instance or class methods to a class, call its methods, etc.

The old Ruby way of doing this (and you still have to follow this way in pure Ruby):

module M
  def self.included(base)
    base.send(:extend, ClassMethods)
    base.send(:include, InstanceMethods)
    # scope has to be called on the including class, not on the module itself
    base.class_eval do
      scope :foo, :conditions => {:created_at => nil}
    end
  end

  module ClassMethods
    def cm; puts 'I am class method'; end
  end

  module InstanceMethods
    def im; puts 'I am instance method'; end
  end
end

Using the new Rails Edge way, the module above can be rewritten as follows:

module M
  extend ActiveSupport::Concern

  included do
    scope :foo, :conditions => { :created_at => nil }
  end

  module ClassMethods
    def cm; puts 'I am class method'; end
  end

  module InstanceMethods
    def im; puts 'I am instance method'; end
  end
end

Let’s say we extend ActiveRecord::Base with module M:

ActiveSupport.on_load(:active_record) do
  include M
end

…and as a result every ActiveRecord::Base class will have the class method cm, the instance method im and the scope foo.
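To get a feel for what such an extension has to do, here is a rough pure-Ruby imitation. MiniConcern, Greeter and Person are hypothetical names, and this is a simplified sketch of the idea, not the actual ActiveSupport::Concern implementation (which also handles dependencies between concerns).

```ruby
# Hypothetical, simplified imitation of the Concern idea.
module MiniConcern
  # `included do ... end` inside a concern stores the block;
  # Ruby's own hook call (with a base class) falls through to Module#included.
  def included(base = nil, &block)
    if base.nil?
      @_included_block = block
    else
      super
    end
  end

  # Called by Ruby when the concern is included into a class.
  def append_features(base)
    super
    base.extend(const_get(:ClassMethods)) if const_defined?(:ClassMethods, false)
    base.class_eval(&@_included_block) if @_included_block
  end
end

module Greeter
  extend MiniConcern

  included do
    attr_accessor :name # evaluated in the context of the including class
  end

  module ClassMethods
    def cm; "I am class method"; end
  end

  def im; "I am instance method"; end
end

class Person
  include Greeter
end

puts Person.cm
puts Person.new.im
```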

Examples were taken from Rails documentation.

Check out source of ActiveSupport::Concern.

Ruby 1.9 Can Help Quickly Find Syntax Errors

Working on a Rails project, I got an error that every Ruby developer knows: syntax error, unexpected $end, expecting keyword_end. Usually it means that someone left out an end keyword somewhere in the code. I quickly went through the code and found no sign of that, but Ruby kept pointing at the end of a hundred-line file.

spec:~ spectator$ rake db:migrate --trace
lib/finder.rb:118: syntax error, unexpected $end, expecting keyword_end

Luckily, Ruby 1.9 has the -w option to turn warnings on for your script, so I did

spec:~ spectator$ ruby -wc lib/finder.rb
lib/finder.rb:118: warning: mismatched indentations at 'end' with 'def' at 32
lib/finder.rb:118: syntax error, unexpected $end, expecting keyword_end

and that’s it! Now I knew exactly where to look for the error and quickly found it. Someone had left out a dot at the end of a very long method chain.
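The same check can be scripted, for example to run over many files. The sketch below writes a deliberately broken snippet (a missing end, not the author's actual file) to a temporary file and runs ruby -wc on it; the exact wording of the messages depends on the Ruby version.

```ruby
require "open3"
require "rbconfig"
require "tempfile"

# A made-up snippet with one `end` missing.
broken_source = <<~RUBY
  class Finder
    def find
      [1, 2, 3].map { |n| n * 2 }
  end
RUBY

check_output = nil
check_status = nil

Tempfile.create(["finder", ".rb"]) do |file|
  file.write(broken_source)
  file.flush
  # -w enables warnings, -c only checks the syntax without running the file.
  check_output, check_status = Open3.capture2e(RbConfig.ruby, "-wc", file.path)
end

puts check_output
puts check_status.success?
```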