The Invisible B.O.M.

Error:

CSV::MalformedCSVError: Illegal quoting in line 1

Possible Solutions:

# If you can pass the BOM encoding as an argument
CSV.foreach(filepath, headers: true, encoding: "bom|utf-8") do |row|
  ...
end

# You can also let File.read strip the BOM, then parse the resulting string
contents_without_bom = File.read(filepath, encoding: "bom|utf-8")
CSV.parse(contents_without_bom, headers: true, encoding: "utf-8") do |row|
  ...
end

# If it turns out not to be a BOM issue
CSV.foreach(filepath, headers: true, encoding: "utf-8", liberal_parsing: true) do |row|
  ...
end

# Or pick a quote character that never appears in the data, so stray double quotes are read literally
CSV.foreach(filepath, headers: true, encoding: "utf-8", quote_char: "|") do |row|
  ...
end

So what is going on?

After researching all the things, I learned that MalformedCSVError basically shows up when there are extra quotation marks in the CSV and the parser does not know what to do with them. I imagine there are other reasons a CSV could be malformed, but I got the error because the CSV I was trying to parse was encoded with a smidge of BOM.

I am not going to go into what a BOM (byte order mark) is, mostly because even after reading about it I still do not really understand, but if you are interested you can read more about it here. I will say that if you get the error on any line other than the first, it is highly unlikely to have anything to do with the BOM.

Now what?

The tricky thing about a BOM is that it is a character with no visual representation. In other words, the computer recognizes that there is a character but the text editor has no way of displaying it. So how do you see something that has no character representation? Ask to see its bytes. But I am getting slightly ahead of myself. How did I even know there might be a character I could not see to begin with?

For starters, my CSV was malformed on the first line, which happened to be the header row. I could open the CSV in Numbers just fine and there were no obviously weird characters, so Numbers knew what to do with the document. I could open the CSV in my text editor and there was nothing out of the ordinary there either.

I also tried the liberal parsing example from above and using it was the first time I stopped getting an error. Now I could actually see output from the file and inspect its contents. So I printed the first line to the console and got something that looked like the following:

"\"Date\"","Name","Street Address","City","State","Zip","Phone"

This was my first indication that something was wrong that I could not see, but I needed to dig a little deeper. I needed to see all the individual characters of that first element, "\"Date\"".

File.readlines("thefile.csv").first.chars
# ["", "\"", "D", "a", "t", "e", ...]

Huh. Why is that "" the first element of the array?

File.readlines("thefile.csv").first.chars.first.bytes
# [239, 187, 191]

Well, would you look at that. A quick Google search of those numbers revealed that they are the UTF-8 byte order mark.

[239, 187, 191].pack('C*')
# "\xEF\xBB\xBF"
"\xEF\xBB\xBF"
# ""
_.bytes
# [239, 187, 191]
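
If you would rather not eyeball arrays of characters, a check along these lines could tell you whether a file starts with those three bytes. This is a minimal sketch and not something from my original debugging: starts_with_utf8_bom?, write_temp_csv and UTF8_BOM_BYTES are names I made up, and the temp files exist only so the example runs on its own.

```ruby
require "tempfile"

# The three bytes found above: 239, 187, 191.
UTF8_BOM_BYTES = [0xEF, 0xBB, 0xBF].freeze

# Hypothetical helper: true when the file starts with the UTF-8 BOM.
def starts_with_utf8_bom?(path)
  # Read only the first three raw bytes; to_s guards against a nil
  # result on an empty file.
  File.binread(path, 3).to_s.bytes == UTF8_BOM_BYTES
end

# Hypothetical helper: writes a throwaway file for the demo below.
def write_temp_csv(content)
  file = Tempfile.new(["sample", ".csv"])
  file.binmode
  file.write(content)
  file.close
  file
end

with_bom = write_temp_csv("\xEF\xBB\xBF\"Date\",\"Name\"\n")
without_bom = write_temp_csv("\"Date\",\"Name\"\n")

puts starts_with_utf8_bom?(with_bom.path)     # true
puts starts_with_utf8_bom?(without_bom.path)  # false
```

Reading in binary mode matters here, since it sidesteps any encoding handling and shows the bytes exactly as they sit on disk.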

Now, why the Ruby CSV parser does not immediately recognize the BOM when the encoding is explicitly given makes all of zero sense to me. As far as I can tell, "utf-8" only tells Ruby how to decode the bytes, and it takes the "bom|utf-8" form to actually strip the mark. But there it was, clear as day. I removed the BOM before sending the file to be parsed and everything from there worked exactly as expected.
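
For the curious, removing the BOM by hand can be as small as the sketch below. It assumes the text has already been read as UTF-8, so the BOM shows up as the single character U+FEFF. parse_csv_without_bom and BOM are hypothetical names for illustration, and the sample data is made up.

```ruby
require "csv"

# U+FEFF, which UTF-8 encodes as the bytes EF BB BF.
BOM = "\uFEFF"

# Hypothetical helper: drop a leading BOM (if present), then parse.
def parse_csv_without_bom(text)
  CSV.parse(text.delete_prefix(BOM), headers: true)
end

# Made-up sample standing in for the contents of the real file.
raw = "\uFEFF\"Date\",\"Name\"\n\"2020-01-01\",\"Ada\"\n"

table = parse_csv_without_bom(raw)
puts table.headers.inspect  # ["Date", "Name"]
puts table.first["Date"]    # 2020-01-01
```

String#delete_prefix leaves the string alone when there is no BOM, so the same helper is safe to run on clean files too.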