Thursday, June 01, 2006

EOL character transformations in CSV files on Solaris

I had a CSV dump from SQL Server tables that I needed to load into Oracle using SQL*LDR. The first, and the most common problem faced is the presence of a ^M at the end of each line. This crept in when a file is FTPed from DOS to UNIX. the dos2unix command takes care of this very easily.

The second and more frustrating problem was that of newline characters in user-entered fields, such as the 'comments' field. Any sql*ldr script failed on such lines for obvious reasons. Looking at the octal dump of the file using 'od -c ', I noticed that user comments newlines had a '\n', while real EOL newlines had a '\r' and then a '\n'. This might vary in different flavors of Unix, but Solaris 8 showed the preceding behavior.

All I had to do was find all '\n' without a preceding '\r'. This is simple using a search and replace in sed with lookbehind (?<!char), right? Wrong. Solaris doesn't support lookbehind. (Their regex syntax by itself is pretty weird, but that needs more investigation). So I had to do a byte replace in two steps.
Step 1) Replace all '\n' with a space or whatever character you'd like.
Step 2) Now that only '\r' is remaining, replace this with '\n' (the real EOL newlines). The commands used were:

od -c <filename>

1) tr '\n' ' ' < <oldfile> > <tempfile>
2) tr '\r' '\n' < <tempfile> > <newfile>


Post a Comment

<< Home


View My Stats