I was using Windows' built-in program FC.EXE to do file comparisons, but it has a nasty line truncation bug (see below), so I wanted to switch to GNU diff. GNU diff has loads of features, its syntax is platform-independent, and it is probably 20 times faster (untested).
Problem: diff inserts a line indicator into its output, in front of the data being compared, to indicate which lines have been added, removed etc. I need clean data. FC doesn't do this. For example, below we have two files, identical except that the second has the following lines added to the bottom:
! AAA
! BBB
- no error
! CCC
First, compare them with FC:
C:\>fc monitor.ini monitor2.ini
Comparing files monitor.ini and MONITOR2.INI
***** monitor.ini
***** MONITOR2.INI
! AAA
! BBB
- no error
! CCC
*****
Now with diff:
C:\>diff -a monitor.ini monitor2.ini
96a97,100
> ! AAA
> ! BBB
> - no error
> ! CCC
Note the leading "> ". There is apparently no way to suppress this (?!!!) so I had to use SED, like this:
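diff -a monitor.ini monitor2.ini | sed -n --text "s/^> //p"
This says to sed, "output only the lines starting with "> ", and strip the "> " before output". The ^ anchors the match to the start of the line, so a "> " in the middle of a data line is left alone. The output of this is: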
! AAA
! BBB
- no error
! CCC
The search for all lines starting with "> " had the effect of removing the text output by diff itself, which is an added bonus.
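The plain output format has no switch for dropping the indicators, but GNU diff also documents line-format options that may avoid the sed step altogether. The following is a minimal sketch, assuming GNU diffutils running in a POSIX-style shell (for example under Cygwin or MSYS) rather than cmd.exe. The file names and contents are invented for the demonstration.

```shell
# Sketch only: assumes GNU diffutils and a POSIX-style shell.
# old.ini and new.ini are made-up stand-ins for monitor.ini and monitor2.ini.
tmp=$(mktemp -d)
printf 'alpha\nbeta\n' > "$tmp/old.ini"
printf 'alpha\nbeta\n! AAA\n- no error\n' > "$tmp/new.ini"

# Suppress removed and unchanged lines, and print added lines verbatim.
# %L expands to the line's contents including its trailing newline.
# diff exits with status 1 when the files differ, so accept that status.
added=$(diff -a --old-line-format='' --unchanged-line-format='' \
        --new-line-format='%L' "$tmp/old.ini" "$tmp/new.ini") || [ "$?" -eq 1 ]

printf '%s\n' "$added"
rm -rf "$tmp"
```

With these two files, only the added lines are printed, with no "> " prefix and no hunk header. Other diff implementations may not accept these options, in which case the sed pipe remains the portable route.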
diff has much more power than FC - the pipe to sed dumbs it down. The leading "> " would be useful if a human were inspecting the output visually, but I have a script processing the output, so I need clean data.
I could also have used awk to strip the leading "> ", like this:
diff -a monitor.ini monitor2.ini | awk "/^> / { print substr($0,3) }"
That says to awk, "find all lines that start with "> " and output each of them from the 3rd character onward".
NOTE: the leading "> " will turn into a leading "< " if the data is in file1 but not file2. If used with the above sed command, nothing will be output, as there are no lines containing "> ". This isn't a problem in my app - in fact it fixes a tricky bug - however it might be a problem in some other application. This should be fixable with a slight tweak to the sed regex. See next section for more on this.
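One possible form of that tweak is to accept either indicator at the start of the line. This is only a sketch, assuming a POSIX-style shell with diff and sed available; the file names and contents are invented, and the extra lines are deliberately placed in the first file.

```shell
# Sketch only: assumes a POSIX-style shell with diff and sed on the PATH.
# first.ini and second.ini are invented; the extra lines live in the FIRST file.
tmp=$(mktemp -d)
printf 'alpha\n! AAA\n! BBB\n' > "$tmp/first.ini"
printf 'alpha\n' > "$tmp/second.ini"

# Capture diff's output once; exit status 1 just means "files differ".
raw=$(diff -a "$tmp/first.ini" "$tmp/second.ini") || [ "$?" -eq 1 ]

# Original pattern: only "> " lines, so nothing survives here.
only_second=$(printf '%s\n' "$raw" | sed -n 's/^> //p')

# Tweaked pattern: accept either indicator at the start of the line.
either_side=$(printf '%s\n' "$raw" | sed -n 's/^[<>] //p')

printf 'only_second=[%s]\n' "$only_second"
printf 'either_side=[%s]\n' "$either_side"
rm -rf "$tmp"
```

The catch is that the tweaked pattern no longer says which file a line came from, so it only suits scripts that do not care about direction.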
FC outputs differences in blocks, using lines marked with ***** to inform the user which file contains the differences being displayed. In contrast, diff outputs differences line-by-line, and uses line indicators to inform the user which file contains the differences being displayed.
One consequence of this is that if the order of the filenames on the commandline is reversed, with FC there is almost no change to the output - but with diff, the line indicators on every line change direction. This means that with diff, a script must pay attention to the filename ordering, so that the line indicator is facing in the direction the script is expecting.
To see this in action, let's return to the two files, identical, except that the second has the following lines added to the bottom:
! AAA
! BBB
- no error
! CCC
First, compare them with FC:
C:\>fc monitor.ini monitor2.ini
Comparing files monitor.ini and MONITOR2.INI
***** monitor.ini
***** MONITOR2.INI
! AAA
! BBB
- no error
! CCC
*****
Now with diff:
C:\>diff -a monitor.ini monitor2.ini
96a97,100
> ! AAA
> ! BBB
> - no error
> ! CCC
Note the leading "> ". Now, let's reverse the filename ordering on the commandline, and re-compare with FC:
C:\>fc monitor2.ini monitor.ini
Comparing files monitor2.ini and MONITOR.INI
***** monitor2.ini
! AAA
! BBB
- no error
! CCC
***** MONITOR.INI
*****
Note that while the asterisk lines have moved around, the actual lines of data are identical. Now to re-compare with diff:
C:\>diff -a monitor2.ini monitor.ini
97,100d96
< ! AAA
< ! BBB
< - no error
< ! CCC
The line indicators are now facing in the opposite direction.
This occurs because the line indicator "<" means "this line is in the first file but not in the second file", while the line indicator ">" means "this line is in the second file but not in the first file".
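A script can make the dependency on filename order explicit by naming the indicator it expects. The sketch below assumes a POSIX-style shell with diff and sed available; strip_marker is a hypothetical helper, and the file names and contents are invented.

```shell
# Sketch only: assumes a POSIX-style shell with diff and sed on the PATH.
# strip_marker is a hypothetical helper: $1 is "<" or ">", $2 and $3 are files.
strip_marker() {
    # Exit status 1 from diff only means "files differ", so it is not a failure.
    { diff -a "$2" "$3" || [ "$?" -eq 1 ]; } | sed -n "s/^$1 //p"
}

tmp=$(mktemp -d)
printf 'alpha\n' > "$tmp/short.ini"
printf 'alpha\n! AAA\n! BBB\n' > "$tmp/long.ini"

# Same data, opposite argument order: the marker to strip flips with it.
forward=$(strip_marker '>' "$tmp/short.ini" "$tmp/long.ini")
reversed=$(strip_marker '<' "$tmp/long.ini" "$tmp/short.ini")

printf '%s\n' "$forward"
rm -rf "$tmp"
```

Both calls recover the same lines, which shows that the indicator and the filename order have to be swapped together.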
One potential gotcha: while FC outputs something if the files are identical, diff does not (by default). For example, if the files are the same:
C:\>fc mailx.err mailx.err
Comparing files mailx.err and MAILX.ERR
FC: no differences encountered

C:\>diff -a mailx.err mailx.err

C:\>
diff can be told to output something, using the -s switch:
C:\>diff -a -s mailx.err mailx.err
Files mailx.err and mailx.err are identical

C:\>
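A script does not have to rely on printed text at all, because diff also reports the result through its exit status: 0 when the files are identical, 1 when they differ, and a higher value when something went wrong. A minimal sketch, assuming a POSIX-style shell and a diff that accepts -q (GNU diff does); compare is a hypothetical helper and the files are invented.

```shell
# Sketch only: assumes a POSIX-style shell with diff on the PATH.
# compare is a hypothetical helper that maps diff's exit status to a word.
compare() {
    status=0
    diff -a -q "$1" "$2" > /dev/null 2>&1 || status=$?
    case $status in
        0) echo identical ;;
        1) echo different ;;
        *) echo error ;;      # 2 or more: diff could not do the comparison
    esac
}

tmp=$(mktemp -d)
printf 'same\n' > "$tmp/a.err"
printf 'same\n' > "$tmp/b.err"
printf 'other\n' > "$tmp/c.err"

r_same=$(compare "$tmp/a.err" "$tmp/b.err")
r_diff=$(compare "$tmp/a.err" "$tmp/c.err")
r_missing=$(compare "$tmp/a.err" "$tmp/no-such-file")
printf '%s %s %s\n' "$r_same" "$r_diff" "$r_missing"
rm -rf "$tmp"
```

Checking the status keeps "no output" from being mistaken for "nothing happened", which matters when a missing file would otherwise look like a clean comparison.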
To see FC's line truncation bug, consider two files, identical except that one has an extra line added to the bottom:
! mailx 0.17 Apr 30, 2009 12:20:34 scan_PMM: warning: mail from [abcdefgh@myworld.abc.xyz] found in folder [FOL0756C: ___junkmail [spampal]]
FC will produce this output:
C:\>fc mailx.bak mailx.err
Comparing files mailx.bak and MAILX.ERR
***** mailx.bak
***** MAILX.ERR
! mailx 0.17 Apr 30, 2009 12:20:34 scan_PMM: warning: mail from [abcdefgh@myworld.abc.xyz] found in folder [FOL0756C: ___junkma
l [spampal]]
*****
Note how the data was mangled at the 128th character: the "i" in "junkmail" is missing, and FC starts a new line there before outputting the rest of the data. In contrast, diff handles this fine:
C:\>diff -a mailx.bak mailx.err
180a181,183
> ! mailx 0.17 Apr 30, 2009 12:20:34 scan_PMM: warning: mail from [abcdefgh@myworld.abc.xyz] found in folder [FOL0756C: ___junkmail [spampal]]
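The same property can be checked mechanically by pushing a line much longer than 128 characters through diff and sed and comparing it with the original. This is a sketch under stated assumptions: a POSIX-style shell with diff and sed available, and invented file contents (only the file names are borrowed from the example above).

```shell
# Sketch only: assumes a POSIX-style shell with diff and sed on the PATH.
tmp=$(mktemp -d)
long_line=$(printf '%0200d' 0)          # 200 characters, well past FC's 128
printf 'alpha\n' > "$tmp/mailx.bak"
printf 'alpha\n%s\n' "$long_line" > "$tmp/mailx.err"

# Exit status 1 from diff only means "files differ".
raw=$(diff -a "$tmp/mailx.bak" "$tmp/mailx.err") || [ "$?" -eq 1 ]
recovered=$(printf '%s\n' "$raw" | sed -n 's/^> //p')

# The recovered line should match the original character for character.
if [ "$recovered" = "$long_line" ]; then verdict=intact; else verdict=damaged; fi
printf '%s (%d characters)\n' "$verdict" "${#recovered}"
rm -rf "$tmp"
```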
Of course not. FC.EXE is a cheap-assed imitation of diff, just as FIND.EXE is a cheap-assed imitation of grep (see FIND sux, GREP rulez).
Windows is buggy and underpowered and this is just another example of that. I'm reminded of a quote I read recently on Wiki:
"Those who don't understand UNIX are condemned to reinvent it, poorly." – Henry Spencer