2 files: finding lines that are in only 1 of those files

Today, I learned a trick at work.

I had a list of phone numbers in a file, and wanted to obtain a second list of phone numbers: the ones that were in a table in our database, but not in the given file.

Of course, you can solve this in SQL, for example with a “NOT IN” clause. However, with tens of thousands of phone numbers involved, this would probably be rather slow. Besides, I’m not very good at SQL, and I don’t like reading the PostgreSQL documentation.

A colleague mentioned a trick with cat, sort and uniq. If you do cat filename | sort | uniq -u, only lines that appear once in filename will be printed. So, if you have 2 files, file1 and file2, neither of them having duplicate lines, and you want to know which lines are in file2 but not in file1, you do this:

cat file1 file1 file2 | sort | uniq -u

Lines that appear only in file1 are filtered out by uniq’s -u option, as are lines that appear both in file1 and file2. So you end up with the lines that appear only in file2.

I made file containing a list of all phone numbers in our database, and used the above trick to get the phone numbers that were in that file, but not in the file with the limited list of phone numbers. Fast, and easy.

This entry was posted in linux, unix. Bookmark the permalink.

One Response to 2 files: finding lines that are in only 1 of those files

  1. Arjen says:

    The comm utility is made for this sort of thing. It takes two sorted files and displays the lines only in file 1, only in file 2 and the common lines in 3 columns. To get only the lines only in file 2, suppress the display of the other 2 columns:

    comm -13 <(sort file1) <(sort file2)

Leave a Reply

Your email address will not be published.