Thursday, September 23, 2010

AWK

Hello everyone,

Two days ago I was confronted with a simple task: we have around 50 CSV files that should be concatenated into one file. Every file has a header on its first line, and every header except the first one should be dropped. At the time I solved the problem in a different way from the one I will present today: I streamed the files through sed to delete all the headers, put the result into a temp file, and then added a header to that file. Today, though, I was thinking that this could be a good opportunity to use AWK [1] for the task and to write to the list about it.
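For comparison, the sed approach could look roughly like the sketch below. The exact commands I used are not recorded here, so this is only an illustration: the sample files, the temp file name body.tmp, and the assumption that every header starts with "set," are all made up for the example.

```shell
# Hypothetical sample data; the real task had around 50 files.
seddir=$(mktemp -d)
cd "$seddir" || exit 1
printf 'set,value\n1,a\n2,b\n' > one.csv
printf 'set,value\n3,c\n' > two.csv

# Delete every header line (assumed to start with "set,") and
# collect the remaining records in a temp file.
cat *.csv | sed '/^set,/d' > body.tmp

# Put a single header back on top, taken from the first file.
{ head -n 1 one.csv; cat body.tmp; } > all
cat all
```

This needs a temp file and two steps, which is part of why a single pass over the data is attractive.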

Let's see how one of its creators, Alfred Aho, defines an AWK program: "An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed."

So we could create an AWK program that prints the header only once (a very easy task with the use of a flag), discards all the other headers, and prints all the other lines. This is the solution that I found:


/^set,/ { if (!printed) { print; printed = 1 } next }
        { print }

And it was run this way:

awk -f p *.csv > all

p is the file with the AWK program.
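To try this without the real data, a small self-contained run might look like the following. The two sample files and their contents are hypothetical; only the program in p and the awk command line come from above.

```shell
# Hypothetical sample data: two small CSV files whose header starts with "set,".
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'set,value\n1,a\n2,b\n' > one.csv
printf 'set,value\n3,c\n' > two.csv

# p holds the two-rule AWK program shown above.
cat > p <<'EOF'
/^set,/ { if (!printed) { print; printed = 1 } next }
        { print }
EOF

# Same invocation as in the post; the shell expands *.csv in sorted order.
awk -f p *.csv > all
cat all
```

With these two files, all ends up with one header line followed by the three data lines.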

The first line of the program matches a header. It prints the header and sets a flag variable only the first time; in every case it then stops processing the current record and reads the next record, starting again with the first pattern. The second line of the program has no pattern, so it runs for every record that reaches it and prints that record (line). The only records that never reach it are the headers.
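The program above depends on every header starting with "set,". As a variation, and not the one I actually ran, the built-in variables NR (records read so far in total) and FNR (records read so far in the current file) can do the same job, under the assumption that the header is always exactly the first line of each file. The sample files below are hypothetical.

```shell
# Hypothetical sample data.
fnrdir=$(mktemp -d)
cd "$fnrdir" || exit 1
printf 'set,value\n1,a\n2,b\n' > one.csv
printf 'set,value\n3,c\n' > two.csv

# FNR == 1 is the first line of the current file; NR > 1 means it is not
# the very first line read overall, so it is a repeated header: skip it.
awk 'FNR == 1 && NR > 1 { next } { print }' *.csv > all
cat all
```

This version needs no flag and does not care what the header looks like, but it would drop a real data line from any file that has no header.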

These are the times for a total of 160K lines:

real    0m0.116s
user    0m0.060s
sys     0m0.052s

Have a good night and remember, TIMTOWTDI.


--
J. E. Aneiros
GNU/Linux User #190716 en http://counter.li.org
perl -e '$_=pack(c5,0105,0107,0123,0132,(1<<3)+2);y[A-Z][N-ZA-M];print;'
PK fingerprint: 5179 917E 5B34 F073 E11A  AFB3 4CB3 5301 4A80 F674
