Understanding Enough awk to Search Piles of Files and Text
Command line tools are obviously useful, but more often than not they're jam-packed with so much functionality that it can be hard to get started. "Well, check the man page, fool." Oh, okay. Thanks, that's helpful. Don't get me wrong, man pages are an excellent resource, but a lot of the time they're just an overwhelming alphabet soup.
The thing is, for a lot of these tools, you don't need to know 100% of it to be productive. You can learn the small 20% that lets you get 80% of the work done. And so today, let's talk about awk, the smarter brother of grep (depending upon who you ask; their mother loves them both, though).
What is awk?
It's just a text processing tool. It does a ton. But what it's going to help you do day-to-day is more than likely:
- Searching through long lists of files
- Searching through long files
- Getting and molding output from those long files
A more practical idea of what it is can be explained through a scenario:
Suppose you have a long, rotating log file named entries.log with tons of ... logs. On the other hand, you have Jerry the Developer, who's been wreaking havoc in the cloud with his mighty blade of code, leaving errors left and right. You need to search through your entries.log and grab the most important ones so that you can yell at him.
Obviously there are a ton of ways to analyze this, but this is an awk post. So let's move on to what it could do for you in this scenario, and by the end you'll know how it works and be able to imagine all sorts of things you can do with it. We'll use the following file as our example to work on:
entries.log
Wed Jul 16 2019 12:35:23 GMT-0700 Success: Some Message
Wed Jul 17 2019 12:35:23 GMT-0700 Error: Some Message
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Important Message
Wed Jul 18 2019 12:37:14 GMT-0700 Success: Some Message
Using awk
So, awk separates each line of text into "columns." Those columns are made by splitting the line on a delimiter. By default, that delimiter is whitespace (a space, or a run of spaces or tabs). So, for this line:
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Message
There are 9 columns. Why? Because there are 9 different pieces of text, divided by spaces. awk recognizes each of those as its own column:
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Message
^   ^   ^  ^    ^        ^        ^      ^    ^
$1  $2  $3 $4   $5       $6       $7     $8   $9
To see this in action, if you ran:
$ echo 'Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Message' | awk '{print $1}'
You'd get back:
Wed
The {print $1} is an "action statement." Basically, what actions do you want awk to take on the text that we feed it? In this case, we want it to print the first column of our output. There are a ton of other things you can do with these action statements and functions, but let's not get too far from foundations here.
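If you want to poke at columns a bit more before moving on, here's a small sketch. The line variable is just a hypothetical stand-in holding one line in our log's format; NF is awk's built-in count of columns on the current line.

```shell
# Hypothetical one-line sample, copied from the log format used in this post.
line='Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Message'

# Print several columns at once; the comma puts a space between them.
echo "$line" | awk '{print $2, $3, $4}'

# NF is the number of columns on the line, so $NF is the last column.
echo "$line" | awk '{print NF, $NF}'
```

The first command should give you Jul 17 2019, and the second 9 Message.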
How it's actually useful
Okay, that's nice and all, but how does this help? Well, it lets you search through files in a much more organized way. So let's revisit our entries.log file that has these contents:
Wed Jul 16 2019 12:35:23 GMT-0700 Success: Some Message
Wed Jul 17 2019 12:35:23 GMT-0700 Error: Some Message
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Important Message
Wed Jul 18 2019 12:37:14 GMT-0700 Success: Some Message
And let's say that this file is constantly being populated with entries and we want to search for only the error messages. How would we do it?
Well, right now we've seen an "action statement" to print a column for the search output. But the other part of awk is the "pattern." What "pattern" do we want awk to search for? Well, if we wanted it to only look for lines with the word "Error" in them...
$ cat entries.log | awk '/Error/'
And here, as you can see, the "pattern" is just a regular expression. Granted it can be other things, but we'll stick with this one type for now.
So in plain English this is saying, "Hey cat, print the output of the entries.log file and pipe it to awk." And then awk gets it and says, "Okay, now I'm going to look for all lines that have a match of the regular expression /Error/."
The result?
$ cat entries.log | awk '/Error/'
Wed Jul 17 2019 12:35:23 GMT-0700 Error: Some Message
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Important Message
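As for the "other things" a pattern can be, one of them is a comparison against a single column. Here's a minimal sketch, assuming the same four log lines as above; sample_log is a hypothetical helper I made up so the snippet runs without needing a real entries.log on disk.

```shell
# sample_log is a hypothetical helper that stands in for `cat entries.log`.
sample_log() {
  cat <<'EOF'
Wed Jul 16 2019 12:35:23 GMT-0700 Success: Some Message
Wed Jul 17 2019 12:35:23 GMT-0700 Error: Some Message
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Important Message
Wed Jul 18 2019 12:37:14 GMT-0700 Success: Some Message
EOF
}

# With the default whitespace delimiter, $7 holds the "Error:" or "Success:"
# label, so this only matches lines whose label is exactly "Error:".
sample_log | awk '$7 == "Error:"'
```

On this sample it prints the same two lines as /Error/ does. The difference would only show up if the word Error appeared somewhere else, like inside a Success message. Also, awk accepts a file name as its last argument, so awk '/Error/' entries.log gets you the same result without cat.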
What if you just wanted to see the error messages? Well, then you'd need to combine a pattern AND an action statement like so:
$ cat entries.log | awk -F "Error: " '/Error/ {print $2}'
Some Message
Some Important Message
In plain English, cat does its usual bit. It outputs the contents of the entries.log file. The | once again pipes the output to our awk command. Once awk gets the output, it says,
"Okay, given this file, I'm going to use Error: (with the space) as the delimiter for my columns. I know I usually use a normal space, but the -F option tells me to use whatever this gung-ho developer passes in as the delimiter."
Which means that awk is now going to see an error line and its columns like this:
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Message
^                                        ^
$1                                       $2
"I'm going to search through all of the output from entries.log and look for lines that match the expression /Error/. For any matches, I'm going to print the second column of that line."
Meaning that our command and its output will look like so:
$ cat entries.log | awk -F "Error: " '/Error/ {print $2}'
Some Message
Some Important Message
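To see what the custom delimiter actually does to each line, here's a rough sketch under the same assumptions as the rest of this post. sample_log is a hypothetical helper standing in for cat entries.log, and NF is awk's built-in column count.

```shell
# sample_log is a hypothetical helper that stands in for `cat entries.log`.
sample_log() {
  cat <<'EOF'
Wed Jul 16 2019 12:35:23 GMT-0700 Success: Some Message
Wed Jul 17 2019 12:35:23 GMT-0700 Error: Some Message
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Important Message
Wed Jul 18 2019 12:37:14 GMT-0700 Success: Some Message
EOF
}

# Lines that contain "Error: " split into 2 columns; the Success lines
# never see the delimiter, so they stay as 1 column.
sample_log | awk -F "Error: " '{print NF}'

# The delimiter itself is dropped, so $1 is everything before it and
# $2 is everything after it. The leading space in " Error: " keeps $1 tidy.
sample_log | awk -F " Error: " '/Error/ {print $2 " <- " $1}'
```

The first command prints 1, 2, 2, 1 (one number per line), which is also why the /Error/ pattern matters: without it, {print $2} would print an empty line for every Success entry.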
And then of course, if you were looking for a specific message, maybe an important one, you can chain patterns together:
$ cat entries.log | awk -F "Error: " '/Error/ && /Important/ {print $2}'
Some Important Message
Granted, you can always write a more complex regular expression. This one is just more readable for our purposes.
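For comparison, here's roughly what the single-regular-expression version could look like, plus one more way to combine patterns. As before, sample_log is a hypothetical helper standing in for cat entries.log, and this is a sketch rather than the one true way to write it.

```shell
# sample_log is a hypothetical helper that stands in for `cat entries.log`.
sample_log() {
  cat <<'EOF'
Wed Jul 16 2019 12:35:23 GMT-0700 Success: Some Message
Wed Jul 17 2019 12:35:23 GMT-0700 Error: Some Message
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Important Message
Wed Jul 18 2019 12:37:14 GMT-0700 Success: Some Message
EOF
}

# One regular expression: "Error: " followed, somewhere later, by "Important".
sample_log | awk -F "Error: " '/Error: .*Important/ {print $2}'

# Patterns also combine with || (or) and ! (not). This keeps every line
# that is not a Success line.
sample_log | awk '!/Success/'
```

The first command prints Some Important Message, same as the && version. The second prints the two Error lines in full.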
Okay, and so that's neat. We can now take a file and search it as if each line in the file is a row. And each row has columns that are delimited with a space by default, or with whatever you want by using -F. We also know that, when giving this stuff over to awk, we can have it use patterns to filter the input and then action statements to do stuff with that output.
So, one more bonus thing. What if you just want to count all the errors in our file so that you can yell at Jerry the Developer for his incompetence?
$ cat entries.log | awk -F "Error: " '/Error/ {print $2}' | wc -l
2
The wc -l part of the command takes the lines that awk found and counts them. In our case, there are 2 lines that match our awk criteria, as reported by wc.
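If you'd rather skip the extra pipe, awk can keep the count itself. This is a minimal sketch, again using the hypothetical sample_log helper in place of cat entries.log; count is just a variable name I picked.

```shell
# sample_log is a hypothetical helper that stands in for `cat entries.log`.
sample_log() {
  cat <<'EOF'
Wed Jul 16 2019 12:35:23 GMT-0700 Success: Some Message
Wed Jul 17 2019 12:35:23 GMT-0700 Error: Some Message
Wed Jul 17 2019 12:37:14 GMT-0700 Error: Some Important Message
Wed Jul 18 2019 12:37:14 GMT-0700 Success: Some Message
EOF
}

# Bump a counter on every matching line, then print it once at the end.
# The END block runs after the last line of input has been read.
# "count + 0" makes awk print 0 instead of an empty line when nothing matched.
sample_log | awk '/Error/ {count++} END {print count + 0}'
```

On our sample this prints 2, the same answer wc -l gave us.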
Summary
So the process of using awk to search through files (or directories) is as follows:
1. Give it some text output.
$ cat entries.log
2. Pipe it to awk.
$ cat entries.log | awk
3. Tell awk what you want to use as columns in the lines.
$ cat entries.log | awk -F "Error: "
Now it'll treat the full string of Error: (with the trailing space) as its delimiter for lines instead of a single space.
4. Give it a pattern to search for.
$ cat entries.log | awk -F "Error: " '/Error/'
5. Give it something to do with the found output.
$ cat entries.log | awk -F "Error: " '/Error/ {print $2}'
Now it'll both find lines in entries.log that have the word Error in them and print the second column of each line.
6. Yell at Jerry the Developer:
"Dammit Jerry, there's 2 errors today."
--
Alrighty, there's our quick, practical overview of awk. Yes, there's a ton more that you can do with it (as outlined in the awk man page). But even with just this basic knowledge you can get done 80% of what you need to.
J Cole Morrison
http://start.jcolemorrison.com
Developer Advocate @HashiCorp, DevOps Enthusiast, Startup Lover, Teaching at awsdevops.io