2023-02-08 07:45:59 +00:00
---
categories:
- Programming Languages
tags:
- shell
- awk
---

# Awk

> Awk is a programming language designed for text processing and data
> extraction. It was created in the 1970s and remains widely used today for
> tasks such as filtering and transforming text data, generating reports, and
> performing basic calculations. Awk is known for its simplicity and
> versatility, making it a popular tool for Unix system administrators and data
> analysts.

## Invocation

We can pass an `awk` program directly on the command line, or reference a `.awk`
file for more elaborate scripts:

```bash
# CLI: the program is given inline; input files are separated by spaces
awk [program] file1 file2 file3

# Script file: the program is read from the file named after -f
awk -f [ref_to_script_file] file1 file2 file3
```

We can also pipe to it. This piped command receives the output of the `echo`
command and prints the value of the last field of each record:

```bash
echo -e "1 2 3 5\n2 2 3 8" | awk '{print $(NF)}'
# 5
# 8
```
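
As a rough sketch of the two invocation styles above, the following runs the
same one-line program once inline and once from a script file. The file names
`numbers.txt` and `last_field.awk` are hypothetical and are created in a
temporary directory purely for the illustration.

```bash
# Hypothetical scratch files, created in a temporary directory for this sketch.
workdir=$(mktemp -d)
printf '1 2 3 5\n2 2 3 8\n' > "$workdir/numbers.txt"

# Store the program in a script file so it can be passed with -f.
printf '{ print $NF }\n' > "$workdir/last_field.awk"

# Run the same program inline and from the script file.
inline_out=$(awk '{ print $NF }' "$workdir/numbers.txt")
script_out=$(awk -f "$workdir/last_field.awk" "$workdir/numbers.txt")

echo "$inline_out"   # 5 and 8, one per line
echo "$script_out"   # same output as the inline form
rm -r "$workdir"
```

Keeping the program in a file mostly pays off once it grows beyond a single
rule, since it avoids shell quoting issues.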

## Syntactic structure

`awk` is a line-oriented language.

An `awk` program consists of a sequence of **pattern-action** statements and
optional function definitions.

For most of the examples we will use this list as the input:

```
cloud
existence
ministerial
falcon
town
sky
top
bookworm
bookcase
war
Peter 89
Lucia 95
Thomas 76
Marta 67
Joe 92
Alex 78
Sophia 90
Alfred 65
Kate 46
```

> `awk` particularly lends itself to inputs that are structured by whitespace or
> in columns, such as the output of commands like `ls` and `grep`.

### Patterns and actions

The basic structure of an `awk` script is as follows:

```
pattern {action}
```

A **pattern** is what you want to match against. It can be a regular expression
or any other expression that evaluates to true or false, such as a string or
numeric comparison. The **action** is the process you want to execute against
the lines in the input that match the pattern.

The following script prints the line that matches `Joe`:

```bash
awk '/Joe/ {print}' list.txt
```

`/Joe/` is the pattern and `{print}` is the action.
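
To make the two halves concrete, here is a small sketch of a rule with both
parts, a rule with only an action (which, as in the piped example earlier, runs
for every line), and a rule with only a pattern. The input file `people.txt` is
a hypothetical three-line sample created just for the illustration.

```bash
# Hypothetical sample input, created in a temporary directory.
workdir=$(mktemp -d)
printf 'Peter 89\nJoe 92\nKate 46\n' > "$workdir/people.txt"

# Pattern and action: print only the lines that match /Joe/.
both=$(awk '/Joe/ { print }' "$workdir/people.txt")

# Action only: with no pattern, the action runs for every line.
action_only=$(awk '{ print $1 }' "$workdir/people.txt")

# Pattern only: with no action, matching lines are printed unchanged.
pattern_only=$(awk '/^K/' "$workdir/people.txt")

echo "$both"          # Joe 92
echo "$action_only"   # Peter, Joe, Kate (one per line)
echo "$pattern_only"  # Kate 46
rm -r "$workdir"
```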

### Lines, records, fields

![](/img/awk-outline.png)

When `awk` receives a file it divides the input into **records**. By default,
each line is one record.

Each record is broken up into a sequence of **fields**, which are separated by
whitespace by default.

The fields are accessed by special variables:

- `$1` reads the first field, `$2` reads the second field and so on.
- The variable `$0` refers to the whole record.

So, in the picture, `cloud existence ministerial` correspond to `$1` `$2` `$3`.
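
The field variables can be seen side by side in a short sketch. The two-line
`scores.txt` here is a hypothetical excerpt of the name and score pairs from the
sample input, written to a temporary directory for the illustration.

```bash
# Hypothetical two-line excerpt of the name/score input.
workdir=$(mktemp -d)
printf 'Peter 89\nLucia 95\n' > "$workdir/scores.txt"

# $1 and $2 are the first and second fields; $0 is the whole record.
fields_out=$(awk '{ print "name=" $1, "score=" $2, "record=[" $0 "]" }' "$workdir/scores.txt")

echo "$fields_out"
# name=Peter score=89 record=[Peter 89]
# name=Lucia score=95 record=[Lucia 95]
rm -r "$workdir"
```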

## Basic examples

**_Match a pattern_**

```bash
awk '/book/ { print }' list.txt
# bookworm
# bookcase
```

**_Print all words that are longer than five characters_**

```bash
awk 'length($1) > 5 { print $0 }' list.txt
```

For every line, if the first field (we only have one field per line) is longer
than five characters, print the line. `awk` tests the pattern against every line
automatically; `$0` is the variable that holds the whole matching line.

We actually don't need to include the `{ print $0 }` action, as this is the
default behaviour. We could have just written `awk 'length($1) > 5' list.txt`.
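
The equivalence of the explicit and implicit forms can be checked with a quick
sketch. It assumes `list.txt` holds only the ten single words from the sample
input; the temporary directory is just scaffolding for the illustration.

```bash
# Assumption: list.txt contains only the ten single words of the sample input.
workdir=$(mktemp -d)
printf '%s\n' cloud existence ministerial falcon town sky top bookworm bookcase war > "$workdir/list.txt"

# Explicit action versus the implied default action.
explicit=$(awk 'length($1) > 5 { print $0 }' "$workdir/list.txt")
implicit=$(awk 'length($1) > 5' "$workdir/list.txt")

echo "$implicit"
# existence
# ministerial
# falcon
# bookworm
# bookcase
rm -r "$workdir"
```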

**_Print all words that do not have three characters_**

```bash
awk '!(length($1) == 3)' list.txt
```

Here we negate by prepending the pattern with `!` and wrapping it in
parentheses. The parentheses are needed because `!` binds more tightly than
`==`.

**_Return words that are either three characters or four characters in length_**

```bash
awk '(length($1) == 3) || (length($1) == 4)' list.txt
```

Here we use the logical OR to match against more than one pattern. Notice that
when we combine patterns with a Boolean operator such as NOT or OR, we wrap each
pattern in parentheses. With `!` this is required to negate the whole
comparison; with `||` it is optional, but it makes the grouping explicit.
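
A short sketch of how the Boolean operators group, assuming again that
`list.txt` holds only the ten single words from the sample input (the file is
recreated in a temporary directory for the illustration):

```bash
# Assumption: list.txt contains only the ten single words of the sample input.
workdir=$(mktemp -d)
printf '%s\n' cloud existence ministerial falcon town sky top bookworm bookcase war > "$workdir/list.txt"

# OR: comparisons bind more tightly than ||, so both spellings select the same lines.
with_parens=$(awk '(length($1) == 3) || (length($1) == 4)' "$workdir/list.txt")
without_parens=$(awk 'length($1) == 3 || length($1) == 4' "$workdir/list.txt")

# NOT: the parentheses matter here, because ! binds more tightly than ==.
not_three=$(awk '!(length($1) == 3)' "$workdir/list.txt")

echo "$with_parens"   # town, sky, top, war (one per line)
echo "$not_three"     # the seven words that are not three characters long
rm -r "$workdir"
```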

**_Match and string-interpolate the output_**

```bash
awk 'length($1) > 0 {print $1, "has", length($1), "chars"}' list.txt

# cloud has 5 chars
# existence has 9 chars
# ministerial has 11 chars
# ...
```

**_Match against a numerical property_**

```bash
awk '$2 >= 90 { print $0 }' scores.txt

# Lucia 95
# Joe 92
# Sophia 90
```

This returns the records whose second field is a number greater than or equal
to 90.

**_Match a field against a regular expression_**

```bash
awk '$1 ~ /^[bc]/ {print $1}' words.txt
```

This matches all the fields in the `$1` place that begin with 'b' or 'c'.

The tilde (`~`) is the regex match operator. Its right-hand side is treated as a
regular expression; for an exact string comparison, use `==` instead.
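
The difference between the two operators can be sketched as follows. The
contents of `words.txt` are not shown in these notes, so this example assumes it
holds the ten single words from the sample input.

```bash
# Assumption: words.txt contains the ten single words of the sample input.
workdir=$(mktemp -d)
printf '%s\n' cloud existence ministerial falcon town sky top bookworm bookcase war > "$workdir/words.txt"

# ~ treats the right-hand side as a regex: first field starting with b or c.
regex_match=$(awk '$1 ~ /^[bc]/ { print $1 }' "$workdir/words.txt")

# == compares the whole field as a string, so "book" alone matches nothing...
exact_book=$(awk '$1 == "book" { print $1 }' "$workdir/words.txt")

# ...while a complete word matches exactly one line.
exact_sky=$(awk '$1 == "sky" { print $1 }' "$workdir/words.txt")

echo "$regex_match"   # cloud, bookworm, bookcase (one per line)
echo "$exact_sky"     # sky
rm -r "$workdir"
```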

## Syntactic shorthands

- For a statement like `awk 'length($1) > 5 { print $0 }' list.txt`, we don't
  actually need to include the `{ print $0 }` action, as this is the default
  behaviour and it is implied. We could have just written
  `awk 'length($1) > 5' list.txt`.

https://zetcode.com/lang/awk/

## Built-in variables

### `NF`

The value of `NF` is the **number** of **fields** in the current record. `awk`
automatically updates the value of `NF` every time it reads a record.

No matter how many fields there are, the last field in a record can always be
represented by `$NF`.
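
A minimal sketch of `NF` and `$NF` on records of different lengths; the three
input lines are made up for the illustration:

```bash
# Each record has a different number of fields.
nf_out=$(printf 'a b c\nd e\nf\n' | awk '{ print NF, $NF }')

echo "$nf_out"
# 3 c
# 2 e
# 1 f
```

`NF` is the count, while `$NF` uses that count as a field index, which is why it
always lands on the last field.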

### `NR`

`NR` represents the **number** of **records** read so far. It is incremented
each time `awk` reads a record, so for a single input file it is the number of
the current line.
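
As a small sketch, `NR` can be used to number lines, to pick out one line, or to
count the input. The three-word input is made up for the illustration, and the
`END` rule used for the count is a special pattern, not covered in these notes,
that runs once after the last record has been read.

```bash
# Prefix each line with its record number.
numbered=$(printf 'cloud\nsky\ntop\n' | awk '{ print NR ": " $0 }')

# Use NR as a pattern to select only the second record.
second=$(printf 'cloud\nsky\ntop\n' | awk 'NR == 2')

# In an END rule, NR holds the total number of records read.
total=$(printf 'cloud\nsky\ntop\n' | awk 'END { print NR }')

echo "$numbered"
# 1: cloud
# 2: sky
# 3: top
echo "$second"   # sky
echo "$total"    # 3
```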

### `FS`

`FS` represents the **field separator**. The default field separator is a single
space, which `awk` treats specially: fields are split on any run of spaces or
tabs. We can specify a different separator with the `-F` flag. E.g. to separate
by comma:

```bash
awk -F, '{print $1 }' list.txt
```
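
The following sketch compares the default splitting with a comma separator. The
comma-separated input is hypothetical, and setting `FS` inside a `BEGIN` rule
(which runs before any input is read) is shown only as an alternative to `-F`
that these notes do not otherwise cover.

```bash
# Default FS: runs of spaces and tabs all count as a single separator.
default_nf=$(printf 'a   b\tc\n' | awk '{ print NF }')

# -F, splits on commas instead.
names=$(printf 'Peter,89\nLucia,95\n' | awk -F, '{ print $1 }')

# Assigning FS in a BEGIN rule has the same effect as passing -F.
scores=$(printf 'Peter,89\nLucia,95\n' | awk 'BEGIN { FS = "," } { print $2 }')

echo "$default_nf"   # 3
echo "$names"        # Peter, Lucia (one per line)
echo "$scores"       # 89, 95 (one per line)
```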