---
tags:
  - shell
  - awk
---

# Awk

> Awk is a programming language designed for text processing and data
> extraction. It was created in the 1970s and remains widely used today for
> tasks such as filtering and transforming text data, generating reports, and
> performing basic calculations. Awk is known for its simplicity and
> versatility, making it a popular tool for Unix system administrators and data
> analysts.

## Invocation

We can pass an `awk` program directly on the command line, or we can reference
`.awk` files for more elaborate scripts:

```bash
# CLI (input files are separated by spaces, not commas)
awk [program] file1 file2 file3

# Script file
awk -f [ref_to_script_file] file1 file2 file3
```
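
As a minimal sketch of the script-file form, the snippet below writes a small program to a file and runs it with `-f`. The file names `high.awk` and `scores.txt` and their contents are hypothetical, chosen only for illustration:

```bash
# Hypothetical script and input files, created in a temporary directory
dir=$(mktemp -d)
printf 'Peter 89\nLucia 95\nKate 46\n' > "$dir/scores.txt"

# The program lives in its own file instead of a quoted string
cat > "$dir/high.awk" <<'EOF'
# Print the name of anyone scoring 90 or more
$2 >= 90 { print $1 }
EOF

awk -f "$dir/high.awk" "$dir/scores.txt"
# Lucia
```

Keeping the program in a file avoids shell-quoting problems and lets us comment longer scripts.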

We can also pipe to it. This piped command receives output from the `echo`
command and prints the value in the last field for each record:

```bash
echo -e "1 2 3 5\n2 2 3 8" | awk '{print $(NF)}'
```

## Syntactic structure

`awk` is a line-oriented language.

An `awk` program consists of a sequence of **pattern { action }** statements and
optional function definitions.

For most of the examples we will use this list as the input:

```
cloud
existence
ministerial
falcon
town
sky
top
bookworm
bookcase
war
Peter 89
Lucia 95
Thomas 76
Marta 67
Joe 92
Alex 78
Sophia 90
Alfred 65
Kate 46
```

> `awk` particularly lends itself to inputs that are structured by whitespace or
> in columns, such as the output of commands like `ls` and `grep`.

### Patterns and actions

The basic structure of an `awk` script is as follows:

```
pattern {action}
```

A **pattern** is what you want to match against. It can be a regex (which may be
just a literal string such as `/Joe/`) or an expression such as a comparison.
The **action** is what you want to do with the lines in the input that match the
pattern.

The following script prints the line that matches `Joe`:

```bash
awk '/Joe/ {print}' list.txt
```

`/Joe/` is the pattern and `{print}` is the action.

### Lines, records, fields

![](static/awk-outline.png)

When `awk` reads its input it divides it into **records**. By default each line
is one record.

Each record is then broken up into a sequence of **fields**, split on whitespace
by default.

The fields are accessed by special variables:

- `$1` reads the first field, `$2` reads the second field and so on.

- The variable `$0` refers to the whole record.

So, in the picture, `cloud`, `existence` and `ministerial` correspond to `$1`,
`$2` and `$3`.
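
A quick way to see the field variables at work is to feed `awk` a single record. This is only a sketch; the `line` variable is a hypothetical stand-in for one line of input:

```bash
# One record with three whitespace-separated fields (hypothetical sample)
line='cloud existence ministerial'

printf '%s\n' "$line" | awk '{ print "record:", $0; print "first:", $1; print "third:", $3 }'
# record: cloud existence ministerial
# first: cloud
# third: ministerial
```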

## Basic examples

**_Match a pattern_**

```bash
awk '/book/ { print }' list.txt
# bookworm
# bookcase
```

**_Print all words that are longer than five characters_**

```bash
awk 'length($1) > 5 { print $0 }' list.txt
```

For every line, if the first field (our word lines only have one field) is
longer than 5 characters, print the line. `$0` refers to the whole record, so
`print $0` outputs the matching line in full.

We don't actually need to include the `{ print $0 }` action, as printing the
record is the default behaviour when a pattern has no action. We could have just
written `awk 'length($1) > 5' list.txt`.

**_Print all words that do not have three characters_**

```bash
awk '!(length($1) == 3)' list.txt
```

Here we negate by prepending the pattern with `!` and wrapping it in
parentheses.

**_Return words that are either three characters or four characters in length_**

```bash
awk '(length($1) == 3) || (length($1) == 4)' list.txt
```

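Here we use the logical OR (`||`) to match against more than one pattern. Notice
that we wrap each pattern in parentheses when we combine it with a Boolean
operator such as NOT or OR. With `!` the parentheses are required, because `!`
binds more tightly than `==`; with `||` they are optional but make the grouping
explicit.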

**_Match and string-interpolate the output_**

```bash
awk 'length($1) > 0 { print $1, "has", length($1), "chars" }' list.txt

# cloud has 5 chars
# existence has 9 chars
# ministerial has 11 chars
# ...
```

**_Match against a numerical property_**

```bash
awk '$2 >= 90 { print $0 }' scores.txt

# Lucia 95
# Joe 92
# Sophia 90
```

This returns the records whose second field is a number greater than or equal
to 90.

**_Match a field against a regular expression_**

```bash
awk '$1 ~ /^[bc]/ {print $1}' words.txt
```

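This matches every record whose first field (`$1`) begins with 'b' or 'c'.
Characters inside `[...]` are listed without separators; a comma there would be
matched literally.

The tilde (`~`) is the regex match operator. Its right-hand side is treated as a
regex; to test for an exact string, use `==` instead.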
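
The difference between `~` and `==` can be sketched with a small sample. The `words` list below is hypothetical; `!~` is the negated form of the match operator:

```bash
# Hypothetical sample input, one word per line
words='bookworm
bookcase
cloud
book'

# Regex match: first field starts with "book"
printf '%s\n' "$words" | awk '$1 ~ /^book/'
# bookworm
# bookcase
# book

# Exact string comparison: first field is exactly "book"
printf '%s\n' "$words" | awk '$1 == "book"'
# book

# Negated regex match: first field does not start with "book"
printf '%s\n' "$words" | awk '$1 !~ /^book/'
# cloud
```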

## Syntactic shorthands

- In a statement like `awk 'length($1) > 5 { print $0 }' list.txt`, we don't
  actually need to include the `{ print $0 }` action: printing the whole record
  is the default behaviour and is implied. We could have just written
  `awk 'length($1) > 5' list.txt`.
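
To check that the implied action really is equivalent, we can compare both forms on the same input. This is a minimal sketch using a hypothetical `words` sample rather than `list.txt`:

```bash
# Hypothetical sample input, one word per line
words='cloud
existence
sky
bookworm'

# Same pattern, with and without the explicit action
explicit=$(printf '%s\n' "$words" | awk 'length($1) > 5 { print $0 }')
implied=$(printf '%s\n' "$words" | awk 'length($1) > 5')

[ "$explicit" = "$implied" ] && echo "same output"
printf '%s\n' "$implied"
# same output
# existence
# bookworm
```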

https://zetcode.com/lang/awk/

## Built-in variables

### `NF`

The value of `NF` is the **number** of **fields** in the current record. `Awk`
automatically updates the value of `NF` every time it reads a record.

No matter how many fields there are, the last value in a record can always be
represented by `$NF`.
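
For example, with records of different lengths (the `rows` sample is hypothetical), `NF` gives the field count and `$NF` gives the last field of each record:

```bash
# Two records with a different number of fields (hypothetical sample)
rows='1 2 3 5
2 2 8'

printf '%s\n' "$rows" | awk '{ print NF, $NF }'
# 4 5
# 3 8
```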

### `NR`

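`NR` holds the **number** of **records** read so far, so while a record is being
processed it is that record's number (the line number, by default). `awk`
increments it each time it reads a record.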
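
A small sketch of `NR` in use, with a hypothetical `names` sample. The last command uses `END`, a special pattern whose action runs after the final record has been read, at which point `NR` holds the total record count:

```bash
# Hypothetical sample input, one name per line
names='Peter
Lucia
Thomas'

# Prefix each record with its record number
printf '%s\n' "$names" | awk '{ print NR, $0 }'
# 1 Peter
# 2 Lucia
# 3 Thomas

# Use NR as a pattern to select only the second record
printf '%s\n' "$names" | awk 'NR == 2'
# Lucia

# After the last record, NR is the total number of records
printf '%s\n' "$names" | awk 'END { print NR }'
# 3
```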

### `FS`

`FS` represents the **field separator**. By default, fields are separated by
runs of whitespace (spaces and tabs). We can specify a different separator with
the `-F` flag, e.g. to separate by comma:

```bash
awk -F, '{print $1 }' list.txt
```
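
As a sketch of both ways to set the separator, the snippet below splits a hypothetical comma-separated sample (`csv`) first with `-F` and then by assigning `FS` in a `BEGIN` block, which runs before any input is read:

```bash
# Hypothetical comma-separated input
csv='Peter,89
Lucia,95'

# Set the separator with the -F flag
printf '%s\n' "$csv" | awk -F, '{ print $1 }'
# Peter
# Lucia

# Set the separator by assigning FS before the first record is read
printf '%s\n' "$csv" | awk 'BEGIN { FS = "," } { print $2 }'
# 89
# 95
```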