---
categories:
  - Programming Languages
tags:
  - shell
---

# Text manipulation

## Sorting strings: `sort`

If you have a text file containing one string per line, you can use `sort` to
print the lines in alphabetical order:

```bash
sort file.txt
```

Note that this does not change the file: `sort` only writes the sorted lines to
standard output. To save the result, redirect the output to a file:

```bash
sort file.txt > output.txt
```

### Options

- `-r`
  - reverse the sort order
- `-c`
  - check whether the file is already sorted. If it is not, `sort` reports the
    first out-of-order line and exits with a non-zero status
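
The options above can be tried out with a small sketch like the following. The file name `fruits.txt` and its contents are made up for illustration, and `LC_ALL=C` is set only so that the ordering does not depend on the local collation rules.

```bash
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
printf 'pear\napple\nmango\n' > "$tmpdir/fruits.txt"   # hypothetical sample data

# Plain sort and reverse sort; both only write to standard output.
ascending=$(LC_ALL=C sort "$tmpdir/fruits.txt")
descending=$(LC_ALL=C sort -r "$tmpdir/fruits.txt")

# -c only checks the order; a non-zero exit status means "not sorted".
if LC_ALL=C sort -c "$tmpdir/fruits.txt" 2>/dev/null; then
  sorted_status="sorted"
else
  sorted_status="not sorted"
fi

echo "$ascending"
echo "$descending"
echo "$sorted_status"

rm -r "$tmpdir"
```

Because `-c` communicates through its exit status, it is convenient to use in an `if` test as shown here.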

## Find and replace: `sed`

The `sed` programme can be used to implement find and replace procedures. In
`sed`, find and replace is handled by the substitute command, `s`:

```bash
sed 's/word/replacement word/' file.txt
```

However, this only replaces the first instance of `word` on each line. To
replace every instance, add the global flag `g` to the end of the substitution:
`s/word/replacement word/g`.

As `sed` is a stream editor, any changes you make with it only appear in
standard output; they are not saved to the file. To save them, redirect the
output to a new file (using `> output.txt`). This has the benefit of leaving the
original file untouched whilst storing the result permanently.

Alternatively, you can use the `-i` option, which edits the source file in place
instead of writing the result to standard output.

Note that this overwrites the original version of the file, which cannot be
recovered. If this is an issue, supply a backup suffix to `-i` like so:

```bash
sed -i.bak 's/word/replacement word/' file.txt
```

This creates `file.txt.bak` alongside `file.txt`. The backup holds the original
file as it was before the replacement was carried out.
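
To see the difference between a single substitution, a global substitution and an in-place edit with a backup, here is a minimal sketch. The sample text and the replacement string are invented for the example, and everything happens in a temporary directory.

```bash
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
printf 'word word\nanother word\n' > "$tmpdir/file.txt"   # hypothetical sample data

# Without the g flag, only the first match on each line is replaced.
first_only=$(sed 's/word/replacement/' "$tmpdir/file.txt")

# With the g flag, every match on each line is replaced.
all_matches=$(sed 's/word/replacement/g' "$tmpdir/file.txt")

# Edit the file in place, keeping the original as file.txt.bak.
sed -i.bak 's/word/replacement/g' "$tmpdir/file.txt"
edited=$(cat "$tmpdir/file.txt")
backup=$(cat "$tmpdir/file.txt.bak")

echo "$first_only"
echo "$all_matches"

rm -r "$tmpdir"
```

After the last `sed` call, `file.txt` holds the edited text and `file.txt.bak` still holds the original two lines.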

## Remove duplicates: `sort -u`

We can use `sort -u` to remove duplicate lines:

```bash
sort -u file.txt
```

`sort -u` sorts the input and removes duplicates in a single step. By contrast,
`uniq` on its own only collapses duplicate lines that are adjacent, so its input
has to be sorted first.
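
The following sketch compares the two approaches on a small, made-up input in which the duplicates are not adjacent. `LC_ALL=C` is an assumption added only to keep the ordering predictable.

```bash
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
printf 'b\na\nb\na\n' > "$tmpdir/file.txt"   # hypothetical sample data

# sort -u sorts and drops duplicates in one step.
deduped=$(LC_ALL=C sort -u "$tmpdir/file.txt")

# uniq alone sees no adjacent duplicates here, so nothing is removed.
uniq_only=$(uniq "$tmpdir/file.txt")

# Sorting first makes the duplicates adjacent, so uniq can collapse them.
sorted_then_uniq=$(LC_ALL=C sort "$tmpdir/file.txt" | uniq)

echo "$deduped"

rm -r "$tmpdir"
```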

## Split a large file into multiple smaller files: `split`

Suppose you have a file containing 1000 lines and you want to break it up into
five separate files, each containing 200 lines. You can use `split` to
accomplish this, like so:

```bash
split -l 200 big-file.txt new-file-
```

`split` names the five resulting files by appending an alphabetical suffix to
the prefix given as the last argument:

- `new-file-aa`
- `new-file-ab`
- `new-file-ac`
- `new-file-ad`
- `new-file-ae`

If you would rather have numeric suffixes, use the option `-d` (supported by GNU
`split`). You can also split a file by size in bytes, using the option `-b` and
specifying the size of each constituent file.
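
As a rough end-to-end sketch, the script below generates a 1000-line file with `seq`, splits it by line count and then by byte size, and counts the pieces. The `new-file-` and `chunk-` prefixes and the 1024-byte size are arbitrary choices for the example.

```bash
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
seq 1 1000 > "$tmpdir/big-file.txt"   # 1000 lines, one number per line

# Split into pieces of 200 lines each, named new-file-aa, new-file-ab, ...
split -l 200 "$tmpdir/big-file.txt" "$tmpdir/new-file-"
file_count=$(ls "$tmpdir"/new-file-* | wc -l | tr -d ' ')
first_file_lines=$(wc -l < "$tmpdir/new-file-aa" | tr -d ' ')

# Split the same file into pieces of at most 1024 bytes each.
split -b 1024 "$tmpdir/big-file.txt" "$tmpdir/chunk-"
chunk_count=$(ls "$tmpdir"/chunk-* | wc -l | tr -d ' ')

echo "$file_count files of $first_file_lines lines; $chunk_count byte-sized chunks"

rm -r "$tmpdir"
```

Note that splitting by bytes can cut a line in half at a chunk boundary, so `-l` is usually the safer choice for line-oriented text.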

## Merge multiple files into one with `cat`

We can use `cat` to read multiple files at once and redirect the combined output
to a new file:

```bash
cat file_a.txt file_b.txt file_c.txt > merged-file.txt
```
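
A quick way to convince yourself of the ordering is a sketch like this one, which uses three one-line files with made-up contents in a temporary directory:

```bash
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
printf 'a\n' > "$tmpdir/file_a.txt"   # hypothetical sample data
printf 'b\n' > "$tmpdir/file_b.txt"
printf 'c\n' > "$tmpdir/file_c.txt"

# The files are concatenated in the order they are listed.
cat "$tmpdir/file_a.txt" "$tmpdir/file_b.txt" "$tmpdir/file_c.txt" \
  > "$tmpdir/merged-file.txt"
merged=$(cat "$tmpdir/merged-file.txt")

echo "$merged"

rm -r "$tmpdir"
```

Keep in mind that `>` overwrites the target file if it already exists; use `>>` to append to it instead.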

## Count lines, words, etc: `wc`

To count the lines, words and bytes in a file:

```bash
wc file.txt
```

The command prints three numbers, in this order: lines, words, bytes.

You can use options to get just one of the numbers: `-l` (lines), `-w` (words),
`-c` (bytes).
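
For use in scripts, a minimal sketch might capture each count in a variable using `-l`, `-w` and `-c` (the byte count option). The sample file is invented. Reading from standard input keeps `wc` from printing the file name, and `tr` strips the padding that some implementations add.

```bash
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
printf 'one two\nthree\n' > "$tmpdir/file.txt"   # hypothetical sample data

# Capture each count separately as a bare number.
lines=$(wc -l < "$tmpdir/file.txt" | tr -d ' ')
words=$(wc -w < "$tmpdir/file.txt" | tr -d ' ')
bytes=$(wc -c < "$tmpdir/file.txt" | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes"

rm -r "$tmpdir"
```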