aws: opensearch
parent
57e14c4266
commit
57f8baeacd
3 changed files with 198 additions and 0 deletions
190
DevOps/AWS/OpenSearch.md
Normal file
@ -0,0 +1,190 @@
---
categories:
  - DevOps
tags: [AWS]
---

# OpenSearch

## Background

OpenSearch is an open-source search engine that AWS forked from Elasticsearch. AWS offers it as a managed service, Amazon OpenSearch Service, which was formerly known as "Amazon Elasticsearch Service".

It has many features, but its core use is to create searchable indices of a given content domain, such as a website or content management system. This enables quick search and retrieval of documents without running expensive database queries.

## Key concepts

We will introduce the main concepts using the example of a company intranet for which we want to create a searchable index. The intranet comprises hundreds of pages. Each page has metadata in the following form:

```json
{
  "title": "Internal News",
  "author": "Jane Doe",
  "published_date": "2023-11-01T00:00:00Z",
  "tags": ["news", "internal"],
  "categories": ["communication"],
  "content": "Today's internal news and updates are..."
}
```

### Create domain

An OpenSearch domain is the AWS term for a managed OpenSearch **cluster**, together with the settings, instance types, instance counts and storage that you specify for it. Each domain corresponds to one cluster.

The domain provides network **endpoints** that you send requests to. Typical requests:

- ingest data
- index data
- run a search query and return matches

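As a rough sketch, the typical requests listed above correspond to REST calls against the domain endpoint. The endpoint URL and the `build_request` helper below are hypothetical and only for illustration; the `_doc`, `_bulk` and `_search` paths are standard OpenSearch REST API paths.

```python
from typing import Tuple
from urllib.parse import quote

# Hypothetical endpoint; the real one is shown on the domain's page in the AWS console.
DOMAIN_ENDPOINT = "https://search-example.eu-west-1.es.amazonaws.com"

# Typical operations mapped to (HTTP method, path template).
OPERATIONS = {
    "index_document": ("PUT", "/{index}/_doc/{doc_id}"),
    "bulk_ingest": ("POST", "/_bulk"),
    "search": ("GET", "/{index}/_search"),
}


def build_request(operation: str, index: str = "", doc_id: str = "") -> Tuple[str, str]:
    """Return the HTTP method and full URL for one of the typical operations."""
    method, template = OPERATIONS[operation]
    # Percent-encode the index name and document ID so they are safe in a URL path.
    path = template.format(index=quote(index, safe=""), doc_id=quote(doc_id, safe=""))
    return method, DOMAIN_ENDPOINT + path


print(build_request("search", index="intranet_pages"))
```

Requests to a real domain also need to be authorised (for example with signed requests or basic auth, depending on how the domain's access policy is configured), which this sketch leaves out.
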
#### Clusters and nodes

A cluster is the highest level of organisation in OpenSearch and contains your indexed data. It processes all search queries and handles tasks such as indexing, searching, and managing documents.

![](/_img/opensearch-architecture.drawio.svg)

A cluster comprises **nodes**. Nodes are individual servers that each hold part of the cluster's data. Each node participates in indexing and searching the cluster's data.

- This distributed architecture helps balance the load and ensure high availability.
- Data can be replicated across nodes, making the system resilient against data loss.
- You can add more nodes to the cluster to handle increased data and traffic, making the system adaptable to changing needs.

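Replication is configured per index rather than per cluster. The sketch below builds index settings using the standard `number_of_shards` and `number_of_replicas` settings (shards, the units into which an index is split across nodes, are not otherwise covered in this note). It also checks whether every copy can be placed, given that a replica is never allocated on the same node as its primary. The helper names are hypothetical.

```python
def index_settings(primary_shards: int, replicas: int) -> dict:
    """Build the `settings` part of a create-index request body."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Each primary shard has `replicas` additional copies."""
    return primary_shards * (1 + replicas)


def all_copies_allocatable(node_count: int, replicas: int) -> bool:
    """Copies of the same shard must live on different data nodes."""
    return node_count >= replicas + 1


print(index_settings(primary_shards=2, replicas=1))
print(total_shard_copies(primary_shards=2, replicas=1))
print(all_copies_allocatable(node_count=1, replicas=1))
```

With a single node, replicas stay unassigned, so losing that node loses the data; this is why replication only helps once the cluster has at least two data nodes.
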
### Define index and mappings

Once the domain has been created, the next step is to create an index for the data, say `intranet_pages`. An index is a collection in which the data is stored, similar to a database. Each entry in this collection is a **document**.

Our index will store each webpage as a document.

We specify the fields that we want to index, and their types, using an index mapping. For instance, we may not want to index all the metadata for each page, preferring only a subset of the properties. In this example we will index all of the fields. We would achieve this with the following:

```json
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "author": { "type": "keyword" },
      "published_date": { "type": "date" },
      "tags": { "type": "keyword" },
      "categories": { "type": "keyword" },
      "content": { "type": "text" }
    }
  }
}
```

The mapping is used in the following scenarios:

- storing data:
  - when a new document is added to the `intranet_pages` index, its fields are indexed according to the types defined in the mapping
  - a document does not need to contain every mapped field, but by default a value that cannot be parsed as the mapped type (for example, an invalid date) causes the document to be rejected
- searching data:
  - when executing searches, OpenSearch uses the mapping to understand the type of data in each field and to run the query accordingly
- assessing relevance:
  - proper mappings allow OpenSearch to accurately score and rank search results based on relevance

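A mapping is usually supplied when the index is created, as the body of a `PUT /<index>` request. The following is a minimal sketch that assembles that request for the example page fields; `create_index_request` and `FIELD_TYPES` are hypothetical names, and actually sending the request to the domain is left out.

```python
import json

INDEX_NAME = "intranet_pages"

# Field name -> OpenSearch field type for the example page metadata.
FIELD_TYPES = {
    "title": "text",
    "author": "keyword",
    "published_date": "date",
    "tags": "keyword",
    "categories": "keyword",
    "content": "text",
}


def create_index_request(index: str, field_types: dict) -> tuple:
    """Return (method, path, JSON body) for creating an index with a mapping."""
    body = {
        "mappings": {
            "properties": {name: {"type": kind} for name, kind in field_types.items()}
        }
    }
    return "PUT", f"/{index}", json.dumps(body, indent=2)


method, path, body = create_index_request(INDEX_NAME, FIELD_TYPES)
print(method, path)
print(body)
```

`text` fields are analysed for full-text search, whereas `keyword` fields are stored as exact values, which suits fields such as `author` and `tags` that are filtered on rather than searched within.
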
### Ingest data (bulk import and/or scraper)

To populate the index that we defined with the mapping, we need some mechanism for collecting the page metadata in a form that matches the mapping. This could be a crawler or scraper, implemented for example as a Lambda function. It is a key part of the ingestion process.

Alternatively, the data could be bulk imported in a format that matches the mapping.

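For a bulk import, the `_bulk` API expects newline-delimited JSON: one action line followed by one document line per page. Here is a minimal sketch of building such a payload; the `to_bulk_payload` helper and the sequential document IDs are assumptions for illustration (a real crawler might derive the ID from the page URL instead).

```python
import json


def to_bulk_payload(index: str, pages: list) -> str:
    """Serialise pages as newline-delimited JSON for the `_bulk` API.

    Each page becomes two lines: an action line naming the target index
    and document ID, followed by the document itself.
    """
    lines = []
    for doc_id, page in enumerate(pages, start=1):
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps(page))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


pages = [
    {
        "title": "Internal News",
        "author": "Jane Doe",
        "published_date": "2023-11-01T00:00:00Z",
        "tags": ["news", "internal"],
        "categories": ["communication"],
        "content": "Today's internal news and updates are...",
    }
]

print(to_bulk_payload("intranet_pages", pages))
```

The resulting string would be sent as the body of `POST /_bulk` with an `application/x-ndjson` content type.
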
### Querying and searching

Having established the crawler (and some kind of search interface), we can now run queries against the OpenSearch domain.

A basic query is structured as follows:

```sh
GET /intranet_pages/_search
```

```json
{
  "query": {
    "match": {
      "content": "project"
    }
  }
}
```

Here we search the `content` field to find pages that contain the word "project".

Example of the data returned:

```json
{
  "hits": {
    "total": { "value": 100 },
    "hits": [
      {
        "_index": "intranet_pages",
        "_id": "1",
        "_score": 1.2,
        "_source": {
          "title": "Project ABC Launch",
          "author": "John Doe",
          "published_date": "2023-01-01T00:00:00Z",
          "tags": ["project", "launch"],
          "content": "Details about the launch of Project ABC..."
        }
      }
      // Additional results here
    ]
  }
}
```

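A search interface typically only needs a few fields from each hit. The sketch below pulls the title and relevance score out of a response shaped like the example; `summarise_hits` is a hypothetical helper and the `response` dictionary is a trimmed copy of that example.

```python
def summarise_hits(response: dict) -> list:
    """Return (title, score) pairs for each hit, in the order returned."""
    return [
        (hit["_source"]["title"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]


# Trimmed copy of the example search response.
response = {
    "hits": {
        "total": {"value": 100},
        "hits": [
            {
                "_index": "intranet_pages",
                "_id": "1",
                "_score": 1.2,
                "_source": {"title": "Project ABC Launch", "author": "John Doe"},
            }
        ],
    }
}

print(response["hits"]["total"]["value"])  # number of matching documents
print(summarise_hits(response))
```

Note that `hits.total.value` counts all matching documents, whereas the `hits.hits` array holds only the current page of results (10 by default, controlled by the `size` parameter).
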
## Search patterns

Below are further examples of commonly used search patterns.

### Multiple conditions

Find documents authored by Jane Doe that contain the word "meeting". Every clause under `must` has to match, which makes it a boolean AND:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "meeting" } },
        { "match": { "author": "Jane Doe" } }
      ]
    }
  }
}
```

Find documents that contain either the word "meeting" or the word "project" in their content. At least one clause under `should` has to match here, which makes it a boolean OR:

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "content": "meeting" } },
        { "match": { "content": "project" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```

`minimum_should_match` specifies the minimum number of `should` clauses that a document must match.

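When queries are assembled in application code, small builder functions keep the nesting manageable. This is a sketch only; `match`, `all_of` and `any_of` are hypothetical helper names that produce request bodies with the same shape as the two `bool` examples.

```python
def match(field: str, text: str) -> dict:
    """Build a single `match` clause."""
    return {"match": {field: text}}


def all_of(*clauses: dict) -> dict:
    """Boolean AND: every clause must match."""
    return {"query": {"bool": {"must": list(clauses)}}}


def any_of(*clauses: dict, minimum: int = 1) -> dict:
    """Boolean OR: at least `minimum` of the clauses must match."""
    return {
        "query": {
            "bool": {"should": list(clauses), "minimum_should_match": minimum}
        }
    }


and_query = all_of(match("content", "meeting"), match("author", "Jane Doe"))
or_query = any_of(match("content", "meeting"), match("content", "project"))

print(and_query)
print(or_query)
```
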
### Query by date ranges

Find pages published on or after a certain date (`gte` means "greater than or equal to"):

```json
{
  "query": {
    "range": {
      "published_date": {
        "gte": "2023-01-01T00:00:00Z"
      }
    }
  }
}
```
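
In application code the cut-off date usually starts out as a `datetime` rather than a string. The sketch below formats it the same way as the `published_date` values in the example documents; `published_since` is a hypothetical helper and should be given a timezone-aware `datetime`, since a naive one would be interpreted as local time.

```python
from datetime import datetime, timezone


def published_since(moment: datetime) -> dict:
    """Build a range query for pages published on or after `moment`."""
    # Normalise to UTC and format like the timestamps in the example documents.
    timestamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"query": {"range": {"published_date": {"gte": timestamp}}}}


query = published_since(datetime(2023, 1, 1, tzinfo=timezone.utc))
print(query)
```
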
4
_img/opensearch-architecture.drawio.svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 13 KiB |
4
_img/opensearch.drawio.svg
Normal file
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 13 KiB |