How to Model Relationships Between Documents in Elasticsearch Using Join

By Opster Expert Team

Updated: Mar 2, 2023

This article is part 3 of a three-part series on modeling relationships between documents in Elasticsearch.

Background and Overview

Elasticsearch has several methods for defining relationships between documents. Article 1 discussed object-type fields, and article 2 discussed nested fields. This article looks at a third method, the join data type, which establishes parent-child relationships between documents in the same index. Each relation is declared as a parent name and a child name in the relations section of the join field's mapping, which defines the full set of relations among the documents in that index.

Uses

The join data type allows users to establish parent-child relationships between documents. This technique allows users to utilize distinct Elasticsearch documents for various types of data. For example, group documents could be designated as parents to event documents, indicating which event each group hosts. This can also be used to look for local groups hosting events nearby or groups hosting Elasticsearch-related events.

Parent and child documents are indexed as separate documents in the same index, and their relationship is declared once in the mapping of the join field.

Use the has_parent or has_child queries at search time to take the other side of the relationship into account. Both queries are covered later in this article.

Use separate Elasticsearch documents and define parent-child relationships between them if your nested documents grow too large.

How to use the join field type

To use the join field type with parent-child relationships, two things are required:

  • Creation of a field with a join type.
  • Adding extra details about the relationship using the relations object (for example, author-book relationship in the current context).

In the following example of defining parent/child relations, “join_field” is the name of the field of type join, and the “relations” object defines a single relation in which “author” is the parent of “book.” The following request creates the index with this explicit mapping:

PUT my-index
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "join_field": {
        "type": "join",
        "relations": {
          "author": "book"
        }
      }
    }
  }
}

Once the index has been created with this mapping, two kinds of documents can be indexed: one representing the author (parent), and the other representing the book (child).

To index a document with a join field, the source must contain the name of the relation and, for child documents, the ID of the parent. The example below creates two parent documents in the author relation:

PUT my-index/_doc/1?refresh
{
  "id": "1",
  "name": "John Stefan",
  "join_field": {
    "name": "author"
  }
}

PUT my-index/_doc/2?refresh
{
  "id": "2",
  "name": "Sandy Naily",
  "join_field": {
    "name": "author"
  }
}

The documents above are author documents. 

When indexing parent documents, users have the option to shorten relation names rather than enclosing these in the standard object notation, as follows:

PUT my-index/_doc/1?refresh
{
  "id": "1",
  "name": "John Stefan",
  "join_field": "author"
}

PUT my-index/_doc/2?refresh
{
  "id": "2",
  "name": "Sandy Naily",
  "join_field": "author"
}

The relation name and the parent document’s ID must be included in the _source when indexing a child, and the request must be routed with the parent’s ID. The example below indexes two child documents of author 1:

PUT my-index/_doc/3?routing=1&refresh 
{
  "id": "3",
  "name": "Machine learning",
  "join_field": {
    "name": "book",
    "parent": "1"
  }
}

PUT my-index/_doc/4?routing=1&refresh
{
  "id": "4",
  "name": "Data Mining",
  "join_field": {
    "name": "book",
    "parent": "1"
  }
}
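The indexing rules above can be summarized in a small helper. This is a minimal Python sketch, not part of any Elasticsearch client: the function name build_join_request and its return shape are assumptions made for illustration, and the join field is assumed to be called "join_field" as in the mapping above. It only assembles the request line and body shown in the examples and sends nothing to a cluster.

```python
import json
from urllib.parse import urlencode


def build_join_request(index, doc_id, source, relation, parent_id=None):
    """Return (request_line, body) for indexing a document with a join field.

    Hypothetical helper: it mirrors the console requests in this article and
    assumes the join field is named "join_field".
    """
    body = dict(source)
    params = {}
    if parent_id is None:
        # Parent documents only need the relation name.
        body["join_field"] = {"name": relation}
    else:
        # Child documents need the relation name and the parent ID, and are
        # routed with the parent ID so they land on the parent's shard.
        body["join_field"] = {"name": relation, "parent": parent_id}
        params["routing"] = parent_id
    request_line = f"PUT {index}/_doc/{doc_id}"
    if params:
        request_line += "?" + urlencode(params)
    return request_line, body


author_line, author_body = build_join_request(
    "my-index", "1", {"id": "1", "name": "John Stefan"}, "author"
)
book_line, book_body = build_join_request(
    "my-index", "3", {"id": "3", "name": "Machine learning"}, "book", parent_id="1"
)

print(author_line)
print(json.dumps(author_body, indent=2))
print(book_line)
print(json.dumps(book_body, indent=2))
```

Keeping the routing decision inside one helper makes it harder to forget the routing value when a child document is written.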

Shard routing

Note that all documents in a single parent-child relationship must be on the same shard. The routing value is therefore mandatory when indexing a child document, and it should be the ID of the parent, which is why the child examples above use ?routing=1 for the children of document 1.

It is also possible to define multiple levels of parent/child relations, though this is not recommended given the overhead of joining across multiple levels. Multiple levels can be defined as shown below:
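Why a shared routing value keeps a family of documents together can be illustrated with a toy model. The Python sketch below assumes that a shard is chosen, conceptually, from a hash of the routing value modulo the number of primary shards. CRC32 is only a stand-in here (Elasticsearch uses its own hash function), and shard_for and the shard count of 5 are hypothetical.

```python
import zlib


def shard_for(routing_value, num_primary_shards):
    """Pick a shard from a routing value.

    Stand-in hash: Elasticsearch uses its own hash of the routing value;
    CRC32 is used here only so that the example is deterministic.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


num_shards = 5  # assumed shard count for the illustration

# A parent is routed by its own ID when no routing value is given.
parent_shard = shard_for("1", num_shards)

# Children indexed with ?routing=1 hash the same value as the parent,
# so they always resolve to the parent's shard.
routed_children = {doc_id: shard_for("1", num_shards) for doc_id in ("3", "4")}

# Routing by the child's own ID gives no such guarantee.
unrouted_children = {doc_id: shard_for(doc_id, num_shards) for doc_id in ("3", "4")}

print("parent shard:", parent_shard)
print("children routed by parent ID:", routed_children)
print("children routed by their own ID:", unrouted_children)
```

Whatever the real hash function is, the point of the sketch holds: identical routing values always resolve to the same shard, while different values may not.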

PUT my-index
{
  "mappings": {
    "properties": {
      "join_field": {
        "type": "join",
        "relations": {
          "author": [
            "book",
            "article"
          ],
          "book": "chapter"
        }
      }
    }
  }
}

Here, the author is the parent of the book and article, and the book is the parent of the chapter.

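When indexing multiple levels of parent/child documents, the entire tree must be on the same shard. To achieve this, route every descendant with the ID of its top-level ancestor: a chapter names its book as the parent in the join field, but its routing value is the ID of the book’s author.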
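For multi-level trees, the routing value has to be looked up by walking the lineage. The following Python sketch assumes the application keeps its own map from document ID to parent ID; root_ancestor, parent_of, and the chapter with ID 7 are hypothetical and exist only for this illustration.

```python
def root_ancestor(doc_id, parent_of):
    """Follow parent links until a document without a parent is reached."""
    current = doc_id
    seen = set()
    while current in parent_of:
        if current in seen:
            raise ValueError(f"cycle detected at document {current!r}")
        seen.add(current)
        current = parent_of[current]
    return current


# Hypothetical lineage: author 1 -> book 3 -> chapter 7
parent_of = {"3": "1", "7": "3"}

for doc_id in ("1", "3", "7"):
    # The routing value is the top-level ancestor, not the direct parent.
    print(doc_id, "-> routing =", root_ancestor(doc_id, parent_of))
```

In this model the chapter would be indexed with "parent": "3" in its join field but with the routing value 1.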

Has parent query

The has_parent query returns all child documents whose joined parent document matches the given query.

The index must contain a join field mapping in order to use the has_parent query.

The has_parent query has the following top-level parameters:

  • parent_type: a required parameter of type string that represents the parent relationship name mapped for the join field.
  • query: a required query object representing the query to run on parent documents of the parent_type relation. If a parent document matches the query, its child documents are returned.
  • score: an optional parameter of type boolean that specifies whether the relevance score of a matching parent document is aggregated into its child documents. The default value is false. When set to false, Elasticsearch disregards the parent document’s relevance score and assigns each child document a relevance score equal to the query’s boost, which defaults to 1. When set to true, the relevance score of the matching parent document is aggregated into the relevance scores of its child documents.
  • ignore_unmapped: an optional parameter of type boolean that specifies whether to ignore an unmapped parent_type and return no documents instead of an error. The default value is false, in which case Elasticsearch returns an error if the parent_type is unmapped. This parameter is useful when querying many indices, some of which might not include the parent_type.
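The effect of the score parameter can be pictured with a small in-memory model. This Python sketch is not how Elasticsearch executes the query; has_parent, name_is_john, and the score of 2.5 are hypothetical, and the documents are the four author and book documents indexed earlier.

```python
def has_parent(docs, parent_type, parent_query, score=False, boost=1.0):
    """Return (child_id, score) pairs for children whose parent matches.

    `parent_query` is a hypothetical callable that returns a relevance score
    for a matching parent document, or None when it does not match.
    """
    parent_scores = {}
    for doc in docs:
        if doc["join_field"]["name"] == parent_type:
            parent_score = parent_query(doc)
            if parent_score is not None:
                parent_scores[doc["id"]] = parent_score
    hits = []
    for doc in docs:
        parent_id = doc["join_field"].get("parent")
        if parent_id in parent_scores:
            # score=False: every child gets the boost; score=True: the
            # matching parent's score is carried over to the child.
            child_score = parent_scores[parent_id] if score else boost
            hits.append((doc["id"], child_score))
    return hits


docs = [
    {"id": "1", "name": "John Stefan", "join_field": {"name": "author"}},
    {"id": "2", "name": "Sandy Naily", "join_field": {"name": "author"}},
    {"id": "3", "name": "Machine learning", "join_field": {"name": "book", "parent": "1"}},
    {"id": "4", "name": "Data Mining", "join_field": {"name": "book", "parent": "1"}},
]


def name_is_john(doc):
    # Hypothetical scoring: 2.5 for a match, None for no match.
    return 2.5 if doc["name"] == "John Stefan" else None


print(has_parent(docs, "author", name_is_john))
print(has_parent(docs, "author", name_is_john, score=True))
```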

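The following example returns books whose joined parent author matches a term query on the author’s name, “John Stefan”. Because the name field is not explicitly mapped in my-index, it is dynamically mapped as text with a keyword sub-field, so the term query targets name.keyword to match the exact value:

GET /my-index/_search
{
  "query": {
    "has_parent": {
      "parent_type": "author",
      "query": {
        "term": {
          "name.keyword": {
            "value": "John Stefan"
          }
        }
      }
    }
  }
}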

Has child query

The has_child query returns all parent documents whose joined child documents match the given query. A join field mapping in the index is required to use the has_child query. For instance:

PUT /my-index
{
  "mappings": {
    "properties": {
      "join-field": {
        "type": "join",
        "relations": {
          "parent": "child"
        }
      },
      "tag": {
        "type": "keyword"
      }
    }
  }
}

The has_child query has the following top-level parameters:

  • type: a required parameter of type string, representing the child relationship name mapped for the join field.
  • query: a required query object representing the query to run on child documents of the given type. A parent document is returned if its child documents match the query.
  • ignore_unmapped: an optional parameter of type boolean that specifies whether to ignore an unmapped type and return no documents instead of an error. The default value is false, in which case Elasticsearch returns an error if the type is unmapped. This parameter is useful when querying many indices, some of which might not include the type.
  • max_children: an optional parameter of type integer, representing the maximum number of matching child documents allowed for a returned parent document. A parent document is excluded from the search results if it has more matching children than this limit.
  • min_children: an optional parameter of type integer, representing the minimum number of matching child documents required for a returned parent document. A parent document is excluded from the search results if it has fewer matching children than this limit.
  • score_mode: an optional parameter of type string that specifies how the scores of the matching child documents affect the relevance score of the root parent document. This parameter has five valid values: none, avg, max, min, and sum. The default is none.
    • none ignores the relevance scores of the matching child documents; the query assigns parent documents a score of 0.
    • avg uses the average relevance score of all matching child documents.
    • max uses the maximum relevance score of all matching child documents.
    • min uses the minimum relevance score of all matching child documents.
    • sum uses the sum of the relevance scores of all matching child documents.
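How score_mode, min_children, and max_children combine for a single parent can be sketched in a few lines of Python. This is an illustrative model of the rules described above, not Elasticsearch code; has_child_score is a hypothetical name, and None is used here to mean “no limit” or “parent excluded”.

```python
from statistics import mean


def has_child_score(child_scores, score_mode="none",
                    min_children=None, max_children=None):
    """Return the parent's score, or None if the parent is excluded."""
    count = len(child_scores)
    if count == 0:
        return None  # no matching children, so the parent does not match
    if min_children is not None and count < min_children:
        return None  # too few matching children
    if max_children is not None and count > max_children:
        return None  # too many matching children
    if score_mode == "none":
        return 0.0  # child scores are ignored
    if score_mode == "avg":
        return mean(child_scores)
    if score_mode == "max":
        return max(child_scores)
    if score_mode == "min":
        return min(child_scores)
    if score_mode == "sum":
        return sum(child_scores)
    raise ValueError(f"unknown score_mode: {score_mode!r}")


scores = [1.0, 2.0, 3.0]  # hypothetical scores of three matching children
for mode in ("none", "avg", "max", "min", "sum"):
    print(mode, has_child_score(scores, score_mode=mode))
print("min_children=4:", has_child_score(scores, min_children=4))
print("max_children=2:", has_child_score(scores, max_children=2))
```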

The following example returns parent documents whose joined child documents match the query:

GET /my-index/_search
{
  "query": {
    "has_child": {
      "type": "book",
      "query": {
        "match_all": {}
      },
      "max_children": 10,
      "min_children": 2,
      "score_mode": "min"
    }
  }
}

Sorting has parent or has child queries

The results of a has_child or has_parent query cannot be sorted using standard sort options.

Use a function_score query and sort by _score if you want to sort the returned documents according to a field in their child or parent documents. For instance, the following query sorts the returned documents based on the “likes” field in the child book documents.

GET /my-index/_search
{
  "query": {
    "has_child": {
      "type": "book",
      "query": {
        "function_score": {
          "script_score": {
            "script": "_score * doc['likes'].value"
          }
        }
      },
      "score_mode": "max"
    }
  }
}

Parent ID query

The parent ID query returns child documents joined to a particular parent document. 

The parent_id query has the following top-level parameters:

  • type: a required parameter of type string that represents the child relationship name mapped for the join field.
  • id: a required parameter of type string that represents the parent document ID. The query returns the child documents of this parent document.
  • ignore_unmapped: an optional parameter of type boolean that specifies whether to ignore an unmapped type and return no documents instead of an error. The default value is false, in which case Elasticsearch returns an error if the type is unmapped. This parameter is useful when querying many indices, some of which might not include the type.

The following parent_id query example returns the child documents for a parent document with an ID of 1:

GET /my-index/_search
{
  "query": {
    "parent_id": {
      "type": "child",
      "id": "1"
    }
  }
}

Children aggregation

Children aggregation is a single bucket aggregation that picks out child documents that have the specified type, as defined in the join field. Children aggregation has only one parameter, “type,” which represents the child type that should be chosen.

The author documents contain a name field and the book documents contain a title field. With the children aggregation, the author name buckets can be mapped to book title buckets in one request, even though the two fields exist in two different kinds of documents. The following search request connects the two:

GET /my-index/_search
{
  "aggs": {
    "top-tags": {
      "terms": {
        "field": "name.keyword",
        "size": 10
      },
      "aggs": {
        "to-book": {
          "children": {
            "type": "book"
          },
          "aggs": {
            "top-titles": {
              "terms": {
                "field": "title.keyword",
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}

The above search request will return top author names and the top book titles per author.
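The bucket structure this produces can be pictured with an in-memory model. The Python sketch below is only an approximation of what the nested terms and children aggregations compute; children_terms and its result keys are hypothetical, the book documents are assumed to carry a title field as described above, and the third book is invented for the example.

```python
from collections import Counter, defaultdict


def children_terms(docs, parent_field, child_type, child_field, size=10):
    """Bucket parents by a field, then bucket their children by another."""
    parents = {
        d["id"]: d[parent_field]
        for d in docs
        if "parent" not in d["join_field"] and parent_field in d
    }
    parent_buckets = Counter(parents.values())
    child_buckets = defaultdict(Counter)
    for d in docs:
        join = d["join_field"]
        if join["name"] == child_type and join.get("parent") in parents:
            child_buckets[parents[join["parent"]]][d[child_field]] += 1
    return {
        key: {
            "doc_count": count,
            "children_doc_count": sum(child_buckets[key].values()),
            "top_children": child_buckets[key].most_common(size),
        }
        for key, count in parent_buckets.most_common(size)
    }


docs = [
    {"id": "1", "name": "John Stefan", "join_field": {"name": "author"}},
    {"id": "2", "name": "Sandy Naily", "join_field": {"name": "author"}},
    {"id": "3", "title": "Machine learning", "join_field": {"name": "book", "parent": "1"}},
    {"id": "4", "title": "Data Mining", "join_field": {"name": "book", "parent": "1"}},
    # Hypothetical extra book so that the second author has a child.
    {"id": "5", "title": "Search Basics", "join_field": {"name": "book", "parent": "2"}},
]

result = children_terms(docs, "name", "book", "title")
for author, bucket in result.items():
    print(author, bucket)
```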

Parent/child inner hits

inner_hits can be added to a has_child or has_parent query so that the response includes, for each hit, the child or parent documents that caused it to match. First, create an index with a join field mapping:

PUT /my-index
{
  "mappings": {
    "properties": {
      "join_field": {
        "type": "join",
        "relations": {
          "parent": "child"
        }
      }
    }
  }
}

Then, index parent and child documents:

PUT my-index/_doc/1?refresh
{
  "number": 1,
  "join_field": "parent"
}

PUT my-index/_doc/2?routing=1&refresh
{
  "number": 1,
  "join_field": {
    "name": "child",
    "parent": "1"
  }
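}

Finally, run a has_child query with inner_hits to return the matching children alongside each parent:

POST my-index/_search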
{
  "query": {
    "has_child": {
      "type": "child",
      "query": {
        "match": {
          "number": 1
        }
      },
      "inner_hits": {}    
    }
  }
}

The above search request will return the following response:

{
  ...,
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my-index",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "number": 1,
          "join_field": "parent"
        },
        "inner_hits": {
          "child": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "my-index",
                  "_id": "2",
                  "_score": 1.0,
                  "_routing": "1",
                  "_source": {
                    "number": 1,
                    "join_field": {
                      "name": "child",
                      "parent": "1"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

Global ordinals

Global ordinals are used in the join field to expedite joins. You can read more about global ordinals here: Elasticsearch Global Ordinals and High Cardinality Fields.

Any modification to a shard necessitates the rebuilding of global ordinals. Rebuilding global ordinals for the join field takes longer if more parent ID values are stored in a shard.

Global ordinals for the join field are built eagerly by default. Disabling eager loading makes sense when writes are frequent and the join field is rarely queried:

PUT my-index
{
  "mappings": {
    "properties": {
      "join_field": {
        "type": "join",
        "relations": {
          "author": "book"
        },
        "eager_global_ordinals": false
      }
    }
  }
}

The following can be used to check the amount of heap utilized by global ordinals per parent relation:

Per-index:

GET _stats/fielddata?human&fields=join_field#author

Per-node per-index:

GET _nodes/stats/indices/fielddata?human&fields=join_field#author

Advantages and disadvantages of join field type

Advantages:

  • Since parent and child documents are separate, users can update parent and child documents independently. In nested documents, modification of a single object requires complete re-indexing of the entire document’s structure.

Disadvantages:

  • Queries are more expensive and memory-intensive than nested equivalents. Although parent and child documents are located in the same shard, they are not necessarily in the same Lucene block, so joins are not as quick as with nested documents. Aggregations can cross the relationship only through the dedicated children and parent aggregations.
  • Control over sorting and scoring is limited. Users can either filter on parent content or child documents, but not both.  

Parent-join limitations

  • Per index, only one join field mapping is permitted.
  • Documents for both parents and children must be indexed on the same shard. This means that the same routing value must be supplied when getting, deleting, or updating a child document.
  • There can be more than one child for an element, but only one parent.
  • An existing join field can have a new relation added.
  • A child can be added to an element that already exists, but only if the element is a parent.
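The routing limitation applies to every request that addresses a child document by ID. The Python sketch below only formats the corresponding console request lines for the get, update, and delete APIs; child_request_line is a hypothetical helper, and the IDs come from the book example earlier in the article.

```python
def child_request_line(action, index, doc_id, routing):
    """Format a console request line that carries the routing value."""
    templates = {
        "get": "GET {index}/_doc/{doc_id}?routing={routing}",
        "update": "POST {index}/_update/{doc_id}?routing={routing}",
        "delete": "DELETE {index}/_doc/{doc_id}?routing={routing}",
    }
    if action not in templates:
        raise ValueError(f"unsupported action: {action!r}")
    return templates[action].format(index=index, doc_id=doc_id, routing=routing)


# Book 3 was indexed with routing=1 (its parent's ID), so every later
# request for it has to carry the same routing value.
lines = [
    child_request_line(action, "my-index", "3", "1")
    for action in ("get", "update", "delete")
]
for line in lines:
    print(line)
```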

Notes

  • It is not advised to replicate a relational model utilizing multiple relation levels. During queries, each relation level adds overhead in terms of processing and memory. Instead, denormalize your data for improved search performance.
  • Users must always route child documents to the same shard as the parent document. The join field should not be used like joins in a relational database. In Elasticsearch, good performance comes from denormalizing your data into documents; each join field, has_child query, or has_parent query adds a significant cost to query performance.
  • The join field is only appropriate when there is a one-to-many relationship between two entities and one entity significantly outnumbers the other.
  • The join operation performed by the has_parent query makes it slower than other queries. Its performance degrades as the number of matching parent documents increases, and each has_parent query in a search can considerably lengthen query time.
  • The join operation performed by the has_child query makes it slower than other queries. Its performance degrades as the number of matching child documents pointing to unique parent documents increases, and each has_child query in a search can considerably lengthen query time. If query performance matters, use the has_child query only when you absolutely have to.
  • Heap memory is used by global ordinal mapping as a component of the field data cache. Aggregations on fields with high cardinality can consume significant memory and trigger the field data circuit breaker.

Summary

  • The join field type defines relations between documents in Elasticsearch: documents of one relation are declared as children of documents of another relation within the same index. This is helpful when documents or relationships are updated frequently. The relationship is specified in the relations object of the join field’s mapping.
  • Although they are routed to the same shard, children are stored as separate documents from their parents. Therefore, parent/child joins perform somewhat worse at query time than nested documents.
  • Because Elasticsearch keeps global ordinals for the join field in memory, parent/child mappings carry some additional memory overhead.
  • When it comes to large documents, updating a child document has no impact on the parent or any other children.
  • With Parent/Child, sorting and scoring can be challenging.