Fun With Elasticsearch's Children and Nested Documents_综合

原文地址：http://www.spacevatican.org/2012/6/3/fun-with-elasticsearch-s-children-and-nested-documents/

When you’re indexing data, the world is rarely as simple as each document existing in isolation. Sometimes, you’re better off denormalizing all data into the child documents. For example if you were modelling books, adding an author field to books could be a sensible choice (even if in the database that is your authoritative datasource the data is split into separateauthors and books table). It’s simple and you can easily construct queries on both attributes of the book and the author’s name.

That’s not always practical – there might be too much data in the parent document to duplicate it in each child document. If you had your typical blog/comment app then you wouldn’t want to repeat the entire content of the blog post in each comment as this would vastly increase the amount of indexed data several. Yet without that you can’t easily write queries to find comments on posts matching certain criteria (other than by doing a 2 step of process of first finding matching posts and then fetching comments with a certain post_id, which is often unwieldy or slow (or both)).

Another option is to place child documents inside the parent document, for example you can have documents of the form

 
      {
            "name": "A. N Author",  "biography": "A leading wordsmith",   "books": [  {
            "title": "My first book",  "genre": "scifi",  "publisher": "penguin"  }  ] }

One drawback here is that adding a child requires reindexing the entire document. Finding authors that have written books with a certain genre is easy but finding authors that have had a science fiction book published by penguin is harder.

If our index contains

 
      curl -XPUT localhost:9200/authors/author/1 -d'{
            "name": "Multi G. Enre",  "books": [  {
            "name": "Guns and lasers",  "genre": "scifi",  "publisher": "orbit"  },  {
            "name": "Dead in the night",  "genre": "thriller",  "publisher": "penguin"  }  ] }' curl -XPUT localhost:9200/authors/author/2 -d'{
            "name": "Alastair Reynolds",  "books": [  {
            "name": "Revelation Space",  "genre": "scifi",  "publisher": "penguin"  }  ] }'

then the most obvious query

 
      curl -XGET localhost:9200/authors/author/_search -d '{
            "query": {
            "filtered": {
            "query": {
          "match_all": {}},  "filter": {
            "and": [  {
          "term": {
          "books.publisher": "penguin"}},  {
          "term": {
          "books.genre": "scifi"}}  ]  }  }  } }''

finds both authors – you can’t express that your conditions on published andgenre need to match against the same book.

Nested Documents

ElasticSearch provides two things that help with this. The first is the concept of a nested document/query. This allows you to say that you are looking for authors where at least one book satisfies both of your criteria.

First you need to setup a mapping that says that the books field is going to be nested:

 
      curl -XPOST localhost:9200/authors/nested_author/_mapping -d '{
            "nested_author":{
            "properties":{
            "books": {
            "type": "nested"  }  }  } }'

If we then insert the same data as before into this new index then this query

 
  
   
     
      curl -XGET localhost:9200/authors/nested_author/_search -d ' {
            "query": {
            "filtered": {
            "query": {
          "match_all": {}},  "filter": {
            "nested": {
            "path": "books",  "query":{
            "filtered": {
            "query": {
           "match_all": {}},  "filter": {
            "and": [  {
          "term": {
          "books.publisher": "penguin"}},  {
          "term": {
          "books.genre": "scifi"}}  ]  }  }  }  }  }  }  } }'  
    

   
 

Here the nested filter allows you to run a query against the nested documents (ie the books) and filter authors by those that have at least one nested document matching the query. Thepath option tells us which part of the authors document this query applies to and then thequery option is a query to run against these nested documents. Unlike the previous query this requires that an individual book be found satisfying both requirements so only Alaistair Reynolds is returned

Parent & child

The other concept elasticsearch provides is that of a parent and child relationship between documents. The previous example can be reworked with authors as the parent documents and books as the child documents.

This time, index the authors separately from their books:

 
      curl -XPUT localhost:9200/authors/bare_author/1 -d'{
            "name": "Multi G. Enre" }'  curl -XPUT localhost:9200/authors/bare_author/2 -d'{
            "name": "Alastair Reynolds" }'

Then configure the mapping for the book type and say that its parent type is bare_author. You need to do this before you create any books.

 
      curl -XPOST localhost:9200/authors/book/_mapping -d '{
            "book":{
            "_parent": {
          "type": "bare_author"}  } }'

When we indexing books, you must then give the id of their parent (ie we supply the id of one of the previously created authors)

 
      curl -XPOST localhost:9200/authors/book/1?parent=2 -d '{
            "name": "Revelation Space",  "genre": "scifi",  "publisher": "penguin" }'  curl -XPOST localhost:9200/authors/book/2?parent=1 -d '{
            "name": "Guns and lasers",  "genre": "scifi",  "publisher": "orbit" }'  curl -XPOST localhost:9200/authors/book/3?parent=1 -d '{
            "name": "Dead in the night",  "genre": "thriller",  "publisher": "penguin" }'

Elasticsearch provides a has_child filter that does pretty much what is says on the tin: it selects parent documents with a least one child satisfying a certain query. This query then finds only Alastair Reynolds:

 
  
   
     
      curl -XPOST localhost:9200/authors/bare_author/_search -d '{
            "query": {
            "has_child": {
            "type": "book",  "query" : {
            "filtered": {
            "query": {
           "match_all": {}},  "filter" : {
            "and": [  {
          "term": {
          "publisher": "penguin"}},  {
          "term": {
          "genre": "scifi"}}  ]  }  }  }  }  } }'  
    

   
 

Solr 4.0 will apparently have the ability to do joins, although as far as I can tell this comes with some restrictions, in particular no joins if you are operated in a distributed environment. By limiting itself to parent/child type relationships elasticsearch makes life easier for itself: a child is always indexed in the same shard as its parent, sohas_child doesn’t have to do awkward cross shard operations.

Building lists

You can also use this to model user specific lists of shared global items – e.g. if you wanted items that a user had rated. In this case your child documents would represent the fact that a specific user had rated a specific post – they are nothing more than a user_id, post_id and a rating: a join table in relational database lingo.

Using a parent/child relationships and has_child allows you to easily find all the posts favourited by a user while enabling users to search through their favourites based on post content, date or any other of a post’s attributes or any of the child item properties. Adding an item to the rated items list is cheap – it requires only indexing a very smallrating item.

With these documents

 
      curl -XPOST localhost:9200/posts  curl -XPOST localhost:9200/posts/rating/_mapping -d '{  "rating":{  "_parent": {"type": "post"}  } }'  curl -XPUT localhost:9200/posts/post/1 -d '{ "title": "bolivia rated 4" }' curl -XPUT localhost:9200/posts/post/2 -d '{ "title": "bolivia rated 2" }' curl -XPUT localhost:9200/posts/post/3 -d '{ "title": "another country rated 4" }' curl -XPUT localhost:9200/posts/rating/1?parent=1 -d '{ "user_id": 1234, "rating": 4}' curl -XPUT localhost:9200/posts/rating/2?parent=2 -d '{ "user_id": 1234, "rating": 2}' curl -XPUT localhost:9200/posts/rating/3?parent=2 -d '{ "user_id": 4234, "rating": 5}' curl -XPUT localhost:9200/posts/rating/4?parent=3 -d '{ "user_id": 1234, "rating": 5}'

This query

 
  
   
     
      curl -XPOST localhost:9200/posts/post/_search -d '{
            "query": {
            "filtered": {
            "query": {
            "text": {
          "title": "bolivia"}  },  "filter":{
            "has_child": {
            "type": "rating",  "query" : {
            "filtered": {
            "query": {
           "match_all": {}},  "filter" : {
            "and": [  {
          "term": {
          "user_id": 1234}},  {
          "range": {
          "rating": {
          "gt" : 3}}}  ]  }  }  }  }  }  }  } }'  
    

   
 

finds only “bolivia rated 4” since that is the only post mentioning bolivia that has been rated over 3 by the user we are interested in. The top level query on title applies to the posts, wheres the query inside thehas_child filter describes conditions that the children must match (in this case they must belong to a specific user and have at least a certain rating).

Ordering

What has_child doesn’t let you do is order based on attributes of the children or return attributes of the child. If you wanted to order a user’s rated posts based on when they were rated or by decreasing rating then you can search against posts/rating directly but you might want to apply some search critera to the posts too. For example you might want to only find rated posts on a certain topic (still ordering by the rating the user gave). Withhas_child, you’re out of luck. Nested documents don’t help either.

As of 0.19.10 you can use the has_parent filter. This works almost exactly the same as has child but allows you to specify a query against the parent items instead. This query returns the ratings by user 1234, on posts whose title matches “bolivia”, in decreasing score order

 
  
   
     
      curl -XPOST localhost:9200/posts/rating/_search?pretty=true -d '{
            "query": {
            "filtered": {
            "query": {
          "match_all": {}},  "filter": {
            "and": [  {
          "term": {
          "user_id": 1234}},  {
            "has_parent": {
            "type": "post",  "query": {
            "term": {
          "title": "bolivia"}  }  }  }  ]  }  }  },  "sort": [  {
           "rating" : {
          "order" : "desc"} }  ] }'  
    

   
 

This returns the rating objects – you’d have to then fetch the corresponding posts with a separate query.

Faking it

If you’re stuck on an older version of elasticsearch, you can get most of the way there withtop_children. As the documentation says top_children first queries the child documents and then aggregates them into parent documents. In our example, this means that elasticsearch will first find the rating documents that match our query. Then it will match each rating to its parent post, aggregating duplicate post where they exist.

The fiddly bit with top children is that elasticsearch doesn’t know ahead of time how many documents it will lose when the aggregation happens. In this particular case it’s easy because two distinct ratings by the same user always correspond to two distinct posts, so we don’t need to bother with factor and incremental_factor settings because the aggregation phase never does anything. Similarly, score mode doesn’t matter either. If you need to provide an accurate count of the total number of results you just need to set factor big enough that the first sweep elasticsearch does of the child documents finds all of them. If you know that the user has 500 rated items on their list and you’re asking for the first 10 items, then a factor of 50 should do the trick. Factor need only be an upper bound – you don’t need to know exactly how many items the user has on their list (which might be awkward to work out without a separate elastic search query if the user is searching a specific subset of their ratings).

What you get after all this is a list of parent documents (posts) sorted by the query score of the children documents (ratings). To achieve the original goal of sorting the posts based on attributes of the children documents we just need to ensure that this query score has the right value. For example, have top_children query wrap acustom_score query so that you have control on what the score for each child is.

With the same documents in the index, this query returns the posts that user 1234 has rated, ordered by their rating:

 
      curl -XPOST localhost:9200/posts/_search -d '{
            "query": {
            "top_children":{
            "type": "rating",  "query": {
            "custom_score": {
            "script": "doc['\''rating'\''].value",  "query": {
            "filtered": {
            "query": {
          "match_all": {}},  "filter": {
          "term": {
          "user_id": 1234}}  }  }  }  }  }  } }'

We’re running a top_children query, so the first thing we need to do is to say what is the type of the children we’re considering (rating). Then we provide the query that finds those children. This is acustom_score query, wrapping a filtered query. The filtered query ensures that we find only ratings given by the user we’re interested in, and then thescript element makes the score of the rating document be rating itself, so that we get our posts sorted by rating. The funkiness with the backslashes is only because I’m trying to include a literal single quote in a shell friendly string delimited by single quotes – the actual json that we’re sending has just "script": "doc['rating'].value".