MongoDB and a mystery of long running queries

Indexes are critical in MongoDB for optimal query performance, but they have some secrets!

MongoDB and a mystery of long running queries

In one of the projects I’m involved in, MongoDB is used to store data for services. We were seeing some unexpected behavior in one part of the system, and we were able to narrow down the issue to be a long-running query against one of the collections. Due to the size of that collection, the query was taking up to 2.5 hours!

At first, it was obvious that we need an index of some sort to speed things up. However, after inspecting the query, it was obvious that the fields that were used for querying are the same fields that are used in an index that already exists in that collection. Yet, inspecting the query plan revealed that none of the indexes present in the collection were used. So what happened actually?

Example data

To better explain what was happening, I will use an example. I have MongoDB 5.0.5 Community running locally within a Docker container. I have imported sample data from this repo: https://github.com/neelabalan/mongodb-sample-dataset. For running sample queries, I will be using the movies collection from the sample_mflix database.

movies collection has more than 23000 movies and series, including information like: plot, genres and awards. For demo purposes, I would like to get all the series that have won at least one award. Therefore, the query I will be performing will be looking like this:

db.movies.find({type: "series", "awards.wins": 1})

Querying without indexes

Sample collection contains 25 documents meeting these criteria. Let’s have a closer look at the query execution plan which Mongo gives us, if we use the explain() function on the cursor:

{
  "explainVersion": "1",
  "queryPlanner": {
    "namespace": "sample_mflix.movies",
    "indexFilterSet": false,
    "parsedQuery": {
      "$and": [{ "awards.wins": { "$eq": 1 } }, { "type": { "$eq": "series" } }]
    },
    "maxIndexedOrSolutionsReached": false,
    "maxIndexedAndSolutionsReached": false,
    "maxScansToExplodeReached": false,
    "winningPlan": {
      "stage": "COLLSCAN",
      "filter": {
        "$and": [
          { "awards.wins": { "$eq": 1 } },
          { "type": { "$eq": "series" } }
        ]
      },
      "direction": "forward"
    },
    "rejectedPlans": []
  }
// omitted for brevity
}

This plan tells us that no indexes have been used for this query and whole collection has been scanned to find matching documents. This is obviously because we haven’t defined any indexes on the collection.

So let’s create a compound index on the fields which are used for this query, which should speed up the queries and ensure that there is no need to scan the whole collection:

db.movies.createIndex( { type: 1, "awards.wins": 1 } )

We can verify that the index has been created correctly by running:

db.movies.getIndexes()

In the output we can see two indexes - default index on the _id field and index we created on type and award.wins fields:

[
  { "v": 2, "key": { "_id": 1 }, "name": "_id_" },
  {
    "v": 2,
    "key": { "type": 1, "awards.wins": 1 },
    "name": "type_1_awards.wins_1"
  }
]

If we run the query with explain() again, the winning plan clearly indicates that the new index has been utilized:

{
    "explainVersion": "1",
    "queryPlanner": {
        "namespace": "sample_mflix.movies",
        "indexFilterSet": false,
        "parsedQuery": {
            "$and": [{
                    "awards.wins": {
                        "$eq": 1
                    }
                }, {
                    "type": {
                        "$eq": "series"
                    }
                }
            ]
        },
        // omitted for brevity
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {
                "stage": "IXSCAN",
                "keyPattern": {
                    "type": 1,
                    "awards.wins": 1
                },
                "indexName": "type_1_awards.wins_1"
                // omitted for brevity
            }
        },
        "rejectedPlans": []
    }
    // omitted for brevity
}

There is a IXSCAN in the stage meaning an index has been used for the query. Following properties indicate it was the index that we have just created.

Case insensitive index

Things are getting a bit more interesting, when the index is created with a custom collation and strength. In a nutshell, such index can be a “case insensitive” index. You can find out more about case insensitive indexes here: https://docs.mongodb.com/manual/core/index-case-insensitive/. According to the documentation:

Using a case insensitive index does not affect the results of a query, but it can increase performance.

So let’s drop our index:

db.movies.dropIndexes("type_1_awards.wins_1")

And create new index, using the same fields as previously, but now let’s also specify collation settings:

db.movies.createIndex( { type: 1, "awards.wins": 1 }, { collation: { locale: 'en', strength: 1 } } )

There are more parameters that can be leveraged, but for case insensitivity strength parameter is important. If you are interested, you can find out more here: https://docs.mongodb.com/manual/reference/collation/#std-label-collation-document-fields. Long story short, when strength is set to 1 or 2, index is considered case insensitive.

If we run our query again, we will see that it did not leverage the index that was just created, but instead it scanned the whole collection again:

{
    "explainVersion": "1",
    "queryPlanner": {
        "namespace": "sample_mflix.movies",
        "indexFilterSet": false,
        "parsedQuery": {
            "$and": [{
                    "awards.wins": {
                        "$eq": 1
                    }
                }, {
                    "type": {
                        "$eq": "series"
                    }
                }
            ]
        },
        // omitted for brevity
        "winningPlan": {
            "stage": "COLLSCAN",
            "filter": {
                "$and": [{
                        "awards.wins": {
                            "$eq": 1
                        }
                    }, {
                        "type": {
                            "$eq": "series"
                        }
                    }
                ]
            },
            "direction": "forward"
        },
        "rejectedPlans": []
    },
    // omitted for brevity
}

According to the documentation, in order to take advantage of indexes with collation:

You must specify the same collation at the query level in order to use the index-level collation.

When we add collation to our query:

db.movies.find({type: "series", "awards.wins": 1}).collation({locale: "en", strength: 1}).explain()

From the explain() output, we can see that collation has been applied on the query and that the index has been leveraged:

{
    "explainVersion": "1",
    "queryPlanner": {
        "namespace": "sample_mflix.movies",
        "indexFilterSet": false,
        "parsedQuery": {
            "$and": [{
                    "awards.wins": {
                        "$eq": 1
                    }
                }, {
                    "type": {
                        "$eq": "series"
                    }
                }
            ]
        },
        "collation": {
            "locale": "en",
            "caseLevel": false,
            "caseFirst": "off",
            "strength": 1,
            "numericOrdering": false,
            "alternate": "non-ignorable",
            "maxVariable": "punct",
            "normalization": false,
            "backwards": false,
            "version": "57.1"
        },
       // omitted for brevity
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {
                "stage": "IXSCAN",
                "keyPattern": {
                    "type": 1,
                    "awards.wins": 1
                },
                "indexName": "type_1_awards.wins_1",
                "collation": {
                    "locale": "en",
                    "caseLevel": false,
                    "caseFirst": "off",
                    "strength": 1,
                    "numericOrdering": false,
                    "alternate": "non-ignorable",
                    "maxVariable": "punct",
                    "normalization": false,
                    "backwards": false,
                    "version": "57.1"
                },
               // omitted for brevity
            }
        },
        "rejectedPlans": []
    },
    "command": {
        "find": "movies",
        "filter": {
            "type": "series",
            "awards.wins": 1
        },
        "collation": {
            "locale": "en",
            "strength": 1
        },
        "$db": "sample_mflix"
    }
    // omitted for brevity
}

Conclusion

Creating and using indexes is essential when working with any database engine, not only MongoDB, especially when the datasets are big.

It is important to remember that indexes in MongoDB can be defined as case insensitive, which can bring more performance gains compared to regular indexes.

On the flip side, we have to remember, that in order to fully utilize such indexes, collation on the query has to match the collation of the index. Otherwise, queries might be falling back to whole collection scan, which can lead to unbelievably long queries.

Happy querying!

Cover image by Morgane Le Breton from Unsplash