Introduction

In this post we’ll create a brand new MongoDB Atlas cluster and then create a search index that uses a custom analyzer to strip HTML out of the collection field we are searching.

Problem

By default, Atlas Search indexes string fields with its standard analyzer, which tokenizes pretty much every word it finds, including the names of HTML tags and attributes. This leads to higher disk usage, but more importantly it pollutes search results with HTML tags and attributes that you likely don’t want to be searching.

Solution

Atlas Search allows you to specify custom analyzers for the fields you are indexing. A Google search turned up the htmlStrip character filter, but it took me a few minutes to understand how to use it. With this filter enabled, all HTML tags are stripped from the text before it is indexed, reducing disk usage and improving search results.
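
To get a feel for what the character filter changes, here is a rough local approximation in Python. It is only a sketch: it uses the standard library's HTMLParser and a simple regex instead of the real htmlStrip filter and standard tokenizer, and the sample string is made up for illustration, so the exact tokens Atlas Search produces may differ.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text between tags, dropping tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    # Approximates the effect of a character filter that removes HTML markup.
    parser = TextOnly()
    parser.feed(raw)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


def tokenize(text):
    # Crude stand-in for a word tokenizer: runs of letters, digits, underscores.
    return re.findall(r"\w+", text)


sample = '<p class="intro">Hello <b>world</b></p>'  # hypothetical sample text
without_filter = tokenize(sample)
with_filter = tokenize(strip_html(sample))

print(without_filter)  # tag and attribute names show up as tokens
print(with_filter)     # only the visible text remains
```

Without the stripping step, words like "class" and "intro" end up as tokens alongside the actual content, which is the pollution described above.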

Creating a Cluster From Scratch

Create Project

image

Create a database

image

Select Free Shared Instance

image

image

image

Add Data

image

Create Dummy Collection For Testing

image

Create Search Index

image

JSON Editor

The visual editor does not allow adding custom analyzers, so we’ll need to use the JSON editor.

image

Paste the Following JSON

The analyzers array defines a new analyzer that can be used by the index. We then specifically attach that analyzer to the body field. This ensures that other fields that don’t need this analyzer continue using the default.

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "body": {
        "analyzer": "htmlStrippingAnalyzer",
        "searchAnalyzer": "htmlStrippingAnalyzer",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [
        {
          "type": "htmlStrip"
        }
      ],
      "name": "htmlStrippingAnalyzer",
      "tokenFilters": [],
      "tokenizer": {
        "type": "standard"
      }
    }
  ]
}
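
Since the field mapping refers to the analyzer by name, a typo in either place is an easy mistake to make. The sketch below is a small local sanity check in Python, not part of Atlas itself: the helper name undefined_analyzers is made up, and it assumes each field maps to a single object, as in the JSON above.

```python
# Same definition as the JSON above, written as a Python dict.
INDEX_DEFINITION = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "body": {
                "analyzer": "htmlStrippingAnalyzer",
                "searchAnalyzer": "htmlStrippingAnalyzer",
                "type": "string",
            }
        },
    },
    "analyzers": [
        {
            "charFilters": [{"type": "htmlStrip"}],
            "name": "htmlStrippingAnalyzer",
            "tokenFilters": [],
            "tokenizer": {"type": "standard"},
        }
    ],
}


def undefined_analyzers(definition):
    """Return custom analyzer names that fields reference but nothing defines."""
    defined = {a["name"] for a in definition.get("analyzers", [])}
    missing = set()
    for field in definition["mappings"].get("fields", {}).values():
        for key in ("analyzer", "searchAnalyzer"):
            name = field.get(key)
            # Built-in analyzers such as lucene.standard need no definition.
            if name and not name.startswith("lucene.") and name not in defined:
                missing.add(name)
    return missing


print(undefined_analyzers(INDEX_DEFINITION))  # empty set when names line up
```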

image

image

image

Run Some Tests

Now that you’ve successfully created the index with the htmlStrip analyzer, create some sample documents and confirm it behaves as expected.

Insert a Test Document

{
  "body": "<body><a href='https://devtails.xyz'>this</a> should exclude html</body>"
}

image

Run a Test Query

Try searching for “body” and notice that the test document doesn’t show up in the results, because “body” only appears in it as a tag name.

image

Now search for something that is part of the text like “html” and confirm that Atlas Search returns the document.

image
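
If you would rather run the same two searches from code than from the Atlas UI, the query can be expressed as an aggregation pipeline with a $search stage. The Python sketch below only builds the pipeline. It assumes the index was saved under the name "default" and that the field is body; the helper name search_pipeline and the limit of 10 are made up for illustration.

```python
def search_pipeline(term, path="body", index="default"):
    """Build an aggregation pipeline with a single $search text stage."""
    return [
        # "default" is an assumed index name; use whatever you named yours.
        {"$search": {"index": index, "text": {"query": term, "path": path}}},
        # Arbitrary cap so a test query never returns a huge result set.
        {"$limit": 10},
    ]


tag_query = search_pipeline("body")   # should match nothing once tags are stripped
text_query = search_pipeline("html")  # should match the test document
print(text_query)
```

With a driver such as PyMongo, a pipeline like this would be passed to the collection's aggregate method.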

Conclusion

In this post we’ve covered adding a custom analyzer to an Atlas Search index to strip out HTML tags. The documentation can be a bit difficult to parse at times, but once you find the right page, Atlas Search generally supports what you want.