How to Search HTML Using MongoDB Atlas Search

Introduction

In this post we’ll create a brand new MongoDB Atlas cluster and a search index that uses a custom analyzer to strip HTML out of the collection field we are searching.

Problem

By default, Atlas Search indexes string fields with its standard analyzer, which treats HTML markup like any other text and tokenizes tag names and attribute values along with the actual content. This leads to higher disk usage, but more importantly it pollutes search results with HTML tags and attributes that you likely don’t want to be searching.

Solution

Atlas Search allows you to specify custom analyzers for the fields you are indexing. A Google search turned up the htmlStrip character filter, but it took me a few minutes to understand how to use it. With this filter enabled, HTML tags are stripped from the text before it is indexed, reducing disk usage and improving search results.

Creating a Cluster From Scratch

Create Project

Create a Database

Select Free Shared Instance

Add Data

Create Dummy Collection For Testing

Create Search Index

JSON Editor

The visual editor does not allow adding custom analyzers, so we’ll need to use the JSON editor.

Paste the Following JSON

The analyzers array defines a new analyzer that can be used by the index. We then specifically attach that analyzer to the body field, so any other fields you index later keep using the default analyzer unless you assign this one to them as well.

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "body": {
        "analyzer": "htmlStrippingAnalyzer",
        "searchAnalyzer": "htmlStrippingAnalyzer",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [
        {
          "type": "htmlStrip"
        }
      ],
      "name": "htmlStrippingAnalyzer",
      "tokenFilters": [],
      "tokenizer": {
        "type": "standard"
      }
    }
  ]
}

Run Some Tests

Now that you’ve successfully created the index with the htmlStrip character filter, create some sample documents and confirm it behaves as expected.

Insert a Test Document

{
  "body": "<body><a href='https://devtails.xyz'>this</a> should exclude html</body>"
}

Run a Test Query

Try searching for “body” and notice that the test document doesn’t show up in the results, even though its markup contains a body tag.

Now search for something that is part of the text like “html” and confirm that Atlas Search returns the document.
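If you would rather run the test as an aggregation than through the search tester, a minimal sketch of the equivalent `$search` stage might look like the following. It assumes the index was saved under the name default (adjust this if you named yours differently) and that the pipeline is run against the test collection.

```json
[
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "html",
        "path": "body"
      }
    }
  }
]
```

Changing the query value to "body" should return no documents, since the tag name is stripped before it reaches the index.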

Conclusion

In this post we’ve covered adding a custom analyzer to an Atlas Search index to filter out HTML tags. The documentation can be a bit difficult to parse at times, but once you find the right option, Atlas Search generally supports what you need.
