Introduction
In this post we’ll create a brand new MongoDB Atlas cluster and a search index that uses a custom analyzer to strip HTML out of the collection field we are searching.
Problem
By default, Atlas Search indexes string fields with its standard analyzer, which tokenizes nearly everything it finds in the field, markup included. This leads to higher disk usage, but more importantly it pollutes search results with HTML tags and attributes that you likely don’t want to be searching.
Solution
Atlas Search allows you to specify custom analyzers for the fields you are indexing. A Google search turned up the htmlStrip character filter, but it took me a few minutes to understand how to use it. With this filter enabled, HTML tags are stripped from the text before it is tokenized, so they never reach the index. That reduces disk usage and improves search results.
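To get a feel for what the filter changes, here is a rough local approximation in Python. It is not how Atlas implements htmlStrip: the `strip_html` and `rough_tokens` helpers and the sample string are my own, and the regex is only a crude stand-in for the standard tokenizer.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards tags and their attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(text):
    """Approximate a character filter that removes HTML markup."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()
    return " ".join(parser.chunks)


def rough_tokens(text):
    # Crude stand-in for a word tokenizer: runs of letters, digits, underscores.
    return re.findall(r"\w+", text)


sample = "<p class='intro'>Hello <b>world</b></p>"

# Without stripping, tag and attribute names become searchable tokens.
print(rough_tokens(sample))

# With stripping, only the visible text is left to tokenize.
print(rough_tokens(strip_html(sample)))
```

The first list contains noise such as `p`, `class` and `b`, while the second contains only the words from the visible text.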
Creating a Cluster From Scratch
Create Project
Create a database
Select Free Shared Instance
Add Data
Create Dummy Collection For Testing
Create Search Index
JSON Editor
The visual editor does not allow adding custom analyzers, so we’ll need to use the JSON editor.
Paste the Following JSON
The analyzers array defines a new analyzer that can be used by the index. We then specifically attach that analyzer to the body field. Because the analyzer is set on the field rather than on the whole index, any other fields you add to the mapping keep using the default.
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "body": {
        "analyzer": "htmlStrippingAnalyzer",
        "searchAnalyzer": "htmlStrippingAnalyzer",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [
        {
          "type": "htmlStrip"
        }
      ],
      "name": "htmlStrippingAnalyzer",
      "tokenFilters": [],
      "tokenizer": {
        "type": "standard"
      }
    }
  ]
}
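If you would rather create the index from a script than paste it into the UI, the following is a minimal sketch using pymongo. It assumes pymongo 4.5 or newer, which as far as I know is where `create_search_index` was added. The connection string, database, collection, and index names are placeholders, and whether your cluster tier allows creating search indexes programmatically is worth checking in the Atlas documentation. The `undefined_analyzers` helper is my own addition: it catches the easy mistake of referencing an analyzer name that is never defined.

```python
# Sketch only: the URI, database, collection, and index names are placeholders.
INDEX_DEFINITION = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "body": {
                "type": "string",
                "analyzer": "htmlStrippingAnalyzer",
                "searchAnalyzer": "htmlStrippingAnalyzer",
            }
        },
    },
    "analyzers": [
        {
            "name": "htmlStrippingAnalyzer",
            "charFilters": [{"type": "htmlStrip"}],
            "tokenizer": {"type": "standard"},
            "tokenFilters": [],
        }
    ],
}


def undefined_analyzers(definition):
    """Return analyzer names used by mapped fields but never defined."""
    defined = {a["name"] for a in definition.get("analyzers", [])}
    used = set()
    for field in definition.get("mappings", {}).get("fields", {}).values():
        for key in ("analyzer", "searchAnalyzer"):
            name = field.get(key)
            # Built-in analyzers such as "lucene.standard" need no definition.
            if name and not name.startswith("lucene."):
                used.add(name)
    return used - defined


def create_index(uri, db_name, coll_name, index_name="default"):
    """Create the search index; needs pymongo and a reachable Atlas cluster."""
    # Imported lazily so the local check above runs without pymongo installed.
    from pymongo import MongoClient

    collection = MongoClient(uri)[db_name][coll_name]
    return collection.create_search_index(
        {"name": index_name, "definition": INDEX_DEFINITION}
    )


# An empty set means every referenced analyzer is defined.
print(undefined_analyzers(INDEX_DEFINITION))
```

Calling `create_index("<your connection string>", "test", "posts")` would then submit the same definition shown above. The database and collection names here are only examples.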
Run Some Tests
Now that you’ve created the index with the htmlStrip character filter, insert a sample document and confirm that the index behaves as expected.
Insert a Test Document
{
  "body": "<body><a href='https://devtails.xyz'>this</a> should exclude html</body>"
}
Run a Test Query
Try searching for “body”. It only appears in the document as an HTML tag, so the document should not show up in the results.
Now search for a word that is part of the text, like “html”, and confirm that Atlas Search returns the document.
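The same two checks can be run from code with a `$search` aggregation stage. This is a sketch under a few assumptions: the index is named `default`, the helper names `build_search_pipeline` and `run_search` are my own, and the connection details are placeholders you would fill in.

```python
def build_search_pipeline(term, index_name="default", path="body"):
    """Build an aggregation pipeline that runs a text search on one field."""
    return [
        {"$search": {"index": index_name, "text": {"query": term, "path": path}}},
        {"$limit": 5},
        {"$project": {"_id": 0, path: 1}},
    ]


def run_search(uri, db_name, coll_name, term):
    """Run the search against a real cluster; needs pymongo installed."""
    # Imported lazily so the pipeline can be inspected without pymongo.
    from pymongo import MongoClient

    collection = MongoClient(uri)[db_name][coll_name]
    return list(collection.aggregate(build_search_pipeline(term)))


# Print the $search stage for each test term.
for term in ("body", "html"):
    print(build_search_pipeline(term)[0])
```

With the index in place, I would expect `run_search(...)` to return an empty list for “body” and the test document for “html”. One thing to watch: the analyzer above has an empty tokenFilters list, so as far as I can tell terms are not lowercased, and a query for “HTML” may not match “html”.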
Conclusion
In this post we’ve covered adding a custom analyzer to an Atlas Search index to strip HTML tags from an indexed field. The documentation can be a bit difficult to parse at times, but once you find the right page, Atlas Search generally supports what you need.