Facets And Counts Text Search

Minimum MongoDB Version: 4.4    (due to use of the facet option in the $searchMeta stage)

Scenario

You help run a bank's call centre and want to analyse the summary descriptions of customer telephone enquiries recorded by call centre staff. You want to look for customer calls that mention fraud and understand what periods of a specific day these fraud-related calls occur. This insight will help the bank plan its future staffing rotas for the fraud department.

To execute this example, you need to be using an Atlas Cluster rather than a self-managed MongoDB deployment. The simplest way to achieve this is to provision a Free Tier Atlas Cluster.

Sample Data Population

Drop any old version of the database (if it exists) and then populate a new enquiries collection with new records:

db = db.getSiblingDB("book-facets-text-search");
db.enquiries.remove({});

// Insert records into the enquiries collection
db.enquiries.insertMany([
  {
    "acountId": "9913183",
    "datetime": ISODate("2022-01-30T08:35:52Z"),
    "summary": "They just made a balance enquiry only - no other issues",
  },
  {
    "acountId": "9913183",
    "datetime": ISODate("2022-01-30T09:32:07Z"),
    "summary": "Reported suspected fraud - froze cards, initiated chargeback on the transaction",
  },
  {
    "acountId": "6830859",
    "datetime": ISODate("2022-01-30T10:25:37Z"),
    "summary": "Customer said they didn't make one of the transactions which could be fraud - passed on to the investigations team",
  },
  {
    "acountId": "9899216",
    "datetime": ISODate("2022-01-30T11:13:32Z"),
    "summary": "Struggling financially this month hence requiring extended overdraft - increased limit to 500 for 2 monts",
  },  
  {
    "acountId": "1766583",
    "datetime": ISODate("2022-01-30T10:56:53Z"),
    "summary": "Fraud reported - fradulent direct debit established 3 months ago - removed instruction and reported to crime team",
  },
  {
    "acountId": "9310399",
    "datetime": ISODate("2022-01-30T14:04:48Z"),
    "summary": "Customer rang on mobile whilst fraud call in progress on home phone to check if it was valid - advised to hang up",
  },
  {
    "acountId": "4542001",
    "datetime": ISODate("2022-01-30T16:55:46Z"),
    "summary": "Enquiring for loan - approved standard loan for 6000 over 4 years",
  },
  {
    "acountId": "7387756",
    "datetime": ISODate("2022-01-30T17:49:32Z"),
    "summary": "Froze customer account when they called in as multiple fraud transactions appearing even whilst call was active",
  },
  {
    "acountId": "3987992",
    "datetime": ISODate("2022-01-30T22:49:44Z"),
    "summary": "Customer called claiming fraud for a transaction which confirmed looks suspicious and so issued chargeback",
  },
  {
    "acountId": "7362872",
    "datetime": ISODate("2022-01-31T07:07:14Z"),
    "summary": "Worst case of fraud I've ever seen - customer lost millions - escalated to our high value team",
  },
]);

 

Now, using the simple procedure described in the Create Atlas Search Index appendix, define a Search Index. Select the new database collection book-facets-text-search.enquiries and enter the following JSON search index definition:

{
  "analyzer": "lucene.english",
  "searchAnalyzer": "lucene.english",
  "mappings": {
    "dynamic": true,
    "fields": {
      "datetime": [
        {"type": "date"},
        {"type": "dateFacet"}
      ]
    }
  }
}

This definition indicates that the index should use the lucene-english analyzer. It includes an explicit mapping for the datetime field to ask for the field to be indexed in two ways to simultaneously support a date range filter and faceting from the same pipeline. The mapping indicates that all other document fields will be searchable with inferred data types.

Aggregation Pipeline

Define a pipeline ready to perform the aggregation:

var pipeline = [
  // For 1 day match 'fraud' enquiries, grouped into periods of the day, counting them
  {"$searchMeta": {
    "index": "default",    
    "facet": {
      "operator": {
        "compound": {
          "must": [
            {"text": {
              "path": "summary",
              "query": "fraud",
            }},
          ],
          "filter": [
            {"range": {
              "path": "datetime",
              "gte": ISODate("2022-01-30"),
              "lt": ISODate("2022-01-31"),
            }},
          ],
        },
      },
      "facets": {        
        "fraudEnquiryPeriods": {
          "type": "date",
          "path": "datetime",
          "boundaries": [
            ISODate("2022-01-30T00:00:00.000Z"),
            ISODate("2022-01-30T06:00:00.000Z"),
            ISODate("2022-01-30T12:00:00.000Z"),
            ISODate("2022-01-30T18:00:00.000Z"),
            ISODate("2022-01-31T00:00:00.000Z"),
          ],
        }            
      }        
    }           
  }},
];

Execution

Execute the aggregation using the defined pipeline:

db.enquiries.aggregate(pipeline);

Note, it is not currently possible to view the explain plan for a $searchMeta based aggregation.

Expected Results

The results should show the pipeline matched 6 documents for a specific day on the text fraud, spread out over the four 6-hour periods, as shown below:

[
  {
    count: { lowerBound: Long("6") },
    facet: {
      fraudEnquiryPeriods: {
        buckets: [
          {
            _id: ISODate("2022-01-30T00:00:00.000Z"),
            count: Long("0")
          },
          {
            _id: ISODate("2022-01-30T06:00:00.000Z"),
            count: Long("3")
          },
          {
            _id: ISODate("2022-01-30T12:00:00.000Z"),
            count: Long("2")
          },
          {
            _id: ISODate("2022-01-30T18:00:00.000Z"),
            count: Long("1")
          }
        ]
      }
    }
  }
]

If you don't see any facet results and the value of count is zero, double-check that the system has finished generating your new index.

Observations

  • Search Metadata Stage. The $searchMeta stage is only available in aggregation pipelines run against an Atlas-based MongoDB database which leverages Atlas Search. A $searchMeta stage must be the first stage of an aggregation pipeline, and under the covers, it performs a text search operation against an internally synchronised Lucene full-text index. However, it is different from the $search operator used in the earlier search example chapter. Instead, you use $searchMeta to ask the system to return metadata about the text search you executed, such as the match count, rather than returning the search result records. The $searchMeta stage takes a facet option, which takes two options, operator and facet, which you use to define the text search criteria and categorise the results in groups.

  • Date Range Filter. The pipeline uses a $text operator for matching descriptions containing the term fraud. Additionally, the search criteria include a $range operator. The $range operator allows you to match records between two numbers or two dates. The example pipeline applies a date range, only including documents where each datetime field's value is 30-January-2022.

  • Facet Boundaries. The pipeline uses a facet collector to group metadata results by date range boundaries. Each boundary in the example defines a 6-hour period of the same specific day for a document's datetime field. A single pipeline can declare multiple facets; hence you give each facet a different name. The pipeline only defines one facet in this example, labelling it fraudEnquiryPeriods. When the pipeline executes, it returns the total count of matched documents and the count of matches in each facet grouping. There were no fraud-related enquiries between midnight and 6am, indicating that perhaps the fraud department only requires "skeleton-staffing" for such periods. In contrast, the period between 6am and midday shows the highest number of fraud-related enquiries, suggesting the bank dedicates additional staff to those periods.

  • Faster Facet Counts. A faceted index is a special type of Lucene index optimised to compute counts of dataset categories. An application can leverage the index to offload much of the work required to analyse facets ahead of time, thus avoiding some of the latency costs when invoking a faceted search at runtime. Therefore use the Atlas faceted search capability if you are in a position to adopt Atlas Search, rather than using MongoDB's general-purpose faceted search capability described in an earlier example in this book.

  • Combining A Search Operation With Metadata. In this example, a pipeline uses $searchMeta to obtain metadata from a search (counts and facets). What if you also want the actual search results from running $search similar to the previous example? You could invoke two operations from your client application, one to retrieve the search results and one to retrieve the metadata results. However, Atlas Search provides a way of obtaining both aspects within a single aggregation. Instead of using a $searchMeta stage, you use a $search stage. The pipeline automatically stores its metadata in the $$SEARCH_META variable, ready for you to access it via subsequent stages in the same pipeline. For example:

    {"$set": {"mymetadata": "$$SEARCH_META"}}