Strongly-Typed Conversion
Minimum MongoDB Version: 4.2
Scenario
A 3rd party has imported a set of retail orders into a MongoDB collection but with all data typing lost (it stored all field values as strings). You want to re-establish correct typing for all the documents and copy them into a new "cleaned" collection. You can incorporate such type transformation logic in the aggregation pipeline because you know the type each field had in the original record structure.
Unlike most examples in this book, the aggregation pipeline writes its output to a collection rather than streaming the results back to the calling application.
Sample Data Population
Drop any old version of the database (if it exists) and then populate a new orders
collection with three orders documents, where each order has text fields only (note, the second document is intentionality missing the field reported
in the sub-document further_info
):
db = db.getSiblingDB("book-convert-to-strongly-typed");
db.dropDatabase();
// Insert orders documents
db.orders.insertMany([
{
"customer_id": "elise_smith@myemail.com",
"order_date": "2020-05-30T08:35:52",
"value": "231.43",
"further_info": {
"item_qty": "3",
"reported": "false",
},
},
{
"customer_id": "oranieri@warmmail.com",
"order_date": "2020-01-01T08:25:37",
"value": "63.13",
"further_info": {
"item_qty": "2",
},
},
{
"customer_id": "tj@wheresmyemail.com",
"order_date": "2019-05-28T19:13:32",
"value": "2.01",
"further_info": {
"item_qty": "1",
"reported": "true",
},
},
]);
Aggregation Pipeline
Define a pipeline ready to perform the aggregation:
var pipeline = [
// Convert strings to required types
{"$set": {
"order_date": {"$toDate": "$order_date"},
"value": {"$toDecimal": "$value"},
"further_info.item_qty": {"$toInt": "$further_info.item_qty"},
"further_info.reported": {"$switch": {
"branches": [
{"case": {"$eq": [{"$toLower": "$further_info.reported"}, "true"]}, "then": true},
{"case": {"$eq": [{"$toLower": "$further_info.reported"}, "false"]}, "then": false},
],
"default": {"$ifNull": ["$further_info.reported", "$$REMOVE"]},
}},
}},
// Output to an unsharded or sharded collection
{"$merge": {
"into": "orders_typed",
}},
];
Execution
Execute the aggregation using the defined pipeline to generate and populate a new collection called orders_typed
:
db.orders.aggregate(pipeline);
Check the contents of the new orders_typed
collection to ensure the relevant fields are now appropriately typed:
db.orders_typed.find();
View the explain plan for the aggregation pipeline:
db.orders.explain("executionStats").aggregate(pipeline);
Expected Results
The same number of documents should appear in the new orders_typed
collection as the source collection had, with the same field structure and fields names, but now using strongly-typed boolean/date/integer/decimal values where appropriate, as shown below:
[
{
_id: ObjectId('6064381b7aa89666258201fd'),
customer_id: 'elise_smith@myemail.com',
further_info: {
item_qty: 3,
reported: false
},
order_date: ISODate('2020-05-30T08:35:52.000Z'),
value: NumberDecimal('231.43')
},
{
_id: ObjectId('6064381b7aa89666258201fe'),
customer_id: 'oranieri@warmmail.com',
further_info: {
item_qty: 2
},
order_date: ISODate('2020-01-01T08:25:37.000Z'),
value: NumberDecimal('63.13')
},
{
_id: ObjectId('6064381b7aa89666258201ff'),
customer_id: 'tj@wheresmyemail.com',\
further_info: {
item_qty: 1,
reported: true
},
order_date: ISODate('2019-05-28T19:13:32.000Z'),
value: NumberDecimal('2.01')
}
]
Observations
-
Boolean Conversion. The pipeline's conversions for integers, decimals, and dates are straightforward using the corresponding operator expressions,
$toInt
,$toDecimal
and$toDate
. However, the operator expression$toBool
is not used for the boolean conversion. This is because$toBool
will convert any non-empty string to true regardless of its value. As a result, the pipeline uses a$switch
operator to compare the lowercase version of strings with the text'true'
and'false'
, returning the matching boolean. -
Preserving Non-Existence. The field
further_info.reported
is an optional field in this scenario. The field may not always appear in a document, as illustrated by one of the three documents in the example. If a field is not present in a document, this potentially significant fact should never be lost. The pipeline includes additional logic for thefurther_info.reported
field to preserve this information. The pipeline ensures the field is not included in the output document if it didn't exist in the source document. A$ifNull
conditional operator is used, which returns the$$REMOVE
marker flag if the field is missing, instructing the aggregation engine to omit it. -
Output To A Collection. The pipeline uses a
$merge
stage to instruct the aggregation engine to write the output to a collection rather than returning a stream of results. For this example, the default settings for$merge
are sufficient. Each transformed record coming out of the aggregation pipeline becomes a new record in the target collection. The pipeline could have used a$out
rather than a$merge
stage. However, because$merge
supports both unsharded and sharded collections, whereas$out
only supports the former,$merge
provides a more universally applicable example. If your aggregation needs to create a brand new unsharded collection,$out
may be a little faster because the aggregation will completely replace the existing collection if it exists. Using$merge
, the system has to perform more checks for every record the aggregation inserts (even though, in this case, it will be to a new collection). -
Trickier Date Conversions. In this example, the date strings contain all the date parts required by the
$toDate
operator to perform a conversion correctly. In some situations, this may not be the case, and a date string may be missing some valuable information (e.g. which century a 2-character year string is for, such as the century19
or21
). To understand how to deal with these cases, see the Convert Incomplete Date Strings example chapter.