Aggregating Data with MongoDB

Akshad Kolhatkar
6 min read · Jul 12, 2021

Dealing with NoSQL Databases the right way…

Starting my journey as a Data Engineer at Lishash, I believe that understanding your database tools is critical; it turns complex data handling into a cakewalk.

Fair warning, this is going to get a little complicated, so I will try to explain it in layman's terms. There will be some query analogies and PyMongo terminology involved, but nothing more complicated than middle school computer science… I hope.

Getting Started

Before diving deep into the workflow of MongoDB querying, let's understand what NoSQL databases are in general.

NoSQL Databases:

NoSQL databases (aka “not only SQL”) are non tabular, and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads.

A common misconception is that NoSQL databases or non-relational databases don’t store relationship data well. NoSQL databases can store relationship data — they just store it differently than relational databases do.

Types of NoSQL Databases include — Document databases, Key-value databases, Wide-column stores and Graph databases [which I’ll surely talk about in a later blog].

MongoDB:

MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB works on the concept of collections and documents, which means it stores data records as documents (specifically BSON documents) that are gathered together in collections. A database stores one or more collections of documents.
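
To make this hierarchy concrete, here is a minimal PyMongo sketch; the connection URI is an assumption, while alpha_db and track_info are the database and collection used later in this post:

from pymongo import MongoClient

# Connect to a MongoDB deployment (the local URI here is an assumption)
client = MongoClient("mongodb://localhost:27017")

alpha_db = client["alpha_db"]          # a database...
track_info = alpha_db["track_info"]    # ...holds one or more collections...
print(track_info.find_one())           # ...which hold BSON documents (dict-like in Python)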

When you start with MongoDB, you will use the find() command for querying data and it will probably be sufficient, but as soon as you start doing anything more advanced than data retrieval, you will need to know more about the MongoDB aggregation pipeline.

Using the CRUD find operation to fetch data in MongoDB can become tedious. For instance, I may want to fetch some embedded documents in a given field, but the find({}) operation will always fetch the main document; it is then up to me to filter this data, select the field holding all the embedded documents, and scan through it for the ones that match my criteria. Since there is no shorter way to do this, I am forced to use some form of loop to go through all these subdocuments until I get the matching results.

However, when there are a million embedded documents, this process takes up a lot of the server's RAM and may be terminated before I get all the documents I wanted, as the maximum document size may be exceeded.
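
To make that concrete, here is a rough sketch of the find-and-loop pattern, reusing the track_info handle from the snippet above and assuming each document embeds an artists array of subdocuments (as in the sample schema shown later in this post):

# Fetch every document and scan its embedded array by hand
matching = []
for track in track_info.find({}):               # pulls entire documents over the wire
    for artist in track.get("artists", []):     # manual loop over the subdocuments
        if artist.get("name") == "mohit chauhan":
            matching.append(track)
            break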

These limitations associated with a large number of documents lead us to the technique of grouping them to enhance the scanning process.

What if I told you there is an easier way to deal with it?

Enter the Aggregation Tools in MongoDB…

Aggregation in MongoDB

In MongoDB, aggregation operations process the data records/documents and return computed results. They collect values from various documents, group them together, and then perform different types of operations on that grouped data, such as sum, average, minimum, and maximum, to return a computed result. It is similar to the aggregate functions of SQL.

But what exactly is Database Aggregation?

Let's think of it this way.

When it's time to gather metrics from MongoDB, maybe for a graphical representation or some other operation, there is no better technique than MongoDB aggregations. Aggregation is an operation that groups data from multiple documents and operates on that grouped data in a variety of ways in order to return one combined, computed result.

Aggregation Pipelines

In MongoDB, the aggregation pipeline consists of stages, and each stage transforms the documents. In other words, the aggregation pipeline is a multi-stage pipeline: each stage takes documents as input and produces a resulting set of documents, which the next stage (if there is one) takes as its input in turn, and so on until the last stage.

The basic pipeline stages provide filters that operate like queries, document transformations that modify the resulting documents, and tools for grouping and sorting documents. You can also use the aggregation pipeline on sharded collections.

MongoDB's aggregate function clusters the records of a collection into groups, on which operations like sum, mean, minimum, and maximum can then be computed from the aggregated data.

Here, the aggregate() function is used to perform the aggregation. Its pipeline is built from three kinds of operators: stages, expressions, and accumulators.

Stages:

Each stage starts with a stage operator; the most common ones are:

  • $match: It is used for filtering documents and can reduce the number of documents passed as input to the next stage.
  • $project: It is used to select some specific fields from a collection.
  • $group: It is used to group documents based on some value.
  • $sort: It is used to sort the documents, that is, rearrange them in a given order.
  • $skip: It is used to skip the first n documents and pass the remaining ones to the next stage.
  • $limit: It is used to pass only the first n documents, thus limiting the output.
  • $unwind: It is used to unwind documents that use arrays, i.e. it deconstructs an array field to return a separate document for each element (see the sketch after this list).
  • $out: It is used to write the resulting documents to a new collection.
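
Since $unwind is the least intuitive of these stages, here is a small sketch of what it does, reusing the track_info handle from earlier (the output naturally depends on your data):

pipeline = [
    {"$unwind": "$artists"},                        # one output document per array element
    {"$project": {"name": 1, "artists.name": 1}},   # keep only the fields we care about
]
for doc in track_info.aggregate(pipeline):
    print(doc)   # e.g. one document per artist of each track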

Expressions:

An expression refers to the value of a field in the input documents, written with a $ prefix.

For example: { $group : { _id : "$id", total : {$sum:"$fare"}}}

Here, $id and $fare are expressions.

Accumulators:

These are basically used in the group stage.

  • $sum: It sums numeric values for the documents in each group.
  • $count: It counts the total number of documents in each group.
  • $avg: It calculates the average of the given values across all documents in the group.
  • $min: It gets the minimum value from all the documents.
  • $max: It gets the maximum value from all the documents.
  • $first: It gets the first document from the grouping.
  • $last: It gets the last document from the grouping.
  • $push: It appends values to an array in the resulting document.
  • $addToSet: It appends values to an array in the resulting document, without duplication.

Note:

  • In $group, "_id" is a mandatory field.
  • $out must be the last stage in the pipeline.
  • $sum: 1 will count the number of documents per group, while $sum: "$fare" will give the total fare generated per id (see the sketch below).
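
To illustrate that last note, here is a hedged sketch assuming a hypothetical rides collection whose documents carry id and fare fields (the field names follow the expression example above):

rides = alpha_db["rides"]   # hypothetical collection with "id" and "fare" fields
pipeline = [
    {"$group": {
        "_id": "$id",                      # "_id" is the mandatory grouping key
        "trip_count": {"$sum": 1},         # $sum: 1 counts the documents per group
        "total_fare": {"$sum": "$fare"},   # $sum: "$fare" adds up the fares per group
    }}
]
print(list(rides.aggregate(pipeline)))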

Syntax

So now, enough of the docs. Let's try to implement some code.

This is an example of how we build the aggregation query:

pipeline = [
  { $match : { … } },
  { $group : { … } },
  { $sort : { … } },
  ...
]
db.collectionName.aggregate(pipeline, options)
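
In PyMongo the same structure is written with plain Python dicts, where the stage operators become string keys; a minimal sketch with placeholder stage contents:

pipeline = [
    {"$match": {"release_year": 2013}},                          # placeholder filter
    {"$group": {"_id": "$album.name", "tracks": {"$sum": 1}}},   # placeholder grouping
    {"$sort": {"tracks": -1}},
]
results = list(track_info.aggregate(pipeline))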

Implementation

Consider a schema containing music track information:

{
  "_id": ObjectId("60ebec19166987268cb7edfa"),
  "name": "Wizard",
  "release_date": "2013-06-27",
  "release_year": 2013,
  "artists": [
    {"name": "jay hardway"},
    {"name": "martin garrix"}],
  "genres": [
    {"name": "electro"},
    {"name": "edm"},
    {"name": "electronic"},
    {"name": "10s"}],
  "languages": ["english"],
  "album": {"name": "wizard"},
  "preview_url": "https://open.spotify.com/track/6JYOQ9AM0byHsOvh6hezJl?si=yjFCrqOmTg-_TEuD7A9EHQ&dl_branch=1"
}

Now let's consider that you are provided with a data requirement stating the following:

Q] Return a list of the top 3 tracks that are in the Hindi language, having a genre of "Bollywood", but where the artist should only be "Mohit Chauhan". The tracks should have a release year in the range 2000–2010. The end result should be sorted by release date and should group only the name, release date, and preview url of the track.

Let's follow all the steps sequentially and start building our own aggregation pipeline:

pipeline = [{"$match": {
"languages.name": "hindi",
"genres.name": "bollywood",
"artists.name": {"$eq": "mohit chauhan"},
"release_year": {"$gte": 2000,"$lt": 2010},
}},
{"$group": {
"_id": "$_id",
"name": {"$first": "$name"},
"release_date": {"$first": "$release_date"},
"preview_url": {"$first": "$preview_url"},
}},
{"$sort": {"release_date": -1}},

{"$limit": 3}

]
res = list(alpha_db.track_info.aggregate(pipeline))
print(res)

Output:

[{'_id': ObjectId('60ebe2c99b136b4fe9b09b51'),
'name': 'Is Jahaan Mein',
'release_date': '2009-09-07',
'preview_url': 'https://open.spotify.com/track/5yETBj2hr5RMkECGhyCakJ?si=lXA47dsOQBG3mmr2AfOJiw&dl_branch=1'},
{'_id': ObjectId('60ebe2e2a3cabf8ec5b0e983'),
'name': 'Pehli Baar Mohabbat',
'release_date': '2009-07-06',
'preview_url': 'https://open.spotify.com/track/58xpZwxUpgrnJMTEmvkZMP?si=Cnq8ng7nTSutIGVM9yuqWA&dl_branch=1'},
{'_id': ObjectId('60ebe2fcb4dfd82eedb11832'),
'name': 'Dooriyan',
'release_date': '2009-06-27',
'preview_url': 'https://open.spotify.com/track/4peiaLdhDgUgQ646Gpbvh3?si=LpNZRoytSJ6a_8_vZWlyjQ&dl_branch=1'}]

Hence we have successfully fetched the exact matching entries and grouped them, all in a single quick round trip to the database.
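
As a side note, since the grouping key here is each document's own _id, the same output shape could also be produced with a $project stage instead of $group; here is a sketch of that alternative under the same assumptions:

alt_pipeline = [
    {"$match": {
        "languages": "hindi",
        "genres.name": "bollywood",
        "artists.name": "mohit chauhan",
        "release_year": {"$gte": 2000, "$lt": 2010},
    }},
    {"$project": {"name": 1, "release_date": 1, "preview_url": 1}},   # _id is kept by default
    {"$sort": {"release_date": -1}},
    {"$limit": 3},
]
print(list(alpha_db.track_info.aggregate(alt_pipeline)))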

That’s it Folks. Catch you guys in the Next one soon…
