Prior to version 2.1, if you wanted to do aggregation in MongoDB (grouping documents by key(s), finding total and average values etc.) you had to use MapReduce. Although a flexible and powerful tool for data analysis, many programmers (including myself) find using it unintuitive. It takes some effort to fully grasp its underlying concepts, before you start wielding its true power. Also performing MapReduce in an application development language (other than JavaScript), feels a bit ‘hackish’, to say the least. You have to write your map and reduce functions in JavaScript, use the code as native strings in your application which have to be sent to the MongoDB server through the driver. If the map and reduce are complex, the readability and maintainability of your entire application code may take a hit. The Aggregation Framework provides flexible way to specify and execute aggregation queries in MongoDB. Available since version 2.1, it is an easier alternative to MapReduce. Of course, it has some quirks of its own, but I’ll get to that later.
Basic Concept
The underlying concept aggregation framework is this: the documents in your collection pass through the aggregation pipeline. The pipeline is a series of operations that are performed on the documents in the sequence they are defined. The idea is borrowed from the pipes (|) operator in UNIX shell programming. The result of one operation in the pipeline is fed into the next one as input. An operation is defined through expressions. An expression specifies the transformation that is to be applied on the input document. Expressions are generally stateless, they don’t store an information of the documents they process. However there is a special kind of expression known as accumulators, who preserve the state.
Web Analytics using Aggregation Framework
Now that we are familiar with the theories, it’s time to see the framework in action by using it to solve a real world problem. Web analytics seems like a good candidate. If your are logging the traffic to your website in MongoDB, you can run aggregation queries on the data to measure key metrics (most popular webpage, the busiest day in month etc.).
I’ve setup a GitHub repo with the sample code and examples used in this blog post, in case the code is difficult read on this site.
Populating sample Data
I wrote a script to generate some 1000 dummy documents into a MongoDB database. Here’s what a document looks like:
{ "page" : "/2011/01/02/blog-stats-2010/", "visited_at" : ISODate("2012-12-11T10:04:28Z"), "user_agent" : "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1", "ip_address" : "23.22.63.122", "response_time_ms" : 194.53 }
Each document represents a page visit by user on the site. The page
field is the link of the webpage visited, visited_at
is the time of visit, user_agent
is the value of User Agent string of whatever browser/device the user used when seeing the page, ip_address
is the IP address of the user, and finally response_time_ms
is the time (in milliseconds) the web server took to serve the page. Assuming you are running the MongoDB server on your machine (on localhost, port 27017), you can load the sample data by running the script in mongo client:
$mongo localhost:27017/mysampledb --quite /path/to/load_data.js
If the script finishes without any glitch, you should have a collection web_traffic
with 1000 documents in the database mysampledb
.
Calculating total visits per day
The following example shows how to calculate the total number of visits on the website for the month of December, 2013.
db.web_traffic.aggregate( [ { $match: {visited_at: {$gte: new Date(2013, 11, 1), $lt: new Date(2014, 0, 1)}} }, { $project: { _id: 0, day_of_month: {$dayOfMonth: "$visited_at"}} }, { $group: { _id: {day_of_month: "$day_of_month"}, hits: {$sum: 1}} }, { $sort: {hits: -1} } ])
Let’s examine the pipeline one operation at a time.
- The first operation in the pipeline is
$match
, it selects the documents whosevisited_on
values are between December 1st to December 31st 2013. - Next we have
$project
which reshapes the documents in the stream by adding, removing or changing one or more fields. In this case, we remove the_id
field by setting it’s value to zero (_id
is added by default so we have to remove it explicitly. The other fields are removed automatically if we don’t add them). We also add a new fieldday_of_month
, which is the month property ofvisited_on
. The month is extracted by using the$dayOfMonth
expression. - The
$group
operator groups documents together and calculates aggregated values.$group
must be supplied a value for the_id
field, this is expression the documents are going to be grouped by. In our example, we are grouping byday_of_month
, the field added by the preceding$project
operator. The$sum
expression is used to keep a counter, namedhits
, for everyday_of_month
seen. In SQL, you could express this as:SELECT COUNT(*) as hits FROM web_traffic GROUP BY MONTH(visited_on)
. - Finally
$sort
orders the grouped documents based on hits in descending order (-1 means descending while 1 means ascending).
The output of the operation is like this:
{
"result" : [
{
"_id" : {
"day_of_month" : 7
},
"hits" : 40
},
{
"_id" : {
"day_of_month" : 16
},
"hits" : 39
},
.....
{
"_id" : {
"day_of_month" : 20
},
"hits" : 23
}
],
"ok" : 1
The result shows us December 7th was the busiest day of the month (40 page views), whereas December 20 has been least busy (23 page views).
Calculating average, maximum and minimum response times
This next example calculated average, maximum and minimum values of the page response times (response_time_ms).
db.web_traffic.aggregate([ { $group:{ _id:null, avg_response_time_ms: {$avg: "$response_time_ms"}, max_response_time_ms: {$max: "$response_time_ms"}, min_response_time_ms: {$min: "$response_time_ms"} } }, { $project:{ _id: 0, avg_response_time_s: {$divide: ["$avg_response_time_ms", 1000]}, max_response_time_s: {$divide: ["$max_response_time_ms", 1000]}, min_response_time_s: {$divide: ["$min_response_time_ms", 1000]} } } ])
Let’s quickly go through the operations in the pipeline:
- The first is a
$group
where we set the_id
to null. This means we are not grouping by any field. We still have to calculate the average, maximum and minimum values of response times and we are doing that by applying$avg
,$max
and$min
on response_time_ms field. - The project operation suppresses the
_id
field, and at the same time divides the values by 1000 (with the$devide
expression) to transforms the millisecond values into second.
Finding usage share of visitors’ browsers
In this example we will find out the unique browsers and platforms used by the site’s visitors using the User Agent value and calculate their percentage share.
var total = db.web_traffic.find({user_agent: {$exists:true}}).count(); var pipeline = [ { $group: {_id: "$user_agent", count: {$sum: 1}} }, { $project: {percentage: {$multiply: ["$count", 100 / total]}} }, { $sort: {percentage: -1} } ]; db.web_traffic.aggregate(pipeline)
- First we find the total number of documents that have the
user_agent
field populated, store the number astotal
. - The
$group
in the pipeline groups the documents based onuser_agent
field (_id: $user_agent
) and counting their numbers (count: {$sum: 1}
). - The
$project
calculates the percentage by diving count with total and multiplying with 100. - The
$sort
sorts the result by percentage in descending order.
Finding weekly visit patterns of users
The aggregation pipeline in this example calculates how many times a user visits the website per weekday and what pages does he read. The user is represented by ip_address
in this case. In real life, websites have more sophisticated methods for tracking unique users.
db.web_traffic.aggregate([ { $project: { _id: 0, page: 1, ip_address: 1, visited_weekday: {$dayOfWeek: "$visited_at"} } }, { $group: { _id: {visited_weekday: "$visited_weekday", ip_address: "$ip_address"}, visits: {$sum: 1}, pages_visited: {$addToSet: "$page"} } }, { $sort: {"_id.visited_weekday": 1} } ])
$project
outputs documents with fields page,ip_address
and calculated on namedvisited_weekday
which is the day of week (1 is Sunday, 2 is Monday and so on) invisited_at
timestamp.- The documents are grouped on
visited_weekday
andip_address
, the total visits per IP is counted by$sum
. Also, the value of page field is added to an arraypages_visited
by the expression$addToSet
. - Sort the documents on
visited_weekday
in ascending order.
The result of this aggregation looks like this:
{ "result" : [ { "_id" : { "visited_weekday" : 1, "ip_address" : "4.242.114.0" }, "visits" : 19, "pages_visited" : [ "/2012/03/26/tab-completion-for-fabric-tasks-on-os-x", "/2010/02/05/modifying-pdf-files-with-php", "/2011/11/24/random-snaps-from-thailand-tour", "/2011/01/29/on-programming-and-craftsmanship", "/2011/03/04/getting-started-with-pipvirtualenv", "/2010/05/07/moments-from-darjeeling", "/2011/01/02/blog-stats-2010/", "/2011/05/06/making-chrome-canary-the-default-browser-os-x" ] } ...
The first entry in the result set shows user with IP address 4.242.114.0 visited the site 19 times on Sunday (weekday 1) and visited the links in the pages_visited
array.
Calculating duration of users’ visits on the website
In this last example, we will use a pipeline that calculates how much time (in hours) each user have spent on the site on December 7, 2013.
db.web_traffic.aggregate([ { $match: {visited_at: {$gte: new Date(2013, 11, 7), $lt: new Date(2013, 11, 8)}} }, { $sort: {visited_at: 1} }, { $group: { _id: "$ip_address", first_visit: {$first: "$visited_at"}, last_visit: {$last: "$visited_at"}, total_visits: {$sum: 1} } }, { $match: { total_visits: {$gt: 1} } }, { $project: { duration_hours: {$subtract: [{$hour: "$last_visit"}, {$hour: "$first_visit"}]} } }, { $sort: {duration_hours: -1} } ])
- First in the pipeline is a
$match
which selects the documents withvisited_at
on December 7 2013. - Next,
$sort
sorts the matched documents on the timestamp in ascending order. $group
groups the document onip_address
and counts total visits per IP address. It also save the first and the last value ofvisited_at
for each group.- The next
$match
gets filters out the documents withtotal_visit
of 1 or less. $project
calculates the duration of user’s visit by subtracting the first value ofvisited_at
from the last value and expressing it in hours.- Finally sorting the documents on duration in descending order.
Limitations
There are some limitations of the framework that you should be aware of:
- The final output of the aggregation pipeline is stored in a single document in memory. This is in contrast with MapReduce which allows you store the result persistently in a collection. This means the size of the aggregation result should exceed the MongoDB document size limit, which is currently 16 megabytes.
- If the operation consumes more the 10% of system RAM, it produces an error.
- Aggregation pipeline cannot operate on these data types: Symbol, MinKey, MaxKey, DBRef, Code, CodeWScope.
Recommended Reading
- The documentation of the framework on MongoDB official website.
- Reference manual of the pipeline operators and expression.
- SQL to Aggregation mapping chart.