Web Analytics using MongoDB Aggregation Framework

Prior to version 2.1, if you wanted to do aggregation in MongoDB (grouping documents by key(s), finding total and average values, etc.) you had to use MapReduce. Although it is a flexible and powerful tool for data analysis, many programmers (myself included) find MapReduce unintuitive; it takes some effort to fully grasp its underlying concepts before you can wield its true power. Also, performing MapReduce from an application development language (other than JavaScript) feels a bit ‘hackish’, to say the least: you have to write your map and reduce functions in JavaScript, embed that code as raw strings in your application, and send them to the MongoDB server through the driver. If the map and reduce functions are complex, the readability and maintainability of your entire application code may take a hit. The Aggregation Framework provides a flexible way to specify and execute aggregation queries in MongoDB. Available since version 2.1, it is an easier alternative to MapReduce. Of course, it has some quirks of its own, but I’ll get to those later.

Basic Concept

The underlying concept of the aggregation framework is this: the documents in your collection pass through the aggregation pipeline. The pipeline is a series of operations that are performed on the documents in the sequence they are defined, an idea borrowed from the pipe (|) operator in UNIX shell programming. The result of one operation in the pipeline is fed into the next one as input. An operation is defined through expressions; an expression specifies the transformation to be applied to the input document. Expressions are generally stateless: they don’t store any information about the documents they process. However, there is a special kind of expression, known as an accumulator, which does preserve state.
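
To illustrate, here is a minimal two-stage pipeline (the collection and field names are invented for this sketch): documents flow through $match first, and only the ones that survive the filter reach $group.

db.events.aggregate([
  // stage 1: filter; only documents matching the predicate flow on
  {$match: {type: "pageview"}},
  // stage 2: accumulate; count the filtered documents per page
  {$group: {_id: "$page", views: {$sum: 1}}}
])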

Web Analytics using Aggregation Framework

Now that we are familiar with the theory, it’s time to see the framework in action by using it to solve a real-world problem. Web analytics seems like a good candidate. If you are logging the traffic to your website in MongoDB, you can run aggregation queries on the data to measure key metrics (the most popular webpage, the busiest day of the month, etc.).

I’ve set up a GitHub repo with the sample code and examples used in this blog post, in case the code is difficult to read on this site.

Populating sample data

I wrote a script that generates 1000 dummy documents and loads them into a MongoDB database. Here’s what a document looks like:

{
  "page" : "/2011/01/02/blog-stats-2010/",
  "visited_at" : ISODate("2012-12-11T10:04:28Z"),
  "user_agent" : "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1",
  "ip_address" : "23.22.63.122",
  "response_time_ms" : 194.53
}

Each document represents a page visit by a user on the site. The page field is the link of the webpage visited, visited_at is the time of the visit, user_agent is the User Agent string of whatever browser/device the user viewed the page with, ip_address is the IP address of the user, and finally response_time_ms is the time (in milliseconds) the web server took to serve the page. Assuming you are running the MongoDB server on your machine (on localhost, port 27017), you can load the sample data by running the script through the mongo client:

$ mongo localhost:27017/mysampledb --quiet /path/to/load_data.js

If the script finishes without any glitches, you should have a collection web_traffic with 1000 documents in the database mysampledb.
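
In case you’d rather not clone the repo, here is a rough sketch of what such a loader could look like. To be clear, this is not the actual load_data.js; the page list, user agent list and value ranges are invented:

// Hypothetical stand-in for load_data.js: inserts 1000 random page visits.
var pages = ["/2011/01/02/blog-stats-2010/", "/2010/02/05/modifying-pdf-files-with-php"];
var agents = ["Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1"];
function rand(n) { return Math.floor(Math.random() * n); }
for (var i = 0; i < 1000; i++) {
  db.web_traffic.insert({
    page: pages[rand(pages.length)],
    // a random timestamp in December 2013 (JavaScript months are zero-based)
    visited_at: new Date(2013, 11, 1 + rand(31), rand(24), rand(60), rand(60)),
    user_agent: agents[rand(agents.length)],
    ip_address: rand(256) + "." + rand(256) + "." + rand(256) + "." + rand(256),
    response_time_ms: Math.round(Math.random() * 50000) / 100
  });
}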

Calculating total visits per day

The following example shows how to calculate the total number of visits on the website for the month of December 2013.

db.web_traffic.aggregate(
[
  {
    $match: {visited_at: {$gte: new Date(2013, 11, 1), $lt: new Date(2014, 0, 1)}}
  },
  {
    $project: { _id: 0, day_of_month: {$dayOfMonth: "$visited_at"}}
  },
  {
    $group: { _id: {day_of_month: "$day_of_month"}, hits: {$sum: 1}}
  }, 
  { 
    $sort: {hits: -1}
  }
])

Let’s examine the pipeline one operation at a time.

  1. The first operation in the pipeline is $match; it selects the documents whose visited_at values fall within December 2013. (Note that JavaScript Date months are zero-based, so new Date(2013, 11, 1) is December 1st, 2013.)
  2. Next we have $project, which reshapes the documents in the stream by adding, removing or changing one or more fields. In this case, we remove the _id field by setting its value to zero (_id is included by default, so we have to remove it explicitly; the other fields are removed automatically if we don’t add them). We also add a new field, day_of_month, which is extracted from visited_at using the $dayOfMonth expression.
  3. The $group operator groups documents together and calculates aggregated values. $group must be supplied a value for the _id field; this is the expression the documents are grouped by. In our example, we are grouping by day_of_month, the field added by the preceding $project operator. The $sum expression keeps a counter, named hits, for every day_of_month seen. In SQL, you could express this as: SELECT DAY(visited_at) AS day_of_month, COUNT(*) AS hits FROM web_traffic GROUP BY DAY(visited_at).
  4. Finally, $sort orders the grouped documents based on hits in descending order (-1 means descending, 1 means ascending).

The output of the operation is like this:

{
  "result" : [
    {
      "_id" : {
        "day_of_month" : 7
      },
      "hits" : 40
    },
    {
      "_id" : {
        "day_of_month" : 16
      },
      "hits" : 39
    },
    .....
    {
      "_id" : {
        "day_of_month" : 20
      },
      "hits" : 23
    }
  ],
  "ok" : 1
}

The result shows us December 7th was the busiest day of the month (40 page views), whereas December 20th was the least busy (23 page views).
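
As a small variation, if you would rather see the days in calendar order than by traffic, sort on the group key instead by replacing the final $sort stage with:

  {
    $sort: {"_id.day_of_month": 1}
  }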

Calculating average, maximum and minimum response times

This next example calculates the average, maximum and minimum values of the page response times (response_time_ms).

db.web_traffic.aggregate([
  {
    $group: {
      _id: null,
      avg_response_time_ms: {$avg: "$response_time_ms"},
      max_response_time_ms: {$max: "$response_time_ms"},
      min_response_time_ms: {$min: "$response_time_ms"}
    }
  },
  {
    $project: {
      _id: 0,
      avg_response_time_s: {$divide: ["$avg_response_time_ms", 1000]},
      max_response_time_s: {$divide: ["$max_response_time_ms", 1000]},
      min_response_time_s: {$divide: ["$min_response_time_ms", 1000]}
    }
  }
])

Let’s quickly go through the operations in the pipeline:

  1. The first operation is a $group where we set _id to null, which means we are not grouping by any field; the whole collection is treated as one group. We calculate the average, maximum and minimum response times by applying the $avg, $max and $min accumulators to the response_time_ms field.
  2. The $project operation suppresses the _id field, and at the same time divides the values by 1000 (with the $divide expression) to transform the millisecond values into seconds (a rough SQL equivalent of the whole pipeline is given below).
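
In SQL, this pipeline is roughly equivalent to: SELECT AVG(response_time_ms) / 1000, MAX(response_time_ms) / 1000, MIN(response_time_ms) / 1000 FROM web_traffic.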

Finding usage share of visitors’ browsers

In this example, we will find the unique browsers and platforms used by the site’s visitors, based on the User Agent string, and calculate each one’s percentage share.

var total = db.web_traffic.find({user_agent: {$exists:true}}).count();
var pipeline = [
  {
    $group: {_id: "$user_agent", count: {$sum: 1}}
  },
  {
    $project: {percentage: {$multiply: ["$count", 100 / total]}}
  },
  {
    $sort: {percentage: -1}
  }
];
db.web_traffic.aggregate(pipeline)

  1. First we find the total number of documents that have the user_agent field populated, and store the number in the variable total.
  2. The $group stage groups the documents by the user_agent field (_id: "$user_agent") and counts the documents in each group (count: {$sum: 1}).
  3. The $project stage calculates the percentage by dividing count by total and multiplying by 100 (see the note after this list).
  4. The $sort stage sorts the result by percentage in descending order.
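
Note that total is an ordinary shell variable: it is evaluated on the client and baked into the pipeline as a constant before the command is sent to the server. If you prefer to keep the division explicit, an equivalent way to write the projection is:

  {
    $project: {percentage: {$multiply: [{$divide: ["$count", total]}, 100]}}
  }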

Finding weekly visit patterns of users

The aggregation pipeline in this example calculates how many times a user visits the website per weekday and which pages they read. The user is represented by ip_address in this case; in real life, websites have more sophisticated methods for tracking unique users.

db.web_traffic.aggregate([
  {
    $project: {
      _id: 0,
      page: 1,
      ip_address: 1,
      visited_weekday: {$dayOfWeek: "$visited_at"}
    }
  },
  {
    $group: {
      _id: {visited_weekday: "$visited_weekday", ip_address: "$ip_address"},
      visits: {$sum: 1},
      pages_visited: {$addToSet: "$page"}
    }
  },
  {
    $sort: {"_id.visited_weekday": 1}
  }
])

  1. $project outputs documents with the fields page and ip_address, plus a computed field named visited_weekday, which is the day of the week (1 is Sunday, 2 is Monday, and so on) of the visited_at timestamp.
  2. The documents are grouped on visited_weekday and ip_address; the total visits per IP address are counted by $sum. Also, the value of the page field is added to an array named pages_visited by the $addToSet expression (a variation using $push is shown after this list).
  3. Finally, the documents are sorted on visited_weekday in ascending order.
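
Note that $addToSet stores each distinct page only once per group. If you wanted one array entry per visit, duplicates included, you would use $push instead:

      pages_visited: {$push: "$page"}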

The result of this aggregation looks like this:

{
  "result" : [
    {
      "_id" : {
        "visited_weekday" : 1,
        "ip_address" : "4.242.114.0"
      },
      "visits" : 19,
      "pages_visited" : [
        "/2012/03/26/tab-completion-for-fabric-tasks-on-os-x",
        "/2010/02/05/modifying-pdf-files-with-php",
        "/2011/11/24/random-snaps-from-thailand-tour",
        "/2011/01/29/on-programming-and-craftsmanship",
        "/2011/03/04/getting-started-with-pipvirtualenv",
        "/2010/05/07/moments-from-darjeeling",
        "/2011/01/02/blog-stats-2010/",
        "/2011/05/06/making-chrome-canary-the-default-browser-os-x"
      ]
    }
...

The first entry in the result set shows that the user with IP address 4.242.114.0 visited the site 19 times on Sunday (weekday 1) and viewed the links listed in the pages_visited array.

Calculating duration of users’ visits on the website

In this last example, we will use a pipeline that calculates how much time (in hours) each user has spent on the site on December 7, 2013.

db.web_traffic.aggregate([
  {
    $match: {visited_at: {$gte: new Date(2013, 11, 7), $lt: new Date(2013, 11, 8)}}
  },
  {
    $sort: {visited_at: 1}
  },
  {
    $group: {
      _id: "$ip_address",
      first_visit: {$first: "$visited_at"},
      last_visit: {$last: "$visited_at"},
      total_visits: {$sum: 1}
    }
  },
  {
    $match: {
      total_visits: {$gt: 1}
    }
  },
  {
    $project: {
      duration_hours: {$subtract: [{$hour: "$last_visit"}, {$hour: "$first_visit"}]}
    }
  },
  {
    $sort: {duration_hours: -1}
  }
])

  1. First in the pipeline is a $match which selects the documents whose visited_at falls on December 7, 2013.
  2. Next, $sort sorts the matched documents on the timestamp in ascending order.
  3. $group groups the documents on ip_address and counts the total visits per IP address. It also saves the first and the last values of visited_at for each group, using the $first and $last accumulators (this is why the preceding $sort matters).
  4. The next $match filters out the documents with a total_visits of 1, keeping only users who visited more than once.
  5. $project calculates the duration of each user’s visit by subtracting the hour of the first visit from the hour of the last visit (a sturdier variant is shown after this list).
  6. Finally, the documents are sorted on duration_hours in descending order.
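
A caveat on step 5: subtracting the $hour values only works because both timestamps fall on the same calendar day, and it ignores the minutes. A sturdier variant of the $project stage, sketched below, subtracts the timestamps directly, which yields the difference in milliseconds, and divides it down to hours:

  {
    $project: {
      duration_hours: {
        $divide: [{$subtract: ["$last_visit", "$first_visit"]}, 1000 * 60 * 60]
      }
    }
  }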

Limitations

There are some limitations of the framework that you should be aware of:

  • The final output of the aggregation pipeline is returned as a single document in memory. This is in contrast with MapReduce, which allows you to store the result persistently in a collection. It also means the size of the aggregation result must not exceed the MongoDB document size limit, which is currently 16 megabytes.
  • If an operation consumes more than 10% of system RAM, it produces an error.
  • The aggregation pipeline cannot operate on these data types: Symbol, MinKey, MaxKey, DBRef, Code, CodeWScope.

TAB completion for fabric tasks [on OS X]

Found an excellent bash script on foobar.lu that enables tab completion for the Fabric tasks defined in a fabfile.py script (in case you don’t know what Fabric is, try Yuji Tomita’s article to get started with it). I just had to modify it a little to make it work on OS X. I actually wanted to post it in the comments section on that site, but it seems comments are blocked there. So here it goes:

Install bash-completion with Homebrew

brew install bash-completion

The installation creates an etc/bash_completion file under the /usr/local directory. To finish the setup, you’ll need to add some lines to your ~/.bash_profile file: open it and add the following at the end:

function _fab_complete() {
    local cur
    if [ -f "fabfile.py" ]; then
        cur="${COMP_WORDS[COMP_CWORD]}"
        COMPREPLY=( $(compgen -W "$(fab -F short -l)" -- ${cur}) )
        return 0
    else
        # no fabfile.py found. Don't do anything.        
        return 1
    fi
}
if [ -f /usr/local/etc/bash_completion ] && ! shopt -oq posix; then
    . /usr/local/etc/bash_completion
    complete -o nospace -F _fab_complete fab
fi

Reload the file with the source ~/.bash_profile command, then navigate to the project directory where your Fabric script is located and try it out yourself.

Giveaway Contest : PHP and MongoDB Web Development Beginner’s Guide

I’m happy to announce that Packt Publishing has arranged a giveaway contest for the book PHP and MongoDB Web Development: Beginner’s Guide. All you have to do is participate in a small programming challenge. The winners of the contest will each receive a copy of the book, free of charge!

The challenge is to build a standalone, reusable HTTP session manager using PHP and MongoDB.

The details of the contest are available here.

We hope that this contest will spark your interest in building web apps using PHP and MongoDB. Happy Coding!

PHP and MongoDB Web Development Beginner’s Guide – Thoughts of a first-time author

Social networking doesn’t always make you procrastinate; sometimes it pays off! When @packtauthors tweeted that they were looking for someone to author a book on PHP and MongoDB, I made contact. A few weeks later I signed a contract to write the book. And six months after that, I am pleased to announce that PHP and MongoDB Web Development Beginner’s Guide is published and out for sale!

In this post I intend to share a few words about the motivation behind the book and the journey of a first time author.

The Motivation

I’m a supporter of the idea that MongoDB can potentially be the new M in LAMP. Web application data storage requirements have changed a lot during the past 4-5 years. Instead of producing content of their own, the most popular websites host content created by their users. This content is diverse in nature and humongous in volume, and mapping such diverse data onto a rigid data structure gets harder as the volume grows. This is where the ‘flexible schema’ nature of MongoDB fits really well. MongoDB is also easy to learn; developers with relational database experience should have little trouble adapting to it, since there is a lot of similarity between the underlying concepts of an RDBMS and MongoDB (think documents for rows, and collections for tables). Developers don’t need to wrestle with radical ideas such as column-oriented or graph-based data structures, as some other NoSQL databases require them to. Finally, it is open source and freely available, supports multiple platforms (Windows/Linux/OS X), has great documentation and a very cooperative community, and plays nicely with PHP! All this has led me to believe that in the near future MongoDB will be where MySQL is right now: the de facto database for web application development (I would urge you to read Stephen O’Grady’s article, which makes more persuasive arguments). And since PHP is the dominant language for web programming, writing a book on web development with PHP and MongoDB felt just right.

The intended audience for this book is web developers who are completely new to MongoDB. It focuses on application development with PHP and MongoDB rather than on MongoDB alone. The first few chapters ease the reader into MongoDB by building a simple web application (a blog) and handling HTTP sessions with MongoDB as the data back-end. The later chapters tackle ‘interesting’ problems, such as storing real-time web analytics, hosting and serving media content from GridFS, and using geospatial indexing to build location-aware web apps. The reader will also explore scenarios where MongoDB and MySQL can be used together as a hybrid data back-end.

The Inspiration

Scott Adams, the creator of the famous Dilbert comic strip, wrote an inspirational article in The Wall Street Journal. I’m going to quote a few lines here:

“I succeeded as a cartoonist with negligible art talent, some basic writing skills, an ordinary sense of humor and a bit of experience in the business world. The ‘Dilbert’ comic is a combination of all four skills. The world has plenty of better artists, smarter writers, funnier humorists and more experienced business people. The rare part is that each of those modest skills is collected in one person. That’s how value is created.”

These words moved me. I like programming and I like writing, and although there are smarter programmers and better writers out there, by combining these two passions I could potentially produce something of value. Besides, I had an amazing learning experience with MongoDB: I built an API analytics solution with MySQL which became difficult to handle as the volume of data grew, started playing with MongoDB as a potential alternative, and a month later had moved the entire dataset to a more solid and scalable solution based on MongoDB. I wanted to share this learning experience through a series of blog posts but lacked the personal discipline and commitment to do so. Being obligated to deliver a book within tight deadlines solved that problem!

I also must thank Nurul Ferdous, my friend and former colleague, who is a published tech author himself. His guidance and influence have been instrumental.

The Journey

My journey as a first-time author has been an exhausting yet amazing one! I work in a tech startup, which naturally means longer than usual hours and harder than usual problems to solve. I would come home late and tired, research MongoDB topics, plan how to deliver the message to the reader, write, test and debug code, draft the content in a text editor, and fight with Microsoft Word so the content had the formatting required by the publisher. Then on weekends I would revise and rewrite most of what I had done over the week and hustle to make the deadline. Nevertheless, it has all been a rewarding experience.

In the rewrite phase I had a lot of help from the technical reviewers – Sam Millman, Sigert De Vries, Vidyasagar N V and Nurul Ferdous. They corrected my errors, showed me what could be added to the content and what should be gotten rid of, and helped me communicate complicated topics to readers in a clearer way. I convey my sincere appreciation to them!

Time to end this lengthy blog post. I hope you find the book enjoyable and use it to build some really cool PHP-MongoDB apps! I will then consider my endeavor a success.

Learn core Python from Stackoverflow

The other day I needed to brush up my knowledge of Python decorators to implement a profiling hook for a Django REST API I’ve been working on. I googled around and landed on a question titled Understanding Python Decorators on Stackoverflow.com. The answer given by the user e-satis is an excellent introduction to decorators in Python, although it has not been selected as the official answer. Go ahead and give it a read if you have done some Python coding.

This got me thinking. This guy did not care whether his answer got selected over the others (so that he could bag some reputation). He took his time and explained what decorators are in a well-thought-out, step-by-step manner. The end result is an answer worthy of being a chapter on decorators in a beginner-level Python book! So there must be other answers as awesome as this one, given by users as knowledgeable as e-satis. I set out to find a few more great StackOverflow threads on core Python programming, bookmarked them for future reads, and am now sharing them in this blog post. You should read each of them at least once to make your knowledge of core Python more solid.

Understanding Python decorators

As I mentioned above, a great introduction to decorators in Python.

What is a meta-class in Python

Another great answer from e-satis (later turned into a community wiki). It explains what meta-classes are in Python and why they are so useful.

How to sort a list of objects in Python, based on an attribute of the objects?

The answer to this question often comes in handy!

Understanding Python super() and __init__ methods

A short yet useful answer explaining what super() is good for. You should also read the article Python’s super() considered super! along with this.

What is the relationship between __getattr__ and getattr?

I always wondered, and Kimvais provided the answer! Also check out Alex’s answer in the same thread.

The Python yield keyword explained

Another gem from e-satis. It introduces the concept of iterators and generators in Python and goes on to explain the use of the yield keyword.

Single quotes vs. double quotes in Python

The stylistic reasons for choosing double quotes or single quotes when coding in Python. Worth a read.

Python Compilation/Interpretation process

A very detailed answer explaining the Python compilation/interpretation process to programmers not coming from a comp-sci background; something every Pythonist should know.

Python progression path- from apprentice to guru

Last but not least! A great thread showing what road a Python beginner should take to achieve mastery.

If you know about any other great StackOverflow posts, please share them in the comments section and I’ll update my list. The posts must be about Python (core Python, more specifically).

Getting started with pip/virtualenv

If you find yourself in a situation where you have to work with multiple Django projects on the same system, where each one requires a specific version of Django (and some other libraries), or even a specific version of Python (2.5, 2.6, ...), the combination of virtualenv and pip will make your life easier.

virtualenv is a tool to create isolated Python environments.

pip is a Python package installer, much like easy_install. It complements virtualenv; you’ll use pip to install, upgrade, or uninstall packages in isolated virtualenvs.

In this post, I’ll show a simple workflow that will give you the basics on how to use these tools to manage multiple Django projects.

Installation

First, get these tools installed on your machine. If you have easy_install set up, you can install pip by just issuing the following command:

sudo easy_install pip

Now install virtualenv, using pip itself:

sudo pip install virtualenv

Creating virtualenvs

Now that you have them installed, let’s try creating some virtualenvs. It’s good to have a specific location on your machine where you keep all your virtualenvs. I personally like to have them under the ~/.virtualenvs directory.

Create the directory

mkdir -p ~/.virtualenvs

Navigate into it

cd ~/.virtualenvs

Create a virtualenv

virtualenv -p python2.5 --no-site-packages projectA

The ‘-p’ option specifies which version of the Python interpreter you want to use for the virtualenv (in case you have multiple on your system). If not specified, the default interpreter will be used.

The ‘--no-site-packages’ option means do NOT inherit any packages from /usr/lib/python2.5/site-packages (or wherever your global site-packages directory is). Use this if you don’t want to depend on the global packages and want more isolation.

Finally, ‘projectA’ is the name of the virtualenv, named after the project itself (not a rule, just a good convention).

You’ll see the directory projectA created once the command prompt returns. This is the virtualenv; feel free to explore it.
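
Next, install the packages your project needs into the new virtualenv using its own pip. For example (the Django version pinned here is purely illustrative):

~/.virtualenvs/projectA/bin/pip install Django==1.2.5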

Activating/Deactivating the virtualenv

Once pip has finished downloading and installing the packages, hopefully without any glitches, activate the virtualenv:

source ~/.virtualenvs/projectA/bin/activate

You should see “(projectA)” in your command prompt.

To deactivate it, type ‘deactivate’ at your prompt and hit enter. The “(projectA)” prefix should disappear.

So now you have an isolated and functional environment for your Django project. You should go ahead and create a new virtualenv with more recent versions of Python/Django as your homework.

I hope you’ve got the basics down. Check out the online docs to learn more advanced usage of these tools.

In the next post, I’ll talk about virtualenvwrapper, a tool that makes managing multiple virtualenvs a little easier.

On Programming and Craftsmanship

I’ve just finished reading the article “Programming is not Craftsmanship” and couldn’t help but share my own views on this topic. If you haven’t read the article, I urge you to do so right away.

I agree with the author’s view, in summary. As a programmer, you may take pride in your knowledge and skills, but the hard truth is the real world hardly distinguishes you from a plumber (maybe a more literate plumber, if I dare say). They don’t care whether or not the pipes are laid out in an elegant manner, as long as the water is running. They focus on results, not the process or the tools. I’ve tried to share this point of view in the feedback section of every “Why programming language X rocks and Y sucks” blog post I’ve encountered.

But let’s consider the hypothetical situation where you are an entrepreneur with an idea for an amazing piece of software. You hire some smart engineers. They build it, and the sales guys close deals. Customers are satisfied, and they ask for more features. The programmers code some more, you bill your customers for their work. Everyone’s happy.

Now, a few years later, your techies are bored. They are looking for new challenges, as any good hacker would. You recruit some smart guys to take over their work. They enter a painful transition period: the greenhorns have to learn the intricate details of the feature-bloated software their predecessors built, so they can fix, change or add stuff as demanded by the customers.

And here’s where the architecture of the software comes into play. If your early programmers laid out a solid groundwork, keeping in mind the readability of the code and the extensibility of the software itself, the new guys will have less of a hard time getting things done. On the other hand, if the programmers never cared about those things and just produced code to meet deadlines, the new guys will go through hell figuring out which part of the software does what and how it all comes together. Your customers are still waiting for those bugs to be fixed and those features to be added. They get impatient at the delay, and displeased at the higher-than-usual billing hours upon delivery. At this point you may think having good “documentation” helps, but it hardly does: if the programmers never had the time or will to build the software right, odds are they didn’t have it for documenting it either.

I’ve seen (and am continuing to see) it happen. Management and customers were happy playing with the pretty UI. But when I popped the hood out of curiosity, I was surprised (to say the least) to see, among other things, the spiderweb-like intricacies of the if-else branches, and was left with sympathy for the poor soul who would have to ‘maintain’ this codebase (praying that I wouldn’t be the one).

I’ll draw my conclusion with the personal opinion that a successful programmer (or product manager) knows how to strike a balance between the two extremes. The prima donna programmer who whines about the elegance of code should be discouraged by all means. But the coder who pays not the slightest attention to quality must be avoided. For a software shop whose sole business plan is to “build, ship and forget”, the latter breed may prove profitable, but not for the enterprise that seeks to deliver innovative and useful products.

(Joel Spolsky, one of my favorite bloggers, gave a presentation on a similar topic. You can watch it here.)