Analyze and visualize nginx log data with parsible and plogx

Monday, September 15, 2014

Recently I moved this blog from WordPress to my own server and started serving my articles with mynt, a static site generator. Besides several benefits, there is one downside: no statistics about my blog’s traffic. I thought about using Google Analytics or Piwik, but I’m not interested in tracking the visitors of my blog or in knowing too many details about them, like their eye color. I just need to know how many unique visitors each blog post has (per day, per month, overall). After some quick research I realized that the available options are either paid services or self-hosted solutions mostly written in PHP. Because I’d like to have a self-hosted Python tool, I decided to write some parts of it on my own.

Parsible

While doing some research, I stumbled upon Parsible. It’s a tool written in Python to parse logs in real time. Because it is highly customizable through plugins, it can parse nginx web server logs and process the data further in any way you want. So I first forked Parsible, customized the nginx parser and wrote a mongodb processor, which stores each log item in a mongodb database.
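
To give an idea of what parsing an nginx log line involves, here is a minimal sketch. It is not the parser from my fork. It assumes Python 3 and nginx’s default "combined" log format, and the field names (ip, path, client and so on) as well as the function name parse_line are chosen for illustration only.

```python
import re
from datetime import datetime

# Regex for nginx's default "combined" log format (an assumption; adjust it
# if your log_format directive differs).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<client>[^"]*)"'
)


def parse_line(line):
    """Return a dict for one access log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    item = match.groupdict()
    item["status"] = int(item["status"])
    item["size"] = int(item["size"])
    # nginx writes timestamps like 15/Sep/2014:10:12:01 +0200
    item["time"] = datetime.strptime(item["time"], "%d/%b/%Y:%H:%M:%S %z")
    return item
```

A dict like this is roughly the kind of document a processor could then insert into a database collection.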

plogx

Now that the log data is in a database, a solution is needed to analyze and visualize these log items. I decided to write my own tool called plogx. Basically, it aggregates all log data, applies filters and visualizes statistics about the visitors of a website. The statistics are served to the browser by Flask, a lightweight web microframework.
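
The kind of aggregation I have in mind can be sketched in a few lines of Python. This is not the actual plogx code. It assumes each log item is a dict with the hypothetical keys time (a datetime), path and ip, and it counts a unique visitor as a distinct IP address per page and day.

```python
from collections import defaultdict
from datetime import datetime


def unique_visitors_per_day(log_items):
    """Count distinct IPs per (day, path) from an iterable of log item dicts."""
    seen = defaultdict(set)
    for item in log_items:
        day = item["time"].date().isoformat()
        seen[(day, item["path"])].add(item["ip"])
    return {key: len(ips) for key, ips in seen.items()}


# Made-up sample data: two visits from the same IP count only once.
sample_items = [
    {"time": datetime(2014, 9, 15, 10, 0), "path": "/a", "ip": "203.0.113.7"},
    {"time": datetime(2014, 9, 15, 11, 0), "path": "/a", "ip": "203.0.113.7"},
    {"time": datetime(2014, 9, 15, 12, 0), "path": "/a", "ip": "203.0.113.8"},
    {"time": datetime(2014, 9, 16, 12, 0), "path": "/a", "ip": "203.0.113.8"},
]
stats = unique_visitors_per_day(sample_items)
```

Per-month or overall numbers would work the same way with a coarser key than the day.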

plogx in action

Installation

Four things are needed:

  • nginx log file
  • mongodb
  • parsible
  • plogx

mongodb

To install mongodb, follow the instructions on the mongodb website for your system. For Ubuntu servers this howto is recommended: http://docs.mongodb.org/manual/tutorial/install-mongodb-on-ubuntu/

Parsible

To install Parsible you can clone my fork of it:

git clone https://github.com/pedesen/parsible.git

Parsible runs as a service using supervisor. Here’s the script located in /etc/supervisor/conf.d/parsible.conf:

[program:parsible]
command = python /path/to/parsible/parsible.py --log-file /path/to/your/logfile.log \
  --pid-file /tmp/parsible.pid --parser parse_nginx

autostart=true
autorestart=true
stopsignal=QUIT

stderr_logfile=/var/log/supervisor/parsible_err.log
stdout_logfile=/var/log/supervisor/parsible_out.log

Please don’t forget to change the path to the Parsible script and the location of the nginx log file you want to parse (usually located in /var/log/nginx/).

As mentioned in the Parsible instructions, you have to adjust the logrotate config file if you are using logrotate. In my case I had to add the line referencing /tmp/parsible.pid to the postrotate section of the config file for my log file (located in /etc/logrotate.d/):

prerotate
    if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
        run-parts /etc/logrotate.d/httpd-prerotate; \
    fi; \
endscript
postrotate
    [ ! -f /var/run/nginx.pid ] || kill -USR1 `cat /var/run/nginx.pid`
    [ ! -f /tmp/parsible.pid ] || kill -USR1 `cat /tmp/parsible.pid`
endscript

Parsible can be started through supervisor like this. Because of autostart=true, it should also be started automatically at system startup:

sudo supervisorctl start parsible

If all went as expected, Parsible should start to parse your log file in real time and store each log item as a document in the mongodb database log_db, in a collection named log_items.

plogx

To install plogx, clone the repository and create a Python virtual environment inside it. The pip step also installs the dependencies flask and flask-pymongo:

git clone https://github.com/pedesen/plogx.git
cd plogx
virtualenv env
source env/bin/activate
pip install -r requirements
deactivate

To test plogx, you can use the built-in Flask development server. Please don’t use this in production! Always serve the Flask app with, for example, uwsgi behind nginx or Apache, and at least secure it with basic auth:

cd path/to/plogx
source env/bin/activate
python plogx/app.py

This starts plogx on port 5000. If all went well, you can browse to the web interface at http://127.0.0.1:5000. To configure plogx, copy config_example.py to config.py in the plogx directory and adjust it. Here’s an example config.py which excludes some unwanted visitors from the stats:

# List of paths to ignore (useful for css and js files)
excluded_paths = [
    "/favicon.ico",
    "/feed.xml"
]

# List of IPs to ignore if you want to blacklist some requests
excluded_ips = []

# Ignore log items, which contain one of these strings in the client field
# Note that you can use regular expressions here. Example to ignore Googlebots:
# excluded_clients = [".*Googlebot.*",]
excluded_clients = [
    ".*Googlebot.*",
    ".*Twitterbot.*",
    ".*bingbot.*",
]
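
To make clear what these three lists do, here is a small sketch of how such filters could be applied to a log item. This is not necessarily how plogx implements it. The function name is_excluded and the dict keys path, ip and client are assumptions made for this example.

```python
import re

# Same values as in the example config.py above.
excluded_paths = ["/favicon.ico", "/feed.xml"]
excluded_ips = []
excluded_clients = [".*Googlebot.*", ".*Twitterbot.*", ".*bingbot.*"]


def is_excluded(item):
    """Return True if a log item dict matches any of the exclusion lists."""
    if item["path"] in excluded_paths:
        return True
    if item["ip"] in excluded_ips:
        return True
    # Each entry is treated as a regular expression for the client field.
    return any(re.match(pattern, item["client"]) for pattern in excluded_clients)
```

Paths and IPs are compared literally here, while the client entries are regular expressions matched from the start of the string, which is why the example patterns begin with ".*".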

Maybe I will simplify the process of installation and configuration with tools like Docker. Any advice is appreciated, just write a comment below!