After publishing my first paper, I needed to keep track of how many hits it received. Most web log analyzers out there were overkill or didn't give per-file reports, and I didn't want to embed JavaScript in every page just to use Google Analytics or some such. I had access to the web server log files, but there was no analyzer written in Python. So, half procrastinating and out of interest, I wrote one.
httpfilehits is a set of modules that takes a raw web server log file and returns a report of how many hits each file has received, breaking the count down by IP address and attaching some useful data such as the hostname, geoip info, and access dates. Here's a fictitious snippet:
/SomeObscureDirectory/ModeratelyPopularFile.pdf
Unique IP Hits: 74
  175.19.208.190  US Palo Alto   1  ajdgl33532-da201.atlanta.jm.com  2010-10-23
  83.129.13.215   PL             4  sjsan33.tak.pnet.pl              2010-11-05 2010-09-02 2010-08-29(2)
  103.20.174.157  FR Paris       4  gros-espresso.petit-cafe.fr      2010-06-22 2010-06-11 2010-06-08
  14.27.19.194    US            16                                   2010-11-01(2) 2010-10-31(8) 2010-10-30(6)
... (and so on)
In order: IP address, country, city (when known), number of hits, hostname (when resolvable), and dates of hits (duplicate dates are condensed into a count in parentheses).
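The per-file, per-IP counting at the heart of a report like this is simple to sketch. Here's a hypothetical version that matches Apache-style access log lines with a regex and tallies hits; the field names and the regex are my own illustration, not httpfilehits' internals:

```python
import re
from collections import defaultdict

# Matches an Apache common/combined log line well enough for counting.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d+)'
)

def count_hits(lines):
    """Return {file path: {ip: hit count}} for successful requests."""
    hits = defaultdict(lambda: defaultdict(int))
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("status") == "200":
            hits[m.group("path")][m.group("ip")] += 1
    return hits

sample = [
    '175.19.208.190 - - [23/Oct/2010:04:12:01 +0000] "GET /paper.pdf HTTP/1.1" 200 1024',
    '83.129.13.215 - - [05/Nov/2010:11:00:30 +0000] "GET /paper.pdf HTTP/1.1" 200 1024',
]
hits = count_hits(sample)
```

The country, city, and hostname columns would then come from lookups on each IP, which is where the caches described below earn their keep.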
A similar report is output separately for log entries that refer to non-existent files or to directories. There is also a human-readable file translation table that lets you redirect statistics from one file to another (e.g., if it was moved or renamed) or strike out reporting on a file altogether (e.g., bogus accesses or files not made public).
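A translation table like that can be very little code. This is a hypothetical sketch assuming a simple "old-path -> new-path" text format (an arrow with no target strikes the file from the report); it mirrors the idea described above, not httpfilehits' actual file syntax:

```python
def load_translations(text):
    """Parse 'old -> new' lines into {old: new}; new=None means strike out."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        old, _, new = line.partition("->")
        table[old.strip()] = new.strip() or None
    return table

def translate(path, table):
    """Map a path through the table; None means 'drop this entry'."""
    return table.get(path, path)

table = load_translations("""
# moved during a site reorganization
/old/paper.pdf -> /papers/paper.pdf
# never made public; drop from reports
/drafts/secret.pdf ->
""")
```

Keeping the table as plain text means you can fix up a renamed file with any editor, no code changes required.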
All the hostname and geoip data are kept in caches (also human-readable) to *greatly* accelerate lookups. For example, my 2.3 GHz Core 2 Duo processes a 5 MB, 33,000-line log file in about three seconds once the caches are warm. It takes a good chunk of an hour otherwise, all of it spent doing hostname and geoip lookups.
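The caching idea for hostnames can be sketched as a plain dict persisted as tab-separated lines, so repeat runs skip the slow reverse-DNS round-trips; the file format and function names here are my own assumptions, not the tool's:

```python
import socket

def load_cache(path):
    """Read 'ip<TAB>hostname' lines into a dict; missing file -> empty cache."""
    cache = {}
    try:
        with open(path) as f:
            for line in f:
                ip, _, host = line.rstrip("\n").partition("\t")
                cache[ip] = host
    except FileNotFoundError:
        pass
    return cache

def hostname(ip, cache):
    """Return the cached hostname, doing a reverse-DNS lookup on a miss."""
    if ip not in cache:
        try:
            cache[ip] = socket.gethostbyaddr(ip)[0]
        except OSError:
            cache[ip] = ""  # lookup failed; cache the failure too
    return cache[ip]

def save_cache(path, cache):
    """Write the cache back out, sorted for readable diffs."""
    with open(path, "w") as f:
        for ip, host in sorted(cache.items()):
            f.write(f"{ip}\t{host}\n")
```

Note that failed lookups are cached as well; otherwise every run would re-pay the DNS timeout for each unresolvable IP, which is exactly where the uncached hour goes.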
Internally, there is enough information to output more sophisticated reports, such as traffic per file per day, per country, and so on. Implementing such reports should not be too hard, since all the internal information is held in straightforward lists of dictionaries and some filters are already provided.
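To illustrate why lists of dictionaries make such reports cheap, here is a hypothetical example (the key names are mine, not the module's): each hit is a dict, and a new report is just a comprehension or a Counter over the list.

```python
from collections import Counter

# One dict per hit, as described above (field names are illustrative).
hits = [
    {"file": "/paper.pdf",  "ip": "83.129.13.215",  "country": "PL", "date": "2010-11-05"},
    {"file": "/paper.pdf",  "ip": "103.20.174.157", "country": "FR", "date": "2010-06-22"},
    {"file": "/slides.pdf", "ip": "103.20.174.157", "country": "FR", "date": "2010-06-22"},
]

# Traffic per file per country is a one-liner:
per_country = Counter((h["file"], h["country"]) for h in hits)

# And a simple filter: hits on one file in a given month.
nov_hits = [h for h in hits
            if h["file"] == "/paper.pdf" and h["date"].startswith("2010-11")]
```

Per-day, per-hostname, or per-anything reports follow the same pattern: change the key the Counter groups on.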
This software is licensed under the Creative Commons Attribution-ShareAlike license.