Where, when and how Apache’s Commits are made

Six months ago, I started my master’s degree, where I’m researching software engineering and the mining of code repositories. In the coming months, besides writing about C#, JavaScript and programming in general, I intend to write about the subjects I’m researching, the tools I’m developing and the papers I’m reading. In this post, I will talk about the results I obtained by mining and gathering information from the Apache Httpd repository.



Introduction

When developing software, developers are always adding, changing and removing software artifacts. These artifacts can be code, documentation, config files and so on. To manage these changes, developers use a VCS (version control system); well-known examples are CVS, SVN, Git and Mercurial. A VCS and the changes it manages end up being an important source of information about a piece of software and everything related to it. Through mining, we can answer a lot of questions about the software being developed:

  • How many developers are working on the software?
  • Where do they come from?
  • At what times do they work on the project?
  • Which committers work on each piece of the software?
  • Who introduces the most bugs?
  • Who produces the best code?



About Apache Httpd

The Apache HTTP Server Project is an effort to develop and maintain an open-source HTTP server for modern operating systems including UNIX and Windows NT. The goal of this project is to provide a secure, efficient and extensible server that provides HTTP services in sync with the current HTTP standards.

Apache Httpd has been in development since 1996 and today is at version 2.4.6, released in July 2013. During this time, more than 100 developers have made more than 55k commits. Due to its size, Apache Httpd is constantly studied by computer scientists and has been the target of various academic studies. Httpd’s artifacts are managed in an SVN (Subversion) repository, which can be found at this link; you can also find more information about the project here.



The Research

In this research, I was interested in gathering information about the Apache Httpd developers. The following questions were answered:

  • Where do developers come from?
  • When (time of day and weekday) do developers make commits?
  • Which file types are edited?



Mining Apache Httpd Repository – Getting Commits

I didn’t know anything about Python, so I chose Python to extract data from the SVN repository (I’m not crazy, I was just trying to add one more tool to my belt). Honestly, I don’t know if there are others, but I found a very good tool for extracting data from SVN repositories called PySVN. I extracted all the data I needed using the following code:

"""Documentation Link: http://pysvn.tigris.org/docs/pysvn_prog_ref.html#pysvn_client_log"""
import pysvn

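class SvnService(object):
    """Reads the commit log of an SVN repository through pysvn."""
    def __init__(self, repository_url):
        self.repository_url = repository_url

    def get_info(self):
        """Return the repository's log entries, each with the paths it changed."""
        client = pysvn.Client()
        # discover_changed_paths=True adds the list of changed paths to every entry
        data = client.log(self.repository_url, discover_changed_paths=True)
        return data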
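
As a rough illustration of what can be done with the returned entries, the sketch below counts commits per login. It assumes each entry behaves like a dictionary with an 'author' key, which is how the pysvn documentation describes log entries; the function name and the sample data are hypothetical and only stand in for a real log.

```python
from collections import Counter


def commits_per_author(log_entries):
    """Count commits per login; entries without an author go under '(unknown)'."""
    counts = Counter()
    for entry in log_entries:
        # Assumption: entries are dictionary-like and 'author' may be missing
        author = entry.get("author") or "(unknown)"
        counts[author] += 1
    return counts


# Hypothetical sample standing in for the result of SvnService.get_info()
sample_log = [
    {"author": "alice", "date": 1357525800.0},
    {"author": "bob", "date": 1357529400.0},
    {"author": "alice", "date": 1357612200.0},
    {"date": 1357615800.0},
]

for login, total in commits_per_author(sample_log).most_common():
    print(login, total)
```

Grouping by login first keeps the manual lookup work proportional to the number of developers rather than the number of commits.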



Mining Apache Httpd Repository – Getting Geolocations

When getting commits from a repository, we don’t have any information about a developer besides their login. One of my goals was to draw a map with commits distributed by location. To get this information, I grouped all the commits by developer login and started searching for their geolocations manually. The Apache Httpd project has a web page with some developers’ profiles, including their addresses; you can find this information here. After this step, I already had the geolocation for a lot of developers, but I was still missing some of them. The Apache Foundation has another page where I could find a developer’s name from their login. Here is the page. At this point I had all the developers’ names, so I started to google them and, luckily, most of them have an online profile (blog, GitHub, personal site and so on) with their address. The last step was to get their latitude and longitude from their address through the Google Maps API.

// Request
"http://maps.googleapis.com/maps/api/geocode/json?address=Brazil&sensor=false"

// Response
{
   "results" : [
      {
         // ...
            "location" : {
               "lat" : -14.235004,
               "lng" : -51.92528
            }
         // ...   
       }
   ],
   "status" : "OK"
}
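
A minimal sketch of how such a request could be built and its response read in Python, using only the standard library. It assumes the location object sits under a "geometry" key of the first result, as in the Geocoding API's documented response format (the excerpt above elides that level). The function names are hypothetical, and no network call is made here: the sample response is a trimmed copy of the one shown above.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "http://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address):
    """Build the request URL for an address, URL-encoding the query string."""
    query = urlencode({"address": address, "sensor": "false"})
    return GEOCODE_ENDPOINT + "?" + query


def extract_lat_lng(response_text):
    """Return (lat, lng) of the first result, or None if the lookup failed."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    # Assumption: the coordinates live under results[0]["geometry"]["location"]
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Trimmed sample response, not fetched from the network
sample_response = """
{
   "results" : [
      { "geometry" : { "location" : { "lat" : -14.235004, "lng" : -51.92528 } } }
   ],
   "status" : "OK"
}
"""

print(build_geocode_url("Brazil"))
print(extract_lat_lng(sample_response))
```

Checking the status field before indexing into the results avoids an exception for addresses the service cannot resolve.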



Mining Apache Httpd Repository – Adjusting the time zone

Two of my other stats depend on the commit time: the time of day of each commit, for an obvious reason, and the weekday. The relation between time zones and weekdays is a little trickier: if a developer makes a commit around midnight and we adjust the commit time to their time zone, the commit’s date can change as well, and with it the commit’s weekday. Commit times are originally in UTC, so to adjust them I used a Google API again, the Google Time Zone API. Its use is very simple: given a request with a location (latitude and longitude), the API returns a JSON document with information about the location’s time zone.

// Request
"https://maps.googleapis.com/maps/api/timezone/json?location=39.6034810,-119.6822510&timestamp=1331161200&sensor=false"

// Response
{
   "dstOffset" : 0.0,
   "rawOffset" : -28800.0,
   "status" : "OK",
   "timeZoneId" : "America/Los_Angeles",
   "timeZoneName" : "Pacific Standard Time"
}
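
To make the weekday shift concrete, here is a small sketch that applies the offsets from a response like the one above to a UTC commit time. The commit time is hypothetical and the function name is mine; the only assumption about the API is that rawOffset and dstOffset are given in seconds and are added together to get the local offset.

```python
from datetime import datetime, timedelta, timezone


def to_local_time(commit_utc, raw_offset, dst_offset):
    """Shift a UTC datetime by the offsets (in seconds) from the Time Zone API."""
    local_zone = timezone(timedelta(seconds=raw_offset + dst_offset))
    return commit_utc.astimezone(local_zone)


# Hypothetical commit made at 02:30 UTC on Monday, 7 January 2013
commit_utc = datetime(2013, 1, 7, 2, 30, tzinfo=timezone.utc)
commit_local = to_local_time(commit_utc, raw_offset=-28800.0, dst_offset=0.0)

print(commit_utc.strftime("%A %H:%M"))
print(commit_local.strftime("%A %H:%M"))
```

With an offset of eight hours behind UTC, this commit moves from early Monday morning to Sunday evening, so it counts toward a different weekday and a different hour of the day.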



Results



Commits By Location

[Figure: commit map, open view]

[Figure: commit map, closed view]

Most of Apache’s commits come from the USA, England and Germany. Looking only at the top 20 committers, 12 come from the USA, 4 from Germany, 2 from England, 1 from Denmark and 1 from Canada. An interesting point here is Research Triangle Park, which contributed a lot to Apache Httpd with 7 committers (3 of them in the top 20).

Top 20 Committers and their Locations

1. William A. Rowe Jr. – Illinois, USA
2. Jim Jagielski – Maryland, USA
3. André L. Malo – Germany
4. Jeff Trawick – North Carolina, USA
5. Rich Bowen – Kentucky, USA
6. Stefan Fritsch – Germany
7. Rüdiger Plüm – Germany
8. Dean Gaudet – California, USA
9. Graham Leggett – England
10. Ryan Bloom – California, USA
11. Justin Erenkrantz – California, USA
12. Joe Orton – England
13. Joe Schaefer – Florida, USA
14. Daniel Gruno – Denmark
15. Joshua Slive – Canada
16. Ken Coar – North Carolina, USA
17. Doug MacEachern – California, USA
18. Bill Stoddard – North Carolina, USA
19. Ralf S. Engelschall – Germany
20. Roy T. Fielding – California, USA



Commits By Time and By Weekday

[Figure: commits by weekday]
[Figure: commits by time of day]

Most of the commits were made during working hours and on weekdays. This suggests that committers made these commits while working at their jobs or on their research (like the developers from Research Triangle Park).



By Date

[Figure: commits by date]

Analyzing commits over the years, we can see that Apache Httpd is stable: it didn’t have a boom (like Rails had in recent years), and the number of commits per year is, on average, almost the same.



By File Extensions

[Figure: bubble chart of commits by file extension]

As I expected for a C project, most of the commits touch C files, ‘.c’ and ‘.h’. The NotSpecified extension actually stands for files without an extension, usually text files. There are also a lot of documentation files, written in HTML, along with their respective translations: ‘.html.en’, ‘.html.fr’, ‘.html.de’ and so on.
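
How the extensions were bucketed is not shown in this post. One possible rule, sketched below as an assumption, is to take everything after the first dot of the file name, so that translated pages keep their full ‘.html.en’ suffix, and to fall back to NotSpecified when there is no dot. The function name and the sample paths are illustrative only.

```python
import posixpath
from collections import Counter


def extension_of(path):
    """Return everything after the first dot of the file name, or 'NotSpecified'."""
    name = posixpath.basename(path)  # SVN paths use forward slashes
    _, dot, rest = name.partition(".")
    return "." + rest if dot and rest else "NotSpecified"


# Illustrative changed paths, not taken from the real log
changed_paths = [
    "/httpd/httpd/trunk/modules/ssl/mod_ssl.c",
    "/httpd/httpd/trunk/include/httpd.h",
    "/httpd/httpd/trunk/docs/manual/index.html.en",
    "/httpd/httpd/trunk/docs/manual/index.html.fr",
    "/httpd/httpd/trunk/CHANGES",
]

by_extension = Counter(extension_of(path) for path in changed_paths)
for extension, total in by_extension.most_common():
    print(extension, total)
```

A rule like this also turns a name such as ‘httpd-2.0.conf’ into ‘.0.conf’, so a real analysis would probably need a few special cases.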



Top 5 Committers

As a curiosity, below are the same graphs for each of the top 5 committers.



William A. Rowe Jr.

[Figures: commit map (open and closed views), commits by weekday, by time of day, by date and by file extension]



Jim Jagielski

[Figures: commit map (open and closed views), commits by weekday, by time of day, by date and by file extension]



André L. Malo

[Figures: commit map (open and closed views), commits by weekday, by time of day, by date and by file extension]



Jeff Trawick

[Figures: commit map (open and closed views), commits by weekday, by time of day, by date and by file extension]



Rich Bowen

[Figures: commit map (open and closed views), commits by weekday, by time of day, by date and by file extension]
