Assignment A: Software evolution analysis

In this assignment, you will perform several typical maintenance and reverse-engineering analyses on a given software repository. The analyses follow the procedures and material described during the lecture. The overall goal of the assignment can be summarized as:

Given a software repository, perform several analyses in order to assess the maintainability, modularity, complexity, and quality of the software in the repository, as well as the development process.

The steps of this assignment are described below, arranged in increasing order of difficulty. The activities to perform include studying the software in the repository directly by manual inspection, running specific analysis tools on (parts of) the software, and discussing in writing several technical topics involving the analysis process.

Deliverable

The first deliverable for this assignment is a report in which every step of the assignment is described. You have to describe the results of the performed analyses, support them with either textual results or screenshots of your tool usage, and comment on the findings. The level of detail of the report has to be high enough that a reader can reproduce your findings and understand how you obtained and justified your conclusions.

The second deliverable for this assignment is an electronic medium (CD/DVD/USB stick) on which all the relevant data mined and analyzed during the assignment is stored, including the final report (PDF). Submissions by e-mail are not accepted (due to problems with very large attachments).

Tools

Within your analysis, you will use a number of software tools. Document and discuss in detail in your report the limitations that you discover when using these tools to perform your tasks. Where applicable, describe how you would improve these tools to better support your tasks, if you were asked to do so.

Step 1: Basic repository investigation

Aim:

The aim of this step is for you to get familiar with the repository itself and with the basic tools used in industry to access and browse repositories. The repository used as an example below is KOffice, which contains the office application suite of the KDE Desktop Environment. However, as indicated further on, you can study several other repositories instead in your assignment.

General background information on the KDE KOffice project is available at www.kde.org. The evolution of this project is stored within a Subversion (SVN) repository. The address of the repository is:

 anonsvn.kde.org/home/kde/trunk/koffice

Anonymous access to the repository is permitted.

The aim of the first step is for you to connect to and browse the repository using several easy-to-use tools. First, use the online web repository browser to get a general overview of the project, or as a fallback in case the more complex tools described next do not work right away. This repository browser is available via: http://websvn.kde.org/trunk/KDE. Note, however, that this is a web browser for the entire KDE project, whereas we are interested only in the KOffice module.

Next, install a Subversion client in order to download, or check-out, the last version of the KOffice application.

You can choose to use either a command-line Subversion client (typically called svn) or, easier, the shell-integrated TortoiseSVN tool. You can download a command-line svn client from here and the shell-integrated client from here. Again, note that you do not have to check out the entire KDE tree, but only the KOffice application. For example, if you use the command-line svn, you would run a command such as

 svn co svn://anonsvn.kde.org/home/kde/trunk/koffice

After you have installed a Subversion client and managed to connect to the repository, use the built-in functions of your client (whether accessible via a GUI or via the command-line) to answer a number of simple questions about the repository:

How many versions are in the repository? When was the first one committed? When was the last one committed?
Which are the largest top-level folders in the repository (i.e. those containing the most files)? Which of these folders contain the most source code? To perform this assessment, consider the last checked-in version of the software.
Which are the three most active developers (i.e. having performed most of the commits) in the first half of the project (from its inception to the midpoint of its evolution)? Which are the three most active developers for the last half of the project?

For every question above, document in detail how you have obtained your answer (e.g. list the command-line commands issued to get the data and/or the GUI operations executed in the GUI client to obtain the relevant information). Also, detail how much time it took you to answer the above questions, starting from the moment you could establish the connection with the repository, up to the moment you got your answer.
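As an illustration of how the first and third questions could be answered from the command line, the sketch below processes the output of `svn log --xml` (for example, redirected to a file and read into a string). The function names are hypothetical, and the "midpoint" is assumed here to be the middle of the commit sequence; splitting at the middle of the time span is an equally valid reading that you should justify in your report.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def parse_log(xml_text):
    """Return (revision, author, date) tuples sorted by revision number."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("logentry"):
        revision = int(entry.get("revision"))
        # Some commits may carry no <author> element; label them explicitly.
        author = entry.findtext("author", default="(no author)")
        date = entry.findtext("date", default="")
        entries.append((revision, author, date))
    return sorted(entries)


def split_halves(entries):
    """Split the history at the middle of the commit sequence (an assumption)."""
    middle = len(entries) // 2
    return entries[:middle], entries[middle:]


def top_committers(entries, n=3):
    """Return the n authors with the most commits among the given entries."""
    return Counter(author for _, author, _ in entries).most_common(n)
```

With `entries = parse_log(text)`, the number of versions is `len(entries)`, the first and last commit dates are `entries[0][2]` and `entries[-1][2]`, and `top_committers` can be applied to each half returned by `split_halves`.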

Step 2: Getting a first visual overview

Aim:

The aim of this step is to perform several more advanced analyses on the repository. For this, you will use a more complex analysis tool for software repositories, called Solid Trend Analyzer (SolidTA). A demo version of this tool is available at the location indicated in the Tools section. The tool comes with an executable installer, and should run on a recent version of Windows. The tool is also accompanied by a user manual.

First, install the tool on your machine using the executable installer. Note that you have the choice of two installers, one being approximately half the size of the other. Install the larger one if you have enough space on your hard drive (the exact requirements, about 120 MB, are shown when running the installer), or the smaller one if you have less space or a slower Internet connection to download the installer. The actual software is the same in both versions. What differs is the number of already analyzed examples included together with the analyzer.

Next, get familiar with the main concepts and ways of working of the SolidTA tool. For this, study the user manual included with the software.

Important note: The user manual coming with the installer is not the latest version! For an updated version, use the Word document called “SolidTA User Manual.doc” located in the same web folder as the executable installers. In particular, study the first two chapters and the chapter on usage examples from the manual. Note that all the examples described in the user manual already come pre-cached in the tool’s example databases included with the installers, so you can try them right away, without first having to connect to a remote repository.

Next, use the evolution view and its various sorting modes to answer the following questions:

Which are the stable development periods of the project? (i.e. periods within which the code base does not change and/or grow significantly)
Which are the moments when the code base undergoes intense changes? You can define these either as moments when the code size grows significantly, or as moments when many commits are made at the same or closely spaced times.
Is the code currently in a stable state? A stable state implies a relatively small number of recent changes, and also a stabilized growth rate in terms of number of files.
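To cross-check what you see in the evolution view, commit activity can also be summarized numerically. The sketch below is one possible definition of "intense" and not part of SolidTA: it counts commits per month from date strings in the format printed by `svn log --xml`, and flags months whose count exceeds a multiple of the mean. The function names and the default factor of 2 are assumptions; note that months without any commit do not enter the mean.

```python
from collections import Counter
from datetime import datetime


def commits_per_month(dates):
    """Count commits per (year, month) from ISO-style date strings."""
    counts = Counter()
    for text in dates:
        # Only the leading YYYY-MM-DD part of the timestamp is needed.
        moment = datetime.strptime(text[:10], "%Y-%m-%d")
        counts[(moment.year, moment.month)] += 1
    return counts


def intense_months(counts, factor=2.0):
    """Return the months whose commit count exceeds factor times the mean."""
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(month for month, count in counts.items() if count > factor * mean)
```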

Step 3: Authors analysis

Aim:

The aim of this step is to perform several analyses regarding the authors of the considered project, i.e. persons who commit changes in the repository. For these analyses, use again the evolution view and the different types of metrics offered by the SolidTA tool (authors, file type, code size, etc). In this step, you should answer the following questions:

Which are the ‘main contributors’ of the project, i.e. authors responsible for large amounts of files in the repository? Do you see there being a ‘chief developer’ who is the main person in charge? Or is there a transition (shift) from one chief developer for a period, to another chief developer for another period?
Having identified the ‘chief developer’, i.e. the person whom you regard as most important in the project, assume now this person has to quit the project. Who, of the remaining developers, is the one you find the most qualified to take over the work of the chief developer? To do this analysis, consider that a person Y knows best about the work of a person X when Y has modified most of the files that X has modified too.
Is there a correlation between the type of files and the developers responsible? For instance, could you say, with reasonable certainty, that “developer X” is chiefly responsible for “files of type Y”? If so, support this by means of an appropriate analysis and snapshots. If not, support your negative answer also by an analysis and snapshots.
Is there a correlation between the location of files and the developers responsible? For instance, could you say, with reasonable certainty, that “developer X” is chiefly responsible for “files in directory Y”? If so, list a number of (large) directories and their corresponding responsible developers. If not, support your negative answer also by an analysis and snapshots.
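The successor criterion stated above (Y knows X's work best when Y has modified most of the files that X has modified) can be made concrete as follows. This is a minimal sketch, assuming you have already extracted, for every author, the set of files that author modified; the function name and the input layout are hypothetical.

```python
def best_successor(files_by_author, chief):
    """Return (author, share): the author who modified the largest share
    of the files that the chief developer modified."""
    chief_files = files_by_author[chief]
    best, best_share = None, 0.0
    if not chief_files:
        return best, best_share
    for author, files in files_by_author.items():
        if author == chief:
            continue
        # Fraction of the chief's files that this author has also touched.
        share = len(files & chief_files) / len(chief_files)
        if share > best_share:
            best, best_share = author, share
    return best, best_share
```

The share is measured relative to the chief developer's files only, so an author who touched many unrelated files gains no advantage.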

Step 4: Code size analysis

Aim:

The aim of this step is to perform several analyses regarding the code size in the considered project. For these analyses, use again the evolution view and the code size metric. Proceed as follows:

First, use the “Lines of text counter” calculator to compute the lines-of-code (LOC) metric for the source code files in the repository. This should generate the ‘Code size’ metric. Be forewarned, this plugin might take quite some time to execute, as it needs to actually access the contents of the source files. After the ‘Code size’ metric is available, answer the following questions.
How is the size of code files evolving in the project? Are source code files growing or shrinking on average? Which are the fastest growing files? Which are the files that shrink the most?
Group all files, in the evolution view, based on the ‘Code size’ attribute (right-click in the evolution view, and then ‘Group selected’). What percentage of the total project size (in number of files) is source code?
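If you want to verify the tool's numbers on a checked-out version, the quantities used in this step are simple to compute by hand. The sketch below is independent of SolidTA: it counts text lines, computes the percentage of files that are source files, and the net growth of one file over its history. The function names are hypothetical, and the suffix list is an assumption taken from the C/C++ file types named in Step 5.

```python
from pathlib import PurePosixPath

# Assumption: "source code" means the C/C++ file types listed in Step 5.
SOURCE_SUFFIXES = {".c", ".cxx", ".cpp", ".cc", ".h"}


def count_lines(text):
    """Lines of text in a file's contents (blank lines and comments included)."""
    return len(text.splitlines())


def source_share(paths):
    """Percentage of the given file paths that have a source-code suffix."""
    if not paths:
        return 0.0
    sources = sum(1 for p in paths if PurePosixPath(p).suffix.lower() in SOURCE_SUFFIXES)
    return 100.0 * sources / len(paths)


def net_growth(sizes):
    """Change in size between the first and last version of one file."""
    return sizes[-1] - sizes[0]
```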

Step 5: Complexity analysis

Aim:

The aim of this step is to perform a complexity analysis on the considered project. For this analysis, you will use the McCabe (cyclomatic) complexity metric. The McCabe metric characterizes functions based on the number of independent paths through the code. Highly complex functions indicate potential maintenance hot-spots and bottlenecks. Even more important than a high complexity value is a trend showing an increasing complexity of (a part of) a system.
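To build intuition for the metric: for a single function, the cyclomatic complexity equals the number of decision points plus one. The sketch below is a rough keyword-counting approximation for C-like code, not the method used by the CCCC calculator; it ignores comments, string literals, and preprocessor directives, and the chosen set of decision tokens is an assumption.

```python
import re

# Tokens treated as decision points (an assumed, simplified set).
DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")


def approximate_complexity(function_body):
    """Approximate McCabe complexity: one plus the number of decision points."""
    return 1 + len(DECISION_PATTERN.findall(function_body))
```

A function without any branching has complexity 1; each additional `if`, loop, `case` label, or short-circuit operator adds one independent path.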

To compute the complexity metric, proceed as follows:

First, use the tool’s views (e.g. file browser, sort files by type, etc.) to locate one of the large(st) folders of the project which contains mainly source code written in C or C++. A possible way to do this is as follows. Enable the “File type” and “Folders” metrics (in the Metrics tab), sort files in the main view by file type, optionally set the colors of the targeted (.c, .cxx, .cpp, .cc, .h) file types to some particular color (e.g. red), and toggle between the file type and folders metrics using the preset controller. This will let you see in which folders the most source files are located. To ease your work further on, you can select the source files (shift-click on the file range in the main window) and create a selection. The selection will appear in the ‘Available file selections’ window (above the tree view). You can rename this selection as desired, e.g. “Source files”, by right-clicking on it.

Use the Projects tab functions to bring in the contents of the selected source folder(s). As explained above, please note that this operation may take quite some time.

Use the “CCCC metric” calculator in the “Calculators” tab to compute the McCabe complexity metric on the selected source code folder. As explained above, please note that the complexity computation may take quite some time.

Select the “McCabe’s complexity” metric from the Metrics tab, produced by the calculator in the previous step, and visualize it in the main view. This will produce a view showing the evolution of the complexity on the selected folder(s). You will note several gray files. These are files on which the metric cannot be computed (they are not source files) or for which the metric calculator failed (for technical reasons).

After you have computed the complexity metric, use the sorting, selection and coloring options in the main view to answer the following questions:

Which are the most complex source files in the entire project?
Are there files on which the complexity decreases significantly in time? Which are these?
Are there files in which the complexity increases significantly in time? Which are these?
For the above files (high complexity and/or complexity rate of variation), are these highly active files (with many changes), or not?
Is there some correlation between the highly complex files and the file size (measured in lines of code, as done in Step 4)? Can you find a direct correlation? Or an inverse correlation? Or is there no visible correlation to be found?
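For the last question, a visual impression can be backed by a number. The sketch below computes the Pearson correlation coefficient between two equally long lists, for example the complexity and the lines of code of each file; it is one possible measure, chosen here as an assumption, and the function name is hypothetical. Values near +1 suggest a direct correlation, values near -1 an inverse correlation, and values near 0 no linear correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return covariance / math.sqrt(variance_x * variance_y)
```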

Step 6: Dependency analysis – explore a scenario

Aim:

The aim of this step is to write a short essay (5 to 7 pages) based on the insight accumulated during the lecture and practical assignment. The topic of the essay is simple:

Imagine you are the designer of a software evolution analysis tool such as the Solid Trend Analyzer, and you have to integrate a dependency evolution analysis within its set of functionalities. Consider, for instance, a repository storing software source code. For every single version vi of the source code, you can extract its call graph Gi. The question is: how to visualize the evolution of the call graphs G1, G2, G3, … Gn? The question is synthetically captured by the image below.

In the essay, address the following topics:

Dependency data modeling:

In a source code repository, several types of dependency graphs exist, such as: a call graph (nodes=function definitions; edges=function calls); a class inheritance graph (nodes=class declarations; edges=class inheritance relations); a containment graph (nodes=software entities, e.g. methods, classes, namespaces, files, folders, packages; edges=containment relations, e.g. method-in-class-in-namespace-in-file, etc.); a build dependency graph (nodes=files in the repository; edges=compilation dependencies between files, e.g. if file X changes, then files Y and Z need to be rebuilt). Note that some of these graphs are cyclic, others acyclic, and others are actually trees.

Prior to any analysis and/or visualization, a data model has to be defined in which the set of graphs Gi has to be stored. Besides the graphs Gi, we must also store correspondences between their nodes and edges, i.e. describe in some way how a data element in version i maps to (corresponds to) a data element in version i+1.

Propose and describe a data model, in terms of either an entity-relationship diagram or a UML diagram, in which we can efficiently and effectively store the above dependency and correspondence information.

What are the different trade-offs of your model?
What decisions will you take for an efficient implementation of your data model?
Imagine that a new version gets committed in the software repository. Describe the update operations your dependency data model needs in order to bring it up to date with the additional information in the new version.
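To make the notion of "graphs plus correspondences" concrete, the sketch below shows one deliberately naive in-memory layout. It is an illustration only and not a recommended answer: all class and method names are hypothetical, every version stores its full graph (no sharing between versions), and correspondences are assumed to be by identical node name, which breaks as soon as an entity is renamed or moved. Your essay should discuss and improve on exactly these trade-offs.

```python
from dataclasses import dataclass, field


@dataclass
class GraphVersion:
    version: int
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, target) pairs


@dataclass
class EvolutionModel:
    versions: list = field(default_factory=list)
    # correspondences[i] maps a node of versions[i] to its counterpart in versions[i + 1]
    correspondences: list = field(default_factory=list)

    def add_version(self, nodes, edges):
        """Append the graph of a newly committed version and link it to its predecessor."""
        new = GraphVersion(len(self.versions) + 1, set(nodes), set(edges))
        if self.versions:
            previous = self.versions[-1]
            # Assumption: two nodes correspond when they have the same name.
            self.correspondences.append({n: n for n in previous.nodes & new.nodes})
        self.versions.append(new)
        return new

    def added_nodes(self, index):
        """Nodes of versions[index] that have no counterpart in the previous version."""
        if index == 0:
            return set(self.versions[0].nodes)
        mapped = set(self.correspondences[index - 1].values())
        return self.versions[index].nodes - mapped
```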

Dependency data visualization:

After we have a data model in which to store the evolution of the dependency data, we must provide effective ways to perform analyses on this model. An essential ingredient for this is the ability to visualize the evolution of the dependency dataset. This is a challenging problem, since the dependency dataset is highly complex, intertwined, and also can be quite large.

Propose and describe a set of methods (techniques) that can be implemented to visualize the evolving dependency data set. Consider the following aspects:

Describe a possible way to visualize the evolution of one of the considered dependency graphs (call graph, inheritance graph, containment graph, or build graph) on the same 2D layout as the Solid Trend Analyzer main view. Recall that this layout uses the x axis for time and the y axis for files, whereby every file version is drawn as a rectangle. The question is: how to visualize the evolution of a dependency graph on top of this layout? Discuss your solutions to several issues such as: scalability (the proposed visualization should work for real-size repositories such as the one you studied during the previous assignment steps); and limited cluttering (the proposed visualization should produce drawings which are reasonably easy to follow, even in the case of a complex, large dataset).
The different graph types discussed above (call, inheritance, containment, and build) exhibit some clear differences. These may lead to different decisions in the design of an evolution visualization. Discuss the following two aspects:
Which graph is of which of the following types: tree; directed acyclic graph; general (cyclic) graph.
A tree is very different from a general cyclic graph. How can you use the knowledge of a particular graph structure in the design of the dependency evolution visualization? Describe in which way you could take advantage in the visualization design if you knew your dependency graph is a tree rather than a general graph.
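The three structural classes mentioned above can be told apart automatically for any extracted graph. The sketch below is a minimal illustration with a hypothetical function name: it uses Kahn's topological-sort algorithm to detect cycles, and then treats an acyclic graph as a (rooted) tree when exactly one node has no incoming edge and every other node has exactly one.

```python
def classify(nodes, edges):
    """Classify a directed graph as 'tree', 'dag', or 'cyclic'."""
    nodes = list(nodes)
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1

    # Kahn's algorithm: repeatedly remove nodes without remaining incoming edges.
    remaining = dict(indegree)
    ready = [n for n in nodes if remaining[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for nxt in successors[node]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)
    if visited < len(nodes):
        return "cyclic"  # some nodes could never be removed, so they lie on or behind a cycle

    roots = [n for n in nodes if indegree[n] == 0]
    if len(roots) == 1 and all(indegree[n] == 1 for n in nodes if n != roots[0]):
        return "tree"
    return "dag"
```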

Reading material for Step 6

Before writing the essay, read the following related material on visualizing structural evolution of software code:

F. Chevalier, D. Auber, A. Telea: Structural Analysis and Visualization of C++ Code Evolution using Syntax Trees
C. Collberg, S. Kobourov, J. Nagra, J. Pitts, K. Wampler: A System for Graph-Based Visualization of the Evolution of Software

Example reports

To support you in executing assignment A, three examples of good reports written by past students who took the course are given below. These give a good idea of the size, complexity, level-of-detail, and presentation style expected from the report. These reports were graded with marks equal to or above 9.

Example report 1

Example report 2

Example report 3

Important: The above reports are provided as examples only. If you use the same repositories in your assignment, make absolutely sure that the findings you discuss in your report are yours only. Material taken from the above example reports, and reused in one's own report, will be considered plagiarism and treated as such.
