IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 27, NO. 4, APRIL 2001          351

       Concept Analysis for Module Restructuring
                                                                    Paolo Tonella

       Abstract: Low coupling between modules and high cohesion inside each module are the key features of good software design. This is
       obtained by encapsulating the details about the internal structure of data and exporting only public functions with a clean interface. The
       only native support to encapsulation offered by procedural programming languages, such as C, is the possibility to limit the visibility of
       entities at the file level. Thus, modular decomposition is achieved by assigning functions and data structures to different files. This
       paper proposes a new approach to using concept analysis for module restructuring, based on the computation of extended concept
       subpartitions. Alternative modularizations, characterized by high cohesion around the internal structures that are being manipulated,
       can be determined by such a method. To assess the quality of the restructured modules, the trade-off between encapsulation
       violations and decomposition is considered and proper measures for both factors are defined. Furthermore, the cost of restructuring is
       evaluated through a measure of distance between original and new modularizations. Concept subpartitions were determined for a test
       suite of 20 programs of variable size, 10 public domain and 10 industrial applications. On the resulting module candidates, the trade-off
       between encapsulation and decomposition was measured, together with an estimate of the cost of restructuring. Moreover, the ability
       of concept analysis to determine meaningful modularizations was assessed in two ways. First, programs without encapsulation
       violations were used as oracles, assuming the absence of violations as an indicator of careful decomposition. Second, the suggested
       restructuring interventions were actually implemented in some case studies to evaluate the feasibility of restructuring and to deeply
       investigate the code organization before and after the intervention. Concept analysis proved to be a powerful tool for supporting
       module restructuring.

       Index Terms: Concept analysis, modularization, encapsulation, abstract data type, legacy systems, reengineering, restructuring.



The author is with the ITC-irst Centro per la Ricerca Scientifica e Tecnologica, Povo (Trento), Italy. E-mail: [...]
Manuscript received 4 Nov. 1998; revised 7 Mar. 2000; accepted 11 July 2000.
Recommended for acceptance by H. Muller.
For information on obtaining reprints of this article, please send e-mail to: [...], and reference IEEECS Log Number 108171.
0098-5589/01/$10.00 © 2001 IEEE

1   INTRODUCTION
Most complex man-made systems are designed and developed by breaking down their overall structure into smaller, relatively independent units. In many fields, one of which is software engineering, decomposition driven by abstraction is the key to managing complexity. A decomposed, modular computer program is easier to write, debug, maintain, and manage. A program consisting of modules that exhibit high internal cohesion and low coupling between each other is considered superior to a monolithic one.
   Inadequate modularization often makes maintenance of old legacy systems expensive and difficult. In some instances, the original modular structure of the program may undergo degradation due to the violations introduced by successive maintenance interventions. In others, even the original design of the program was not conceived to be modular, resulting in an increasingly convoluted and, in the end, unmanageable system.
   Improving the modular structure of a program is a form of preventive maintenance that is often necessary when the system undergoes new releases. In fact, modifying an intricate code base may not be feasible unless a preliminary restructuring step is performed. In other cases, restructuring becomes unavoidable if the system is to survive its growing entropy.
   In languages such as C, the support intrinsically given to modularization is minimal. Data structures and functions can be made private to a file by exploiting the storage-class specifier static. Therefore, in the following, the file will be considered the basic modular unit for C programs. The programmer can violate the encapsulation originally designed for a module, if any was designed, by means of pointers, which give access to any field of a given data structure, and of function pointers, which give access to the functions. Moreover, there are situations in which encapsulation of data structures is not enforced although it would be desirable to have it. Direct access to data structures is intermixed with the usage of interface functions, while a more disciplined interaction of client modules could result in improved maintainability and understandability.
   This paper presents a novel approach to module restructuring based on concept analysis. The notion of concept subpartition is introduced to obtain meaningful combinations of the concepts extracted by concept analysis which can be extended to become candidate modularizations of the original program. Concepts can be characterized as groupings of objects sharing common attributes (objects and attributes introduced in the framework of concept analysis should not be confused with objects and attributes of object-oriented programming). Functions and data structure accesses instantiate the notions of objects and attributes for the present application of concept analysis. Therefore, concepts represent the basic elements that determine the borders encapsulating functions into modules. If the attributes are able to capture the internal structure accesses performed by the functions in the program, concepts and extended concept subpartitions
result in highly cohesive module candidates, organized around the data structure being manipulated. The data structures around which modules are built may be static (e.g., global variables) or dynamic (heap allocated) and functions operating on them have their type in the signature if they are not globally accessible. Consequently, three kinds of attributes are considered: dynamic memory, signature types, and global variables. A module can encapsulate a set of operations manipulating a common dynamically allocated data structure (e.g., a list or a tree). Moreover, a module can group functions receiving a user-defined data structure as a parameter and operating on it. Finally, global variables can be the shared structures around which a module is built.
   When the modules of a program are restructured, two contrasting factors have to be controlled: encapsulation and decomposition. It is easy to obtain solutions to the restructuring problem if only one of these factors is considered. A program with all functions in a single module has no encapsulation violations but has a low level of decomposition. On the contrary, assigning every function to a distinct module produces the maximum decomposition, but also the maximum encapsulation violations. A means of evaluating the trade-off between encapsulation and decomposition is suggested here and is based on proper measures of the two factors to be compared with the original levels. In fact, there is no absolute optimal value, but improvements can be defined with respect to the starting point. Moreover, additional criteria (e.g., work assignment), related to the different perspectives that can drive the decomposition of a system, usually have to be accounted for when modularizing or restructuring it.
   While the relative encapsulation and decomposition improvements determine the benefits of restructuring the program, a further element affecting the final decision is cost. Estimating the effort required to reorganize a program according to a new modular structure is a hard task. Nevertheless, a first coarse-grained indicator is given by the distance between the partition of the functions in the original modules and in the new ones. Such a notion of distance is defined in this paper and an algorithm for computing it is also provided. The encapsulation and decomposition measures, together with the distance from the original modularization, give a complete picture of the required intervention. It is possible to graphically represent the trade-off discussed above and to allow the programmer to choose among the available alternative modularizations computed from concept subpartitions.
   Experimental results suggest that concept analysis is an effective tool to drive module restructuring. Ten public domain and 10 industrial programs were analyzed in the three contexts (dynamic memory access, function signature types, and global variable use). For all the considered programs, the retrieved extended concept subpartitions provide alternative modularizations which improve encapsulation and/or decomposition metrics with respect to the original programs. The cost associated with each candidate transformation was evaluated and used to guide the selection. Programs having no encapsulation violations at all in any of the three considered contexts were used as oracles, assuming that their actual modularization is a good one and corresponds to a common purpose. Concept analysis was able to exactly reconstruct the same modularization on about half of them and produced a very close modular structure on the remaining ones. In some case studies, the restructuring interventions suggested by concept analysis and selected by examining encapsulation, decomposition, and cost were actually implemented with the purpose of gaining knowledge about the real effort required. The results show that improving encapsulation can be effectively supported by concept analysis and that the initial directions obtained through it are extremely [...]
   The paper is organized as follows: The next section presents the related work. Section 3 describes the basic elements of concept analysis, concept partitions, and the proposed concept combinations represented by concept subpartitions. The last topic of this section deals with two novel metrics for encapsulation and decomposition assessment. In Section 4, the notion of partition distance is introduced as a means of evaluating restructuring costs. Section 5 gives experimental results obtained for a test suite of public domain and industrial programs. Finally, Section 6 is devoted to the conclusions.

2   RELATED WORK
The related work deals with the identification of abstract data types and objects in the code. In [13], the main methods for object identification are classified as global-based or type-based, respectively, when functions are clustered around globally accessible objects or formal parameter and return types. A new identification method, based on the concept of receiver parameter type, is also proposed. The approach presented in [3], which considers accesses to global variables, uses an internal connectivity index to decide which functions should be clustered around the recognized object. Such a method is extended in [4] to include type-based relations and it is combined with the strong direct dominance tree to obtain a more refined result. The recovery technique described in [24] builds a graph showing the references of procedures to structure internal fields. Accesses to global variables drive the recognition of object instances.
   Atomic components are detected and organized in a hierarchy of modules, according to the method described in [8]. Three kinds of atomic components are considered: abstract state encapsulations, grouping global variables and accessing procedures; abstract data types, grouping user-defined types and procedures with such types in their signature; and strongly connected components of mutually recursive procedures. Dominance analysis is used to hierarchically organize the retrieved components into subsystems.
   A radically different group of approaches for extracting software components with high internal cohesion and low external coupling exploits the computation of software metrics. The ARCH tool [19] is one of the first examples of embedding the principle of information hiding turned into a measure of similarity between procedures within a semiautomatic clustering framework. Such a method
incorporates a weight tuning algorithm to learn from the design decisions in disagreement with the proposed modularization. In [2], [5], the purpose of retrieving modular objects is reuse, while, in [18], metrics are used to refine the decomposition resulting from the application of formal and heuristic modularization principles. Another different application is presented in [11], where cohesion and coupling measures are used to determine clusters of processes. The problem of optimizing a modularity quality measure based on cohesion and coupling is approached by means of genetic algorithms in [15], which are able to determine a hierarchical clustering of the input modules. Such a technique is improved in [16] by the possibility of detecting and properly assigning omnipresent modules, of exploiting user provided clusters, and of adopting orphan modules. In [14], a complementary clustering mechanism is applied to the interconnections, resulting in the definition of tube edges between subsystems.
   In [9], the star diagram is proposed as a support to help the programmer restructure a program by improving its encapsulation of abstract data types. Another decomposing and restructuring system is described in [17]. Both of them provide sophisticated interaction means to assist the user in the process of analyzing and restructuring a program.
   The works most relevant to the presented approach are applications of concept analysis to the modularization problem. In [7], [10], [21], concept analysis is applied to the extraction of code configurations. Modules associated with specific preprocessor directive patterns are extracted and interferences are detected. The relation between procedures and global variables is analyzed by means of concept analysis in [12]. The resulting lattice is used to identify module candidates. Violations of encapsulation are represented in the lattice and can be automatically handled. The lattice can also be transformed so as to become more suitable for modularization by exploiting the block relations, additional procedure/global variable relations that extend the original ones. Concept analysis is used in [20] to identify modules by considering both positive and negative information about the types of the function arguments and of the return value. Concept partitions correspond to possible modularizations of the program. In this author's previous work [23], encapsulation around dynamically allocated memory locations is considered. Points-to analysis is used to determine dynamic memory accesses, while concept analysis permits grouping functions around the accessed dynamic locations. The resulting clusters are plotted on a new diagram, the O-A (Objects-Attributes) diagram, allowing for the selection of the concepts more suitable to drive the restructuring process. Concept analysis is exploited in [22] to reengineer class hierarchies. A context describing the usage of a class hierarchy is the starting point for the construction of a concept lattice from which redesign hints can be derived.
   The main difference between module restructuring based on clustering and module restructuring based on concepts is that the latter is intrinsically able to characterize the restructured modules semantically, while the former builds modules according to cohesion and coupling metrics. In fact, a concept is a grouping of programming entities (e.g., functions) that share common attributes. Such attributes can be interpreted as a description of the commonalities within each module. On the contrary, modules recovered by means of clustering have to be inspected to trace metrics values back to the attributes originating them.
   Module restructuring methods based on concepts suffer from the difficulty of determining partitions, i.e., nonoverlapping and complete groupings of program entities. In fact, concept analysis does not assure that the candidate modules it determines are disjoint and cover the whole entity set.
   The novelty in the approach proposed in this paper is the use of concept subpartitions instead of concept partitions. The idea is that the overly restrictive constraint of partitions, requiring that the whole object set is covered, can be removed, thus exploiting all the information retrieved through concept analysis and otherwise lost with the concepts that are disregarded since they do not form a complete partition. In addition, this paper proposes two effective metrics for evaluating the benefits of restructuring and a proper distance measure to estimate restructuring costs. The graphical representation of all these factors drives the programmer in the selection of the subpartitions of interest.

3   CONCEPT ANALYSIS AND ITS SUPPORT FOR [...]

3.1 Basics
In this paper, concept analysis is not presented in detail. For a primer, the interested reader can refer to [20]. Only the basic definitions are introduced and the results obtained for a small example are discussed to informally illustrate the general ideas. In the following, the reference problem is the decomposition of a procedural program into modules containing groups of functions. In C, this corresponds to the organization of functions within different files.
   Concept analysis permits grouping objects that have common attributes. In the application of concept analysis to modularization, objects are functions, while attributes are properties of functions related to their encapsulation inside modules. Examples of such attributes are the accesses to global variables, the accesses to dynamic locations, and the presence of a user-defined structured type in the signature, including return type. Concept analysis is a general framework, rather than a specific modularization technique, that can be specialized by the particular choice of attributes that are considered in evaluating encapsulation. Combinations of different kinds of attributes and the negation of attributes can be used as well.
   The starting point for concept analysis is a context $(O, A, R)$, consisting of a set of objects $O$, a set of attributes $A$, and a binary relation $R$ between objects and attributes, stating which attributes are possessed by each object. A concept is a maximal collection of objects that possess common attributes, i.e., it is a grouping of all the objects that
share a common set of attributes. More formally, a concept is a pair of sets $(X, Y)$ such that:

               $$X = \{ o \in O \mid \forall a \in Y : (o, a) \in R \}, \qquad (1)$$

               $$Y = \{ a \in A \mid \forall o \in X : (o, a) \in R \}. \qquad (2)$$

$X$ is said to be the extent of the concept and $Y$ is said to be the intent. There are several algorithms for computing the concepts for a given context. The simple bottom-up algorithm described in [20] was used for this work.
   The key observation for using concept analysis is that a module or abstract data object corresponds to a formal concept. Let us consider, for example, the accesses to dynamic memory. A concept consists of a set of functions operating on a set of dynamic locations, while such locations are not simultaneously accessed by a function outside the concept.

                           TABLE 1
                       Example of Context

                      HEAP_1    HEAP_2    HEAP_3
               f_1      ✓         ✓
               f_2      ✓         ✓
               f_3      ✓                   ✓

The objects are the functions $f_1, f_2, f_3$ and the attributes are the accesses to the dynamic memories $HEAP_1, HEAP_2, HEAP_3$.

   An example of context is given in Table 1. The set of objects consists of the three functions, $f_1, f_2, f_3$, and the attributes are the three dynamic locations, $HEAP_1, HEAP_2, HEAP_3$, representing three unnamed data structures that are dynamically created on the heap (e.g., via malloc, in C). Table 1 indicates (with a tick) the direct access of a function to some internal field of a dynamic location, thus, e.g., $f_1$ accesses $HEAP_1$ and $HEAP_2$, while $f_3$ accesses $HEAP_1$ and $HEAP_3$. After applying concept analysis to this example, the following concepts are identified:

   . $c_1 = (\{f_1, f_2, f_3\}, \{HEAP_1\})$.
   . $c_2 = (\{f_1, f_2\}, \{HEAP_1, HEAP_2\})$.
   . $c_3 = (\{f_3\}, \{HEAP_1, HEAP_3\})$.
   . $c_4 = (\{\}, \{HEAP_1, HEAP_2, HEAP_3\})$.

   Concept $c_1$ indicates that all the three functions share access to $HEAP_1$. $c_2$ states that $f_1$ and $f_2$ both access $HEAP_1$ and $HEAP_2$. $f_3$ is the only function accessing both $HEAP_1$ and $HEAP_3$ (concept $c_3$), while no function has the property of accessing all dynamic locations ($c_4$).

3.2 Concept Partitions
Concepts are good candidates for the organization of functions into modules. In fact, each concept is, by definition, characterized by a high cohesion of its objects around the chosen attributes. However, concepts may have extents with nonempty intersections and, thus, not every collection of concepts represents a potential modularization. To address this problem, the notion of concept partition was adopted (see, for example, [20]). A concept partition consists of a set of concepts whose extents are a partition of the object set $O$. $CP = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is a concept partition iff:

               $$\bigcup_{i=1}^{n} X_i = O \quad \text{and} \quad \forall i \neq j,\; X_i \cap X_j = \emptyset. \qquad (3)$$

   A concept partition allows assigning every function in the considered context to exactly one module. In the example discussed above, the two following concept partitions can be determined:

   . $CP_1 = \{c_1\}$.
   . $CP_2 = \{c_2, c_3\}$.

The first partition contains just one concept, $c_1$, and corresponds to modularizing the program by inserting all three functions, $f_1, f_2, f_3$, in the same module, on the basis of their shared access to $HEAP_1$. The second partition generates a proposal of modular organization in which $f_1$ and $f_2$ are inside a module, since they access both $HEAP_1$ and $HEAP_2$, while $f_3$ is put inside a second module for its access to $HEAP_1$ and $HEAP_3$. It should be noted that the second modularization permits a violation of encapsulation since functions of different modules access a shared dynamic location, namely $HEAP_1$. It ensures that no function outside $c_2$ accesses both $HEAP_1$ and $HEAP_2$, but $HEAP_1$ alone is accessible. This example gives a deeper insight into the modularization associated with a concept partition: Even in cases in which the only modularization that does not violate encapsulation is the trivial one, with all functions in a module, concept analysis can extract alternative modularizations that do not ensure that everything is encapsulated, but are based on common attributes. In such a case, the residual violations of encapsulation may be considered acceptable or may be removed with the introduction of proper accessor/modifier functions.

3.3 Concept Subpartitions
Concept partitions introduce an overly restrictive constraint on concept extents by requiring that their union covers all functions in the program. In many practical cases, the only concept partition able to satisfy such a constraint contains just one concept whose extent is the set of all program functions. Consider, for example, the case of a program with a function that possesses no attribute (in the example above, an additional function $f_4$ that does not access dynamic locations). Such a function can only be in the extent of a concept with empty intent, together with all other functions. The only associated concept partition is the trivial one, with all functions grouped in the extent of the only concept of the partition. More generally, when concepts are disregarded because they cannot be combined with other concepts to cover all functions, important information that was identified by concept analysis is lost without reason. The usefulness of a group of concepts in identifying meaningful organizations of functions around shared attributes should not be limited by the unnecessary requirement that all functions are covered. In this paper, the notion of concept subpartition in which the overly restrictive constraint is removed is proposed to replace concept partitions. A concept subpartition associated with a
given context is a set of concepts with disjoint extents. $CSP = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is a concept subpartition iff:

               $$\forall i \neq j,\; X_i \cap X_j = \emptyset. \qquad (4)$$

Concept partitions are particular cases of concept subpartitions where the union of the extents is the set $O$ of all objects.

3.4 Object Partitions
Partitions of the object set represent possible modularizations of a program.2 The actual modules in a program can be regarded as an actual object partition of the program since they group the functions of the program according to the source file they belong to. Such an object partition will be referred to as the original object partition of the program and is associated with the original modularization of the program.
   A concept subpartition induces a subpartition of the object set, which in turn can be extended to an object partition. The object subpartition induced by a concept subpartition $CSP = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is the set of the extents, $\{X_1, \ldots, X_n\}$. It can be transformed into an object partition, with reference to the original partition $P$, by means of the partition subtraction (sub) operator:
Definition 1 (Partition Subtraction). The partition subtraction of an object subpartition $SP$ from an object partition $P$ gives the subpartition complementary to $SP$ with respect to $P$. It can be obtained by subtracting the union of the sets in $SP$ from each set in $P$.

3.5 Encapsulation Violations
A quality factor of a modularization is its ability to encapsulate functions around shared attributes. A measure of such ability is the count of the violations of encapsulation associated with a given modularization of a program. The considered modularization may be either the original one or that proposed by concept analysis through concept subpartitions. To evaluate the number of violations of encapsulation, each attribute of the considered context has to be assigned to one of the object sets (modules) in the modularization. Then, the count of the attributes possessed by the objects in a module and assigned to a different module gives the number of violations.
Definition 3 (Attribute Assignment). Given a context $(O, A, R)$ and an object partition $P$, the attributes assigned to each module $M_k$ are those with the highest number of accesses from $M_k$. An attribute $a$ is assigned to the object set $M_k$ of the object partition $P$ iff $M_k$ is the set with the maximum number of objects possessing $a$.

               $$a \in Attr(M_k) \;\text{ iff }\; k = \arg\max_i \left| \{ (o, a) \in R \mid o \in M_i \} \right|,$$

where $Attr(M_k)$ is the set of attributes assigned to the object set $M_k$. The maximum cardinality of the considered subset of $R$ may be associated with multiple indexes $i$. In such cases, $\arg\max$ randomly chooses one of them. It will be shown that this arbitrary choice has no impact on the count of encapsulation violations.
                                                                           Definition 4 (Encapsulation Violations). Given a context
       € su˜ ƒ€ ˆ fwk ˆ wi À              wj jwi P € gX                      …yY eY ‚† and an object partition € , the encapsulation
                                         wj Pƒ€                              violation count i† …€ † is the total number of objects in each
                                                                             object set wi of € that possess an attribute assigned to a
€ sub ƒ€ is itself a subpartition because sets in € are                      different object set wj of € .
disjoint and remain such after the subtraction. The subtrac-
                                                                                i† …p† ˆ jf…oY —† P ‚jo P wi Y — P ettr…wj †Y i Tˆ jgjX
tion operator can be used to extend subpartitions to
partitions:                                                                    With reference to the example in Table 1, let us assume
Definition 2 (Subpartition Extension). An object subparti-                 that the original modularization of the program
  tion ƒ€ can be extended to an object partition , with                   is € ˆ fwI Y wP g ˆ fffI gY ffP Y fQ gg. Attributes rie€I Y
  reference to an original partition € , by the union of ƒ€ and            rie€P Y rie€Q can be assigned to the modules as follows:
  the subtraction of ƒ€ from € . The empty set is not considered           ettr…wI † ˆ frie€P g and ettr…wP † ˆ frie€I Y rie€Q g.
  an element of .                                                         In fact, rie€I and rie€Q are possessed, respectively, by
                                                                           two objects in wP vs. one object in wI and one object in wP
                    ˆ ƒ€ ‘ …€ su˜ ƒ€ † À YX                               vs. no object in wI . The attribute rie€P has one access
                                                                           from both wI and wP , thus it was arbitrarily assigned to
    If, for example, € ˆ fffI gY ffP Y fQ gg represents the
                                                                           wI . The resulting encapsulation violation number is two
original modularization of a program and ƒ€ ˆ fffI Y fP gg
                                                                           since fI P wI accesses rie€I , assigned to wP , and fP P
is the subpartition associated with a concept subpartition of
                                                                           wP accesses rie€P , assigned to wI . It should be noted
the program, the subtraction of ƒ€ from € gives ffgY ffQ gg,
                                                                           that the choice of assigning rie€P to wP would not
i.e., it gives all the functions not covered by the subpartition
                                                                           change the encapsulation violation count since exactly one
and grouped according to the original modularization. The
                                                                           violation in the access to rie€P would remain due to its
extension of ƒ€ is therefore  ˆ fffI Y fP gY ffQ gg.
                                                                           access from fI P wI . More generally, if the same maximum
    Extending subpartitions to partitions allows one to also
                                                                           number of accesses is detected in more than one module,
obtain a modularization of all the functions in the program
                                                                           all accesses are violations except those done by the chosen
in cases in which concept subpartitions instead of partitions
                                                                           module with no regard to the particular choice of the
are used. The extension involves considering the original
                                                                           module to which the attribute is assigned. If an extended
grouping of the functions into modules and using it to
                                                                           object subpartition of the example above is  ˆ fwI Y wP g ˆ
complete the subpartition.
                                                                           fffI Y fP gY ffQ gg (it is the object subpartition associated with
   2. Only object partitions not containing the empty set are considered   concept ™P ), attributes can be assigned as follows:
since adding a fictitious module with nothing inside is meaningless.       ettr…wI † ˆ frie€I Y rie€P g and ettr…wP † ˆ frie€Q g.

The encapsulation violation number becomes one and
accounts for the access from $f_3 \in M_2$ to $\mathit{HEAP}_1$.
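
The computations of Sections 3.4 and 3.5 can be replayed on the three-function example with a minimal Python sketch. It rests on several assumptions: the function names (`partition_sub`, `extend`, `assign_attributes`, `violations`) are hypothetical, the relation `R` is inferred from the worked example in the text because Table 1 is not reproduced in this section, and ties in the attribute assignment are broken by picking the first module, whereas the text only requires an arbitrary choice.

```python
def partition_sub(P, SP):
    """P sub SP: subtract the union of the sets in SP from each set in P."""
    covered = frozenset().union(*SP)
    return [M - covered for M in P]


def extend(SP, P):
    """Q = SP united with (P sub SP), with the empty set dropped."""
    return [M for M in list(SP) + partition_sub(P, SP) if M]


def assign_attributes(P, R):
    """Map each attribute to the index of the module with most objects possessing it.

    Ties go to the lowest module index (an assumption; the text says 'arbitrary').
    """
    owner = {}
    for attr in sorted({a for _, a in R}):
        counts = [sum(1 for (o, a) in R if a == attr and o in M) for M in P]
        owner[attr] = counts.index(max(counts))
    return owner


def violations(P, R):
    """EV(P): accesses (o, a) whose object lies outside the module owning a."""
    owner = assign_attributes(P, R)
    return sum(1 for (o, a) in R if o not in P[owner[a]])


# Access relation inferred from the worked example (assumption, Table 1 not shown).
R = {
    ("f1", "HEAP1"), ("f1", "HEAP2"),
    ("f2", "HEAP1"), ("f2", "HEAP2"),
    ("f3", "HEAP1"), ("f3", "HEAP3"),
}
P = [frozenset({"f1"}), frozenset({"f2", "f3"})]   # original modularization
SP = [frozenset({"f1", "f2"})]                      # object subpartition
Q = extend(SP, P)                                   # extended to a partition

print([sorted(M) for M in Q])
print(violations(P, R))  # 2
print(violations(Q, R))  # 1
```

Under these assumptions the sketch gives two violations for the original modularization and one for the extended subpartition, in line with the counts derived in the text.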
3.6 Decomposition
The number of violations of encapsulation cannot be the
only measure that drives modularization. In fact, the trivial
modularization with all functions in a single module has an
encapsulation violation count of zero but is not acceptable.
The second factor that affects the quality of a modulariza-
tion is its ability to decompose the system into smaller,
more manageable, and meaningful subsystems. Therefore,
an evaluation of the quality of a modularization should
include a measure of the decomposition associated with it.
Given an object partition $P$, a simple decomposition measure is given by its size:

$$\mathit{Dec}(P) = |P|.$$

The number of modules in which the system is split is thus used to account for the level of decomposition of the system.
   Having few encapsulation violations and high decomposition are opposite requirements in the choice of a modularization of the program. In extreme cases, it is possible to obtain $\mathit{EV}(P) = 0$ by inserting all functions in a single module, but the corresponding decomposition is the minimal possible: one. At the other extreme, the highest decomposition is obtained by inserting a single function into each module. In this case, the decomposition metric is maximal and equal to the number of functions in the program: $\mathit{Dec}(P) = |\mathit{Func}(P)|$, but the corresponding encapsulation violation number is also maximal: $\mathit{EV}(P) = |R| - |A|$. In fact, every attribute is arbitrarily assigned to one of the modules accessing it since each module performs at most one access. All accesses are violations except for those made by the modules to which the attributes are assigned. Their number is equal to the number of attributes, $|A|$, since each such module performs just one access (under the hypothesis that no unaccessed attributes exist).
   In real cases, the number of encapsulation violations should be limited and, at the same time, decomposition of the system should be encouraged. For a given program, it is possible to assess the actual decomposition and encapsulation levels through the metrics proposed above. A restructuring intervention aimed at improving the modularization of the program should compare the new decomposition and encapsulation levels with the original ones. An additional element to be considered is the cost of the modification. A way to obtain a rough indication of such cost is described in the next section.

4     DISTANCE BETWEEN OBJECT PARTITIONS

The actual modular structure of a program must be compared with the modularization proposals coming from concept analysis to gain indications on the cost of restructuring. For this reason, a notion of distance between object partitions is developed. In the following, the notion of elementary transformation is introduced. Then, it is used to define a measure of distance between partitions. Finally, an algorithm to compute such a distance is given. Partitions are assumed not to contain the empty set.

Fig. 1. Three possible transformations of a partition produced by an elementary move.

Definition 5 (Elementary Transformation). Given a partition $P$, a new partition $P'$ is produced from $P$ by applying an elementary transformation $t$ if one object is moved from a set of $P$ into another different set of $P$ or is removed from a set of $P$ and generates a new singleton set.

   The three situations that can occur when an elementary transformation is applied to a partition are depicted in Fig. 1. In Case 1, the cardinality of $P$ is not changed. An object is removed from a set that does not become empty and is added to an already existing set. In Case 2, the cardinality of $P$ is incremented because an object is removed from a set that does not become empty and generates a singleton set. Finally, in Case 3, the cardinality of $P$ is decremented because an object is removed from a singleton set that becomes empty and is added to an already existing set. Note that the empty set that is generated by this move is not considered as belonging to the partition.

Definition 6 (Partition Distance). The distance between two partitions is the minimum number of elementary transformations that can be applied to the first partition to produce the second partition.

$$d(P, Q) = \min \{\, n \mid P \xrightarrow{t_1} P_1 \rightarrow \cdots \xrightarrow{t_n} Q \,\}.$$

   The existence of such a measure for any pair of partitions follows from the possibility of transforming any partition into any other arbitrary partition through a sequence of elementary moves. A way to do this is to reduce the partition to a collection of singleton sets by means of the second move in Fig. 1. Then, such sets can be aggregated to obtain any desired partition by means of the third move in Fig. 1. It is straightforward to show that the above definition satisfies the requirements of distance. The axioms of distance require that the following conditions hold for any partitions $P$, $Q$, $R$:

   1. $d(P, Q) \geq 0$ and $d(P, Q) = 0$ iff $P = Q$. Being a natural number, the partition distance is greater than or equal to zero. It is zero when a partition $P$ can be transformed into $Q$ with zero elementary moves, i.e., when $P$ and $Q$ do not differ; vice versa, if they do not differ they can be transformed into each other with zero moves.

   2. $d(P, Q) = d(Q, P)$. Symmetry follows from the observation that every elementary move has an inverse. Move 1 in Fig. 1 has itself as an inverse because the object $o_k$ can be reinserted into $S_{n1}$ by extracting it from $S_{n2}$, which does not become empty since $S_{n2}$ was not empty initially. Move 2 has Move 3 as its inverse and, vice versa, Move 3 has Move 2 as its inverse. In fact, Move 2 extracts an object and generates a singleton set, while Move 3 inserts the object of a singleton set into an already existing set. Thus, any minimal sequence that transforms $P$ into $Q$ has an inverse of the same length and no shorter sequence can transform $Q$ into $P$ because its inverse would otherwise be the minimal sequence from $P$ to $Q$.
   3. $d(P, Q) \leq d(P, R) + d(R, Q)$. The concatenation of the minimal sequence from $P$ to $R$ and from $R$ to $Q$ is a legal sequence of elementary transformations from $P$ to $Q$. Therefore, the minimal sequence from $P$ to $Q$ can only be shorter than or equal to such a concatenation.

   Fig. 2 shows the pseudocode of an algorithm that computes the distance between two object partitions. It is a recursive algorithm ending when the two input partitions are equal and, thus, their distance is zero. If the two partitions are not equal, the minimum number of elementary transformations to convert the first one into the second one has to be determined. $N$, the total number of objects in each partition, is initially assigned to the support variable min. In fact, this is an upper bound for such a minimum. Then, for each pair of sets from the two partitions that are different and have a nonempty intersection, the elementary transformations to turn the first one into the second one are applied. Pairs with empty intersection can be disregarded since they originate longer sequences of elementary transformations in that no object can remain in the original set. Now, two new partitions are computed in which the transformation of the paired sets is completed. The number of elementary moves to accomplish this transformation is the cardinality of the symmetric difference (indicated with $\Delta$) between the selected sets. In fact, this is the number of objects that are moved from the first set to their final destination or from the second set into the first one. It has to be augmented with the number of moves necessary to transform the two new partitions one into the other, i.e., with the recursively computed distance between the two new partitions. Finally, the minimum is returned as the result of the computation.

Fig. 2. Pseudocode of an algorithm that computes the distance between two object partitions.

   Let us consider the object partition $P$ associated with the concept partition $CP_2$ of the example in the previous section, $P = \{F_1, F_2\} = \{\{f_1, f_2\}, \{f_3\}\}$. If the actual modules of the program are $G_1 = \{f_1\}$ and $G_2 = \{f_2, f_3\}$, the original object partition is $Q = \{G_1, G_2\} = \{\{f_1\}, \{f_2, f_3\}\}$. The distance between the two partitions can be computed by applying the algorithm in Fig. 2. The pairs of sets with nonempty intersection that are considered for transformation are $(F_1, G_1)$, $(F_1, G_2)$, $(F_2, G_2)$. When each of the three transformations is completed, the new partitions become equal and the recursive distance is zero. The symmetric difference size is, respectively, one, two, one and, thus, the minimum is one. If a concept subpartition is considered instead of a partition, it has to be extended to an object partition first.
   The above notion of distance between object partitions is appealing in the context of module restructuring because elementary transformations correspond to moving a function from a module into another module. This can be considered a unit of measure for the restructuring effort paid when the decision is to reorganize the modularization by moving some functions across modules. It is a coarse-grained cost measure to be weighted with an estimate of the interventions required by the move, but it is a first indication giving the total number of such moves. On the other hand, the distance between object partitions does not account for a second decision that can be taken: The functions can remain in their original module and the violations of encapsulation are resolved by modifying the code of the functions or they are considered acceptable and no intervention is performed to remove them. Therefore, the cost of moving functions between modules is not the only factor to examine: The presence of residual violations has to be evaluated. In addition, the new modularization of the program should not worsen the level of decomposition in order to gain in encapsulation. To summarize, to get the whole picture of costs and benefits of a module restructuring intervention, the encapsulation and decomposition levels should be compared with the initial ones and the cost of each restructuring alternative should be estimated.

5   EXPERIMENTAL RESULTS

The proposed approach to module restructuring based on concept analysis was applied to 10 public domain and 10 industrial programs, written in C. The front end of CANTO [1] (Code and Architecture Analysis Tool)

                                                              TABLE 2
                                 Test Suite of Public Domain (Top) and Industrial (Bottom) Programs

The size of the programs in Lines Of Code (LOC) is given in Column 2. Columns 3 and 4 contain the number of functions and modules. The number
of objects and attributes for each of the three considered contexts is shown in the next columns.

was used to extract the information needed for concept analysis from the code.
   CANTO [1] is composed of several subsystems: a front end to parse C code, an architectural recovery environment, a static analyzer, an interface for graph displaying, and a customized editor. The user, in a closed loop, can analyze a system, navigate through different views by means of a graphical user interface, generate queries and new views, and add and remove components, subsystems, and code to accomplish maintenance tasks. Among the static analyses available from CANTO, the points-to analysis [6] is the most important for the present work since it provides a static solution to the problem of determining the accesses to dynamically allocated data structures. In fact, the result of points-to analysis is a set of points-to pairs associating pointers to the (possibly) pointed-to locations, where the locations may either be static or dynamic. Results are approximate (exact solutions are in general not computable) but safe, i.e., the pointed-to locations are possibly a superset but never a proper subset of the true set.
   Three different kinds of attributes were considered for encapsulation improvement: the accesses to dynamically allocated memory locations, the structured types in the function signature, including the return type, and the definitions and uses of global variables. Correspondingly, three contexts were generated for each program and restructuring directions were obtained by concept analysis, aimed at improving the encapsulation, respectively, of dynamic memories, structured types, and global variables.

5.1 Test Suite
Table 2 contains the public domain³ programs at the top, while the industrial programs are listed at the bottom. Names of the programs in the industrial test suite are not given for reasons of confidentiality. Their application domain ranges from banking to telecommunications, computer-aided design, and multimedia database management. The table gives the size of each program in Lines Of Code (LOC). The next columns contain the number of functions and the number of modules. Then, for each of the three considered contexts, the associated number of objects and attributes is shown.

   3. Actually, most programs in the public domain test suite are distributed under the GNU General Public License as published by the Free Software Foundation.

   By considering the total size in LOC and the number of modules (Table 2), it can be noted that programs in the test suite exhibit a high variability in the granularity of modules. More particularly, the size of each individual module in the public domain programs ranges from one LOC to 14,662 LOC, with an average of 702.5, while, in the industrial test suite, it is between 69 and 4,949 LOC (average 1,742.9). The number of functions per module is also an indicator of high variability. In fact, in the public domain code, modules contain one to 130 functions each (with an average of 9.1), while industrial modules contain a number of functions between one and 56 (average 4.9). This is an indicator of the strong dependence of module granularity on the application domain, the programming style, the development software, and many other factors resulting in a high variability of module size and function number.
   For the first considered context, the number of objects in Table 2 is the number of functions accessing some dynamic memory, while the number of attributes is the number of dynamic locations. In the second context, only the functions with structured types in the signature are considered, and the number of such types is the number of attributes. Finally, the last context relates functions to global variables. On average, the number of functions involved in the three contexts is, respectively, 57.8, 41.2, and 111.3, while the number of attributes is 38.4, 9.0, and 313.7. Thus, the third context involves about twice the number of functions in the

                           TABLE 3
Original Number of Encapsulation Violations for the Public Domain and Industrial Programs in the Three Considered Contexts

                           TABLE 4
Number of Concepts for the Public Domain and Industrial Programs in the Three Considered Contexts

first and second contexts, while the attributes in the three contexts are highly variable in number, reaching a maximum in the third context again.
    The organization of functions into modules, i.e., their distribution among source files, was considered in order to assess the initial number of encapsulation violations for each of the three contexts. Table 3 contains such values, representing the number of functions accessing attributes of another module. Regarding the accesses to dynamic memory, all public domain programs show some violations of encapsulation, with only one exception, gdbm. Industrial programs have far fewer violations in the access to dynamic memory. In fact, only two programs have modules accessing dynamic locations not belonging to them. The second context is the one with the minimum number of encapsulation violations in the programs considered. Structured types of different modules are in the signature of some functions only in five public domain and three industrial programs. In addition, the number of violations is generally low. Finally, the access to global variables by external modules is very frequent in that all programs exhibit some violations of this kind. This could indicate that global variables are commonly used as a means to exchange information between modules, rather than a data structure around which to encapsulate the related computation.
    Encapsulation violations considered in Table 3 simply obey the rule that a module has a function accessing an attribute from another module. This is not always undesired behavior. For example, global variables may be intentionally shared among modules, types could be in the signature of functions that do not manipulate them but act as accessors returning the structure to be manipulated by means of encapsulated functions, and dynamic strings may be accessed from anywhere without violating encapsulation unless strings are themselves encapsulated. Therefore, a better starting point for restructuring is a context in which only attributes intended to be encapsulated are inserted. Such work is very expensive, especially if it has to be replicated on every program in the test suite and for every context. Thus, restructuring was evaluated in a blind way, considering all retrieved attributes as candidates for encapsulation. In a more realistic use, a manual selection of the relevant attributes is preliminarily performed and only the related violations are considered. This approach was followed in some case studies taken from the presented test suite and discussed below.
   Concept analysis was performed for the 20 programs considered on a Sun SPARC 20 with 64 MB of internal memory and 1 GB of swap area under normal load conditions. Table 4 contains the number of concepts found for each program in each context. No concept was determined for those programs with empty context. The third context, access to globals, which has the highest number of objects and attributes, is the one that generates the highest number of concepts. Then, concepts were combined to form concept subpartitions. The number of possible combinations of $k$ concepts taken from a set of $n$ concepts is the binomial coefficient of $n$ and $k$. Therefore, the total number of subpartitions to check could be exponential in the number of concepts. A timeout of 10 hours was fixed to stop subpartition computation in cases in which the number of concepts is too high. Subpartitions are formed in increasing order so that, when the computation is stopped, higher order subpartitions are not determined. For the considered programs, it was possible to complete such a computation for all the contexts in which no more than 30 concepts were found.
   The average number of subpartitions determined within the 10-hour timeout is 183,703.4, 52,912.9, and 161,426.9 in the three contexts, respectively. If concept partitions are considered instead, such average numbers dramatically decrease to 1.17, 1.61, and 1.05, respectively. In fact, in many cases, the only disjoint concept combination that covers the whole object set is the top concept, with all objects in the extent and typically empty intent.

Considering subpartitions has an experimental validation in its capability to extract many
nontrivial concept combinations that are otherwise missed.
   Concept subpartitions were extended to be partitions of the whole object set. To evaluate
the resulting modularization against the original one, the proposed measures of
encapsulation violations, decomposition, and partition distance have been employed. For
subpartitions with the same encapsulation violations and decomposition, the one with the
minimum distance was chosen. The diagram representing this distance for each associated
encapsulation violation and decomposition level was computed for every program in the test
suite. The levels of encapsulation violations and decomposition were considered relative to
the original ones by computing the ratio between the two. Ratios also permit a comparison
between restructuring actions on different programs.
   An example of such a diagram for the minicom program, in the first considered context, is
shown in Fig. 3. The shapes of the diagrams for the other programs are slight variants of
that in Fig. 3. A cost equal to zero is placed at the coordinates (1, 1) since this is the
initial level of decomposition and encapsulation violation. Ratios between the
encapsulation violations in the restructured and in the original programs are low, thus
indicating that the modularizations determined by concept analysis are consistent with the
choices made by the programmers. They are comparable in granularity and organization.
Furthermore, they often allow for the improvement of encapsulation and/or decomposition.
Points in the lowest region correspond to a reduction in the number of encapsulation
violations, while points in the rightmost region represent an increased decomposition. The
results depicted in Fig. 3 show that, for the minicom program, there are opportunities for
restructuring. In fact, several points are in the lowest rightmost region with fewer
encapsulation violations and increased decomposition. This is often true also for the other
programs in the test suite, within the first and third contexts, while, for the second
context, a decrease in encapsulation violations is often paid in terms of decreased
decomposition. In addition, for a given level of encapsulation violations, the restructuring
cost increases with the decomposition level. Typically, an improvement in encapsulation
violations can be obtained more easily (with a lower cost) if a decrease in decomposition is
accepted. Points on the horizontal axis represent solutions with no encapsulation violations
at all. In such cases, all accesses to the selected attributes do not cross the boundaries
of the modules.

Fig. 3. Program minicom: restructuring cost for different decomposition and violation
relative levels, in the first context (dynamic memory access).

Fig. 4. Program minicom: restructuring cost as a function of the decomposition (≥ 1) and
encapsulation violation (≤ 1) relative levels, considered independently, in the first
context (dynamic memory access).

   Fig. 4 shows the cost of reducing encapsulation violations or increasing decomposition
for the minicom program. The horizontal axis is divided into two intervals, from 0 to 1 and
from 1 to 2. Points in the [0, 1] range represent relative encapsulation violation levels
for the concept subpartitions. The associated restructuring cost, estimated as partition
distance from the original modularization, is the vertical displacement. Encapsulation
violation costs considered in this diagram are the minimum values with respect to the
decomposition levels. Points in the [1, 2] range represent restructuring costs to improve
decomposition. Minimum values with respect to the different encapsulation violation levels
are considered. This diagram is useful when restructuring focuses mainly on encapsulation,
accepting that decomposition may become worse, or, vice versa, on decomposition, with the
possibility of increasing encapsulation violations. The plot of the costs associated with
the restructured modularization found by concept analysis suggests that low levels of
encapsulation violations and high levels of decomposition require expensive restructuring
interventions. When reducing encapsulation violations, the associated restructuring cost is
not monotone for most programs in the test suite, thus indicating that substantial
improvement may be obtained at costs as low as those for minor improvements. On the
contrary, the cost for increasing decomposition has a more regular monotonic plot. Costs for
decreasing encapsulation violations are generally higher than costs for increasing
decomposition at the same relative improvement level, for most considered programs. The same
kind of plot for both costs can be observed for all three considered contexts, but the third
context is characterized by a much higher cost

range. Such high costs are associated with eliminating global variable accesses from outside
the modules defining them, i.e., making all global variables static.

5.2 Assessing Concept Analysis Modularization
In the first two contexts, there are some programs with no encapsulation violations at all.
They can be used to assess the modularization capability of concept analysis: Attributes are
already encapsulated in such cases and it is likely that the encapsulation is based on a
common purpose. Therefore, concept analysis should determine a subpartition which, when
extended, gives a modularization close to the original one.
   Table 5 gives the list of all the programs in the first and second contexts without
encapsulation violations and with a nonempty context. For each of them, the subpartition
without encapsulation violations with minimum distance from the original modularization was
determined. Such a distance is shown in the next columns, after the decomposition level,
given as a fraction of the original decomposition.

TABLE 5
Relative Decomposition of the Extended Subpartitions and Distance from the Original Modules
for the Programs with No Encapsulation Violations

   On five of the 12 examined programs, concept analysis was able to exactly reconstruct the
same modularization as in the original programs by only exploiting information about the
attributes (dynamic memory or signature types) of the involved functions. On nine of the
remaining programs, concept analysis modularization has a distance of 1 from the original
modularization and, in the last two cases, such distance becomes 2. Thus, when the
modularization extracted by concept analysis is not exactly the original one, it is very
close to it. Distance values of 1 or 2 correspond to removing one or two functions from the
original module and inserting them into a new module, thus increasing the decomposition
level.
   The cases with the remodularized program different from the original one basically have
two explanations. Some modules group functions that are logically related but do not share
any attribute mapped into a programming construct. For example, modules manipulating devices
at a low level use a file descriptor (represented as an integer) to access the devices. Such
a feature cannot be represented by a proper attribute that can be automatically extracted
from the code (checking accesses to integer variables is too coarse a condition). In the
other basic situation, modules cannot be characterized by only one kind of attribute.
Typically, the user can extend the signature-type-based context with other attributes
(dynamic location or global variable accesses) by exploiting the knowledge of the relevance
of the attribute for the searched modularization. By performing such an extension on most of
the programs in this category, it was possible to exactly reconstruct the original
modularization.
   Two programs, gzip and flex, need special explanation. In gzip, the two functions
_getopt_internal and getopt_long are extracted from their original module, getopt.c, by
concept analysis, the reason for this being that the other three functions in this module do
not manipulate struct option type data. Actually, two of them, namely my_strlen and
my_index, are general string manipulation routines that do not share anything with
_getopt_internal and getopt_long and are correctly kept separate. The other function,
exchange, shares the access to the command line string with the two encapsulated functions.
If this access is modeled as an additional attribute, concept analysis is able to group it
with the other two extracted functions.
   In flex, the module sym.c implements a symbol table. It exports several interface
functions to manipulate the symbol table, but it also contains the functions implementing
the hash table on which the symbol table is based. Concept analysis separates the hash table
management functions from the more general symbol manipulation functions and assigns them to
two distinct modules. By building an extended context based on the struct hash_entry type
and the access to struct hash_entry type dynamic locations, all low level functions
operating on a hash table can be isolated and extracted. Symbol table manipulation functions
use only interface functions to the hash table.

5.3 Case Studies
Some of the restructuring interventions suggested by concept analysis were actually
implemented on two of the analyzed programs to obtain a deeper insight into the required
actions and the resulting systems.
   less is a UNIX utility to display a text file on a terminal with the possibility of
backward movements. In the second context, shown in Table 6, it has just one encapsulation
violation that can be eliminated by incrementing the decomposition level. The distance of
this new modularization from the original one is three. The detected encapsulation violation
is due to the presence of type struct scrpos in the signature of functions store_pos and
get_pos from file ifile.c and function get_scrpos from file position.c. If all computation
on the struct scrpos type is encapsulated inside a separate file, two problems arise. As a
field of an ifile dynamic structure manipulated inside ifile.c is of type struct scrpos, the
new module accesses its private fields, thus violating encapsulation of dynamic memory
(first context). Such a violation can be considered acceptable as the new module exports all
operations on struct scrpos data. Furthermore, an accessor function returning such a field
is required in module ifile.c so that client modules can pass it to the new module without
violating encapsulation. In the new module scrpos.c,

the two functions get_pos and store_pos can be merged and become function copy_scrpos. When
encapsulated in scrpos.c, the ifile structure in their original signature is replaced by a
second scrpos structure. As a consequence, the action performed is the same, provided the
actual parameters are exchanged in calls to store_pos, with respect to get_pos. To avoid
accesses to an internal dynamic structure of position.c, two accessor functions are added to
this module, respectively, returning an index in a dynamic table and the value associated
with an index.

TABLE 6
Second Context (Structured Types) for the Program less

   Alternative solutions to improve encapsulation are considering the computation on struct
scrpos as a part of the computation performed inside position.c or inside ifile.c. The first
solution still has the disadvantage that private fields of a dynamic structure belonging to
ifile.c are manipulated from inside position.c. The second solution is probably the best one
since it eliminates all undesired accesses. In fact, if the three considered functions are
inserted into ifile.c, no external module manipulates the ifile structure fields, no
external function manipulates scrpos structures, and the accesses to the dynamic table from
position.c can be avoided by means of the two accessor functions discussed above.
   This example highlights that improving encapsulation is never a trivial task and
substantial work is required on the part of the programmer to evaluate the alternative
solutions and to take into account the whole picture. Nevertheless, the initial hints were
determined through concept analysis and shown to be very useful.
   minicom is a free communication program. Features include a dialing directory with
auto-redial, support for UUCP-style lock files on serial devices, a separate script language
interpreter, capture to file, and multiple users with individual configurations.
   In the first context, there are 22 encapsulation violations associated with a dynamic
location of type WIN. The data structures of this type implement a portable character-based
window system for which all manipulating functions are part of the module window.c.
Restructuring interventions found by concept analysis include a subpartition, with a cost
equal to 42, allowing a 55 percent reduction of the violations and an increased
decomposition. It was selected from the alternative subpartitions by exploiting the plots in
Figs. 3 and 4. Among the points in the plot of Fig. 3, the one at coordinates (1.06, 0.45)
with cost 42 exhibits an interesting trade-off between encapsulation, decomposition, and
cost. If the minimum cost to improve encapsulation is considered with no regard to
decomposition (Fig. 4), the same subpartition, positioned at coordinates (0.45, 42), appears
as the best choice by giving a high improvement at minimum cost.
   The selected subpartition consists of one concept with 45 functions in the extent and one
dynamic location of type WIN in the intent. By manually examining the statements inside
those functions accessing the WIN type dynamic structure from outside window.c, it is
evident that most of the accesses do not implement a meaningful and recognizable operation
on WIN data. Thus, such accesses can be replaced by simple get or set functions working on
WIN attributes. There are actually three functions performing a more general operation,
namely scrollback, drawhist, and getline from minicom.c. It is possible to incorporate them
in the window.c module, thus extending its operations on WIN data structures. The three
selected functions also operate on a global location named us, of type WIN, which is static
to their original module, minicom.c. Therefore, to move and extract them from the original
context, it is necessary to extend their signature so as to include a pointer to the global
WIN location that is manipulated. With some other minor changes, it was possible to
encapsulate such operations inside the module window.c and to obtain a new version of the
program with no encapsulation violation to the WIN type data structures and with the same
decomposition level.
   The final solution for the minicom program is slightly different from the one associated
with the selected subpartition. The reason is twofold: First, several violations were
removed by simply providing get and set attribute manipulation operations; second, the
functions recognized as meaningful manipulations to be encapsulated were inserted, rather
than becoming a new separate module, in the module window.c since this is the natural site
for them. As a consequence, the final decomposition level is unchanged, instead of
increased.
   This second case study highlights the blind nature of concept analysis with respect to
function semantics. All manipulations are considered equivalent, while a manual inspection
reveals that, for some of them, the availability of an accessor/modifier suffices, while
others require a deeper reworking, making them general encapsulated functions. Nevertheless,
concept analysis was a good starting point for the identification of the interventions to be
performed and the selected subpartition contained useful restructuring suggestions.
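
The selection procedure used throughout this section (keep the minimum-distance candidate for each pair of encapsulation violation and decomposition levels, express both levels as ratios to the original ones, then read off the minimum cost per violation level) can be sketched as follows. This is only an illustrative sketch: the original count of 22 violations and the point (1.06, 0.45) with cost 42 come from the minicom discussion, while the remaining count of 10 violations is an assumption consistent with a 55 percent reduction, and the decomposition values and candidates B, C, and D are hypothetical.

```python
from collections import namedtuple

# violations and decomposition are raw levels; distance is the restructuring cost.
Candidate = namedtuple("Candidate", "name violations decomposition distance")

ORIG_VIOLATIONS = 22     # stated for minicom in the first context
ORIG_DECOMPOSITION = 16  # hypothetical value, for illustration only

candidates = [  # hypothetical data; "A" mirrors the point discussed in the text
    Candidate("A", 10, 17, 42),
    Candidate("B", 10, 17, 57),
    Candidate("C", 10, 12, 48),
    Candidate("D", 0, 8, 70),
]


def cheapest_per_point(cands):
    """For equal violations and decomposition, keep the minimum-distance candidate."""
    best = {}
    for c in cands:
        key = (c.decomposition, c.violations)
        if key not in best or c.distance < best[key].distance:
            best[key] = c
    return best


def relative(c):
    """Levels relative to the original program: (decomposition, violations) ratios."""
    return (round(c.decomposition / ORIG_DECOMPOSITION, 2),
            round(c.violations / ORIG_VIOLATIONS, 2))


def min_cost_per_violation_level(best):
    """Minimum cost for each relative violation level, ignoring decomposition."""
    costs = {}
    for c in best.values():
        level = relative(c)[1]
        costs[level] = min(c.distance, costs.get(level, c.distance))
    return costs


best = cheapest_per_point(candidates)
points = {relative(c): c.distance for c in best.values()}
print(points)                              # {(1.06, 0.45): 42, (0.75, 0.45): 48, (0.5, 0.0): 70}
print(min_cost_per_violation_level(best))  # {0.45: 42, 0.0: 70}
```

The first dictionary corresponds to the kind of data plotted in Fig. 3, and the second to the encapsulation half of Fig. 4. Working with ratios rather than raw counts is what makes candidates from different programs comparable.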

6 CONCLUSION
This paper focused on the use of concept analysis for module identification. By extending
concept subpartitions to cover the whole object set, a modularization candidate is
determined for which the variations in encapsulation and decomposition are quantified. In
addition, a measure of distance from the original modular structure of the program provides
some indications of the cost of the restructuring interventions.
   The proposed approach to module restructuring was applied to 10 public domain and 10
industrial programs. Alternatives with respect to the original modularizations were
determined by concept analysis. The graphical plot of the restructuring cost for each
encapsulation and decomposition relative level was a helpful tool when determining the
selection of extended concept subpartitions. Concept analysis was also able to extract
modularizations identical or very similar to those in the programs without encapsulation
violations. This is a strong hint of the possibility of capturing the organization of
functions around the manipulated data structures by analyzing proper access attributes
through concept analysis. The execution of some complete restructuring interventions
suggested by concept analysis highlighted the nontrivial nature of such interventions, but
also reinforced the intuition that very useful suggestions can come from concept
subpartition computation, especially when coupled with encapsulation and decomposition
measures and restructuring cost estimates.

Paolo Tonella received the laurea degree cum laude in electronic engineering from the
University of Padua, Italy, in 1992, and the PhD degree in software engineering from the
same university, in 1999, with a thesis entitled "Code Analysis in Support to Software
Maintenance." Since 1994, he has been a full time researcher of the Software Engineering
Group at IRST (Institute for Scientific and Technological Research), Trento, Italy. He has
participated in several industrial and European Community projects on software analysis and
testing. His current research interests include software engineering, reverse engineering,
object-oriented programming, and code analysis.
