      Source Code Defect Detection

   Yuehua Zhang, Ying Liu, Lingling Zhang, Yong Shi

  Research Center on Fictitious Economy and Data Science,
  Graduate University of Chinese Academy of Sciences
 Related Work
 Our Approach
 Experimental Results
 Future Work

      @ Ying Liu    2009-6-3   2
 Programs usually follow many implicit programming rules,
 e.g. a call to lock is usually followed by a call to unlock later on
 In real programs, programming rules are more complex
 and are not documented explicitly
 Programmers who are unaware of the rules, or who forget
 them, may introduce defects into the code
 Our approach
    Automatically detect the implicit programming rules in the source code
    Detect potential defects that violate the programming rules

Related Work
 Specification generation
   Use inferred program semantic properties for unit test
   generation and selection
    • Tool-Assisted Unit Test Selection Based on Operational
      Violations (Tao Xie and David Notkin)
   Discover specifications of the protocols that code must
   obey when interacting with APIs
    • Specification mining (Ammons et al)
   Extract specifications from programs’ dynamic executions
   to detect dynamic invariants
    • Dynamically discovering likely program invariants to support
      program evolution (M. D. Ernst et al)

Related Work
 Specification-based checking
   Report inconsistencies between the code and its specification
    • LCLint: a light-weight static checker, using predefined
      specifications (D. Evans et al)
   Use a static analysis method to refine specification for
   error detection
    • The user first provides a partial specification of a procedure,
      and then the specification is iteratively refined via
      counterexample analysis (M. Taghdiri et al)

Related Work
 Data mining in software engineering
   Use a data mining technique to extract programming
   rules from software code and detect violations to the
   extracted rules (Zhenmin Li et al)
    • PRMiner — Automatically extracting implicit programming
      rules and detecting violations in large-scale software code

Our Approach
 Use frequent itemset mining to extract frequent programming patterns
 Generate programming rules from the observed patterns
 Detect violations of the rules by scanning the code
 Prune false positives
 Use frequent sequence mining to identify copy-paste
 code fragments
 Automatically check such copy-paste fragments for
 potential bugs

Association Rule Mining
 Association rule mining
   Detect sets of attributes or items that frequently co-occur
   in many database records, and the rules among them
 Frequent itemset mining is the core
 Typical applications
   Basket data analysis
   Catalog design
   Sales campaign analysis
   Web log (click stream) analysis

Basic Concepts

   Transaction-id   Items bought
        10            A, B, D
        20            A, C, D
        30            A, D, E
        40             B, E, F
        50          B, C, D, E, F

 Item collection X = {x1, …, xm}
 Itemset: a set of items; a k-itemset contains k items
 Transaction T ⊆ X; each T is associated with a unique Tid and
 the items bought by a customer
 Rule form P => Q, where P ⊂ X, Q ⊂ X, P ∩ Q = ∅

Basic Concepts
 Support s: the probability that a transaction T contains
 both P and Q
    support(P => Q) = Pr(P ∪ Q ⊆ T)
    [Figure: Venn diagram of customers who buy P, customers who buy Q,
    and customers who buy both]
 Frequent itemset: an itemset whose support is at least min_support
 Frequent itemset mining: find all the rules P => Q satisfying
 min_support
    Let min_support = 50%
    Frequent itemsets: {A:3, B:3, D:4, E:3, AD:3}
    support(A) = 3/5 = 60%, support(AD) = 3/5 = 60%
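
 The support values above can be checked with a few lines of Python. This is only a sketch: the dictionary layout and the function names support_count and support are assumptions made for illustration, and items are taken to be single letters as in the table.

```python
# Transactions from the "Basic Concepts" table (Tid -> items bought).
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for bought in db.values() if items <= bought)

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return support_count(itemset, db) / len(db)

print(support("A", transactions))   # 0.6
print(support("AD", transactions))  # 0.6
```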

Association Rule Mining
 Apriori (R. Agrawal, 1994)
   Use prior knowledge of frequent itemsets
   A candidate generation-and-test approach
   Iterative approach, level-wise search
   Initially, scan the DB once to get the frequent 1-itemsets
   Generate length (k+1) candidate itemsets from the length k frequent itemsets
   Test the candidates against the DB
   Terminate when no frequent or candidate set can be generated
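
 The level-wise search can be sketched as follows. This is a minimal, unoptimized illustration of the generate-and-test idea, not the implementation used in this work; the function name apriori and the min_count parameter (an absolute support count) are assumptions. It is run on the five transactions from the Basic Concepts slide.

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {itemset: support_count} for every itemset with count >= min_count."""
    def count_and_filter(candidates):
        # Test the candidates against the DB and keep the frequent ones.
        counts = {c: 0 for c in candidates}
        for transaction in db:
            for c in candidates:
                if c <= transaction:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    # Scan the DB once to get the frequent 1-itemsets.
    singles = {frozenset([item]) for transaction in db for item in transaction}
    frequent = count_and_filter(singles)
    result = dict(frequent)
    k = 1
    while frequent:
        prev = set(frequent)
        # Join: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        frequent = count_and_filter(candidates)
        result.update(frequent)
        k += 1
    return result

db = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
      {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
freq = apriori(db, min_count=3)  # 50% of 5 transactions, rounded up
for itemset, n in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

 On this database the search stops after the 2-itemsets, because AD is the only frequent pair and no 3-item candidate can be formed from it.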

Frequent Sequence Mining
 Add a timing (ordering) factor to frequent itemset mining
 An example:
  A sequence database

  SID    Sequence
  10     <a(abc)(ac)d(cf)>
  20     <(ad)c(bc)(ae)>
  30     <(ef)(ab)(df)cb>
  40     <eg(af)cbc>

  An element may contain a set of items. Items within an element are
  unordered and we list them alphabetically.
  <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

  Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
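
 The two claims in this example can be checked with a short sketch. The helper names parse, is_subsequence and support are assumptions for illustration, and the parser assumes single-letter items written in the slide's notation.

```python
def parse(text):
    """Parse a string such as 'a(abc)(ac)d(cf)' into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(set(text[i + 1:j]))
            i = j + 1
        else:
            elements.append({text[i]})
            i += 1
    return elements

def is_subsequence(pattern, sequence):
    """True if the elements of `pattern` are subsets of distinct elements
    of `sequence`, taken in the same order."""
    pos = 0
    for element in pattern:
        # Advance to the earliest remaining element that contains this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, db):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(1 for seq in db.values() if is_subsequence(pattern, seq))

db = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}
print(is_subsequence(parse("a(bc)dc"), db[10]))  # True
print(support(parse("(ab)c"), db))               # 2
```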

Our Approach

 [Figure: processing pipeline]
 Source code
   -> Extracting programming patterns
        Parsing source code
        Preprocessing data
        Mining programming patterns
   -> Generating rules
        Generating programming rules
   -> Detecting violations
        Detecting violations
        Ranking and reporting bugs
   -> Suspected defects
Our Approach
    Extracting programming patterns
①   Parsing source code

       Use GCC with option -fdump-ast-original-raw to obtain the
       X.original file of the source code (this file contains the abstract
       syntax tree)
       Divide the X.original file into functions
       Scan each function and extract the useful items, e.g. variable
       name, data type specification, keyword, operator, control logic
       Form a database
            Row-wise: function
            Column-wise: useful items from above

Our Approach
②   Data preprocessing

      Remove items that occur too frequently, e.g. int, float
      Use the common characteristics of local variables, e.g. data
      types, to represent the local variables
      Prefix every identifier according to its role, to avoid
      naming conflicts
          A function call lock is prefixed with F-, giving F-lock
          A global variable lock is prefixed with G-, giving G-lock
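
 The preprocessing step might look like the sketch below. The input format (kind, name, data type tuples), the STOP_ITEMS set and the function name preprocess are hypothetical; only the F- and G- prefixes come from this slide, and the T- prefix for data types follows the parsing example in this deck.

```python
# Hypothetical stop list of items that occur too frequently to be useful.
STOP_ITEMS = {"int", "float"}

# Prefix per identifier role: function call, global variable, data type.
PREFIX = {"call": "F-", "global": "G-", "type": "T-"}

def preprocess(raw_items):
    """Map (kind, name, data_type) tuples to prefixed identifiers."""
    result = []
    for kind, name, data_type in raw_items:
        if kind == "local":
            # Represent a local variable by its data type.
            kind, name = "type", data_type
        if name in STOP_ITEMS:
            continue  # drop overly frequent items
        result.append(PREFIX[kind] + name)
    return result

raw = [
    ("local", "host", "Scsi_Host"),
    ("call", "scsi_host_alloc", None),
    ("local", "retval", "int"),
    ("call", "lock", None),
    ("global", "lock", None),
]
print(preprocess(raw))
# ['T-Scsi_Host', 'F-scsi_host_alloc', 'F-lock', 'G-lock']
```

 The last two items show why the prefix matters: a function and a global variable that share the name lock stay distinct after preprocessing.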

Our Approach
 Example of parsing a function:

       Source code
       Linux-
       L2030 – 2157
       int __devinit twa_probe (struct pci_dev *pdev, ...)
       {
       struct Scsi_Host *host = NULL;
       ......
       host = scsi_host_alloc (&driver_template, ...);
       retval = scsi_add_host (host, &pdev->dev);
       ......
       scsi_scan_host (host);

       Preprocessed identifiers
       T-Scsi_Host
       ......
       T-Scsi_Host
       F-scsi_host_alloc
       T-scsi_host_template
       ......
       F-scsi_add_host
       T-Scsi_Host
       F-scsi_scan_host
       T-Scsi_Host
       ......

Our Approach
③ Mining programming patterns

     Assume that frequent patterns are, with high probability,
     programming rules
     Use the frequent itemset mining algorithm Apriori to find all the
     frequent sub-itemsets

Our Approach
    Generating programming rules
①   For a given frequent k-itemset I_k
       Generate a rule of the form I_{k-1} ⇒ I_1, where I_1 is a single item of I_k
       and I_{k-1} is the rest of I_k
       Support of the rule: support(I_k)
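
 Rule generation from one frequent itemset can be sketched as below. The function name generate_rules and the tuple layout are assumptions. The support count 26 is not stated on the slides; it is derived from the later worked example under the reading that 32 functions contain {alloc, add} and 6 of them lack scan.

```python
def generate_rules(itemset, support_count):
    """Return one rule (antecedent, consequent, support) per item of the
    frequent itemset: the item is the consequent, the rest the antecedent."""
    items = frozenset(itemset)
    return [(items - {item}, item, support_count) for item in sorted(items)]

# Derived assumption: 32 - 6 = 26 functions contain all three items.
rules = generate_rules({"alloc", "add", "scan"}, support_count=26)
for antecedent, consequent, sup in rules:
    print(sorted(antecedent), "=>", consequent, "support:", sup)
```

 A k-itemset therefore yields k candidate rules, each carrying the support of the whole itemset.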

Our Approach
    Detecting violations
①   Detecting violations
       Propose a concept:
           violation_probability = 1 - Support_count(I_{k-1} ¬I_1) / Support_count(I_{k-1})
       where Support_count(I_{k-1} ¬I_1) counts the functions that contain
       I_{k-1} but not I_1
       The higher the violation_probability, the more likely the violation
       is a defect
       Choose the violations whose violation_probability is in the range [t,
       1), where t is a user-specified threshold

           A rule {alloc, add} ⇒ {scan} in pattern {alloc, add, scan}:
           Support_count({alloc, add} ¬{scan}) / Support_count({alloc, add}) = 6/32,
           so violation_probability = 1 - 6/32 = 26/32
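
 A minimal sketch of the violation check for a single rule, following the definition of violation_probability given on this slide (one minus the ratio of the two support counts). The function name find_violations and the synthetic database are assumptions; the database is built only to reproduce the counts 6 and 32 from the example.

```python
def find_violations(db, antecedent, consequent):
    """Return (violation_probability, violating function names) for the rule
    antecedent => consequent, where db maps function name -> set of items."""
    antecedent = set(antecedent)
    matching = [name for name, items in db.items() if antecedent <= items]
    if not matching:
        return None, []
    violators = [name for name in matching if consequent not in db[name]]
    return 1 - len(violators) / len(matching), violators

# Synthetic data: 26 functions follow the rule, 6 omit the scan call.
db = {f"ok_{i}": {"alloc", "add", "scan"} for i in range(26)}
db.update({f"missing_scan_{i}": {"alloc", "add"} for i in range(6)})

prob, violators = find_violations(db, {"alloc", "add"}, "scan")
print(prob)            # 0.8125
print(len(violators))  # 6
```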

Our Approach
②   Ranking and reporting bugs
        Rank all the violations by violation_probability in descending order
        Prune false violations manually, e.g. elements in a
        programming rule spanning across multiple functions
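
 The automatic part of this step (threshold filtering and ranking) can be sketched as follows. The function name rank_violations, the (function name, probability) tuples and the sample values are hypothetical. Manual pruning is not modeled.

```python
def rank_violations(violations, t):
    """Keep violations with t <= violation_probability < 1 and sort them
    from most to least likely to be a defect."""
    kept = [v for v in violations if t <= v[1] < 1]
    return sorted(kept, key=lambda v: v[1], reverse=True)

# Hypothetical (function name, violation_probability) pairs.
violations = [
    ("fn_a", 0.8125),
    ("fn_b", 0.95),
    ("fn_c", 0.05),
]
for name, prob in rank_violations(violations, t=0.10):
    print(name, prob)
```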

    The reported violations are suspected defects
    Programmers can be advised to check their code for the
    potential bugs

Experimental Results
   Linux source code (version 2.6.18)
   Containing about 3500 C files, 3 million lines of
  code, 73,000 distinct functions

 Parameter settings
   min_support = 15, violation_probability threshold t = 10%

 Parser: GCC 4.1.2

Experimental Results
 A programming rule generated by our proposed approach
    2030 int __devinit twa_probe(struct pci_dev *pdev, ...)
    2031 {
    2032 struct Scsi_Host *host = NULL;
    2051 host = scsi_host_alloc ( ... );
    2105 retval = scsi_add_host ( host, &pdev->dev );
    2138 scsi_scan_host( host );
    2157 }

                    (a) Programming rule in twa_probe
Experimental Results
 Violation observed
     778 struct scsi_id_instance_data *sbp2_alloc_device
     (struct unit_directory *ud)
     779 {
     782 struct scsi_id_instance_data *scsi_id = NULL;
     856 scsi_host = scsi_host_alloc( ... );
     865 if (! scsi_add_host( scsi_host, &ud->device )) {
     // scsi_scan_host( scsi_host ) missing!
                     (b) Violation in sbp2_alloc_device
Ongoing Work
 Use a frequent sequence mining algorithm to find copy-paste
 fragments
 Automatically check for potential bugs in copy-paste
 code fragments, including renaming defects, missing
 identifiers, etc.

 Programs usually follow many implicit programming rules,
 e.g. a call to lock is usually followed by a call to unlock later on
 Programmers who are unaware of the rules, or who forget
 them, may introduce defects into the code
 With our approach, implicit programming rules can
 be extracted automatically from large-scale source code
 Violations are then detected automatically
 Programmers can be advised to check the
 potential bugs

Future Work
 Use other software’s source code as testing data
 Reduce false positives in violation detection by the
 following steps:
     Combine our approach with inter-procedural
     analysis for rules that span multiple functions
     Use model checking techniques to check different
     control paths
