Searching the Web for Source Code
Raphael Hoffmann
Daniel Weld

October 30th, 2006
Why Search for Code?
  “How can I parse an XML document?”
 What are developers searching for?


“Most of the code is open source so you can reuse it.
  But I don’t think that’s the primary use – it’s more
  about how to learn about things and when you’re
  building open-source packages, to make sure
  you’re doing it the right way.”

                                Tom Stocky, Product Manager
                                         Google Code Search
What are developers searching for?

  [Figure: What developers using the MSN search engine are interested in finding]
Why not use web search engines?
  Annotated results for the query "parse xml java":

  • Only compatible with new Java versions
  • Requires installation of external library, but no link
  • Code on pages essentially the same
  • Contains no code examples
Code Search Engines
            Index source code of open-source
            projects (from compressed archive
            files and CVS repositories)

            Code is parsed and terms in type
            names, variable names, etc. are
            weighted differently.

            import javax.xml.parsers.*;
            import org.w3c.dom.*;
            public class JAXPSample {
              public static void main(String[] args) {
                String filename = "sample.xml";
                try {
                  DocumentBuilderFactory factory =
                    DocumentBuilderFactory.newInstance();
                  DocumentBuilder parser =
                    factory.newDocumentBuilder();
                  Document d = parser.parse(filename);
                } catch (Exception e) {
                  System.err.println("Exception: " + e.getMessage());
                }
              }
            }
Why not use code search engines?
  Annotated results for the query "parse xml java":

  • Irrelevant (an Emacs Lisp file!?!)
  • Code is complicated, contains no comments related to the query, and is more than 300(!) lines long
  • Requires installation of external library, but no link
  • Code on pages essentially the same
                 Analysis


• Web search engines use structure on web
  pages, but ignore structure in code.

• Code search engines use structure in code,
  but ignore structure on web pages.
Our solution

    • A novel hybrid search engine
    • Index code snippets found on web pages
    • Link them to required libraries and documentation
Assieme

[Figure: Assieme interface, annotated with:]
• links to required libraries
• links to pages with snippets
• group pages with similar snippets
                 Assieme
• High precision
• Results contain documentation and simple
  examples
• Dependencies in code are shown along with
  links to download required libraries
• Page summaries show relevant
  packages/types being used
• Pages with similar snippets grouped together
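
The slides do not say how similarity between snippets is measured. As a minimal sketch only, assume two pages belong in the same group when their snippets reference exactly the same set of types. The class name, method name, and page identifiers below are hypothetical and are not part of Assieme.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: groups pages whose snippets reference the same types.
// The real similarity measure is not given in the slides; exact set equality
// is an assumption made here for illustration.
final class SnippetGroupingSketch {

    // pageToTypes maps a page identifier to the type names its snippet uses.
    // Returns one entry per distinct type set, listing the pages that share it.
    static Map<Set<String>, List<String>> groupByReferencedTypes(
            Map<String, Set<String>> pageToTypes) {
        Map<Set<String>, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : pageToTypes.entrySet()) {
            // Copy into a sorted set so the key is independent of the caller's set.
            Set<String> key = new TreeSet<>(entry.getValue());
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(entry.getKey());
        }
        return groups;
    }
}
```

A search front end could then show one representative page per group and fold the others underneath it.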
Assieme Operation

Preprocessing
  1. Crawling web pages and libraries
  2. Indexing libraries
  3. Extracting snippets from web pages
  4. Finding the code referenced in snippets

Query Time
  • Handling Queries
Assieme Operation
Crawling web pages and libraries (preprocessing)

• Tutorial and technical article sites, e.g. IBM DeveloperWorks, and pages with keywords
• Open-source sites, e.g. Sourceforge.net, Apache.org
Assieme Operation
Indexing libraries (preprocessing)

mailapi.jar → Language Parser → library index

library index:
  packages
    javax.mail.search
    javax.mail.event
    ...
  types
    javax.mail.SecuritySupport12
    javax.mail.Address
    ...
  methods / fields
    javax.mail.Flags.add(...)
    javax.mail.add(...)
    ...
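
The indexing step above can be approximated without a full language parser. The following is a minimal sketch, assuming the index is built only from the entry names inside a JAR (for example `javax/mail/Address.class`); the class and method names are hypothetical, and methods and fields are left out because they would require reading the class files themselves.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: derives package and type names from JAR entry names.
// Methods and fields are not indexed here; that would need a bytecode or
// source parser, which the slides call the "Language Parser".
final class LibraryIndexSketch {
    final Set<String> packages = new TreeSet<>();
    final Set<String> types = new TreeSet<>();

    // entryNames are paths as stored in a JAR, e.g. "javax/mail/Address.class".
    static LibraryIndexSketch fromEntryNames(List<String> entryNames) {
        LibraryIndexSketch index = new LibraryIndexSketch();
        for (String name : entryNames) {
            if (!name.endsWith(".class")) {
                continue; // skip directories, manifests, and other resources
            }
            String typeName = name
                    .substring(0, name.length() - ".class".length())
                    .replace('/', '.');
            if (typeName.contains("$")) {
                continue; // skip nested and anonymous classes in this sketch
            }
            index.types.add(typeName);
            int lastDot = typeName.lastIndexOf('.');
            if (lastDot > 0) {
                index.packages.add(typeName.substring(0, lastDot));
            }
        }
        return index;
    }
}
```

The entry names could be collected from `java.util.jar.JarFile.entries()` for a library such as mailapi.jar.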
Assieme Operation
Extracting snippets from web pages (preprocessing)

<a name="listing1">
<b>Listing 1. MBean interface for a Web
server</b></a><br />

<table width="100%" cellpadding="0"
           cellspacing="0" border="0">
 <tr>
  <td class="code-outline">
   <pre class="displaycode">
      public interface WebServerMBean {
        public int getPort();
        public String getLogLevel();
        public void setLogLevel(String level);
        public boolean isStarted();
        public void stop();
        public void start();
      }
   </pre>
  </td>
 </tr>
</table><br /> <p> Implementing the MBean class is
usually fairly straightforward, as the MBean

Classifier (Snippet or Text?) uses
• HTML structure
• Terms/n-gram statistics
• Language parsers
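
The slides name the signals the classifier uses but not how they are combined. As a rough sketch only, the code below uses one HTML-structure signal (text inside `<pre>` elements) and one crude line statistic (the share of lines ending in `;`, `{`, or `}`) in place of a trained classifier. All names and the 0.5 threshold are assumptions, and HTML entities are not decoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: finds <pre> blocks and guesses whether each is code.
// This heuristic stands in for the classifier on the slide; it is not the
// authors' method.
final class SnippetExtractorSketch {
    private static final Pattern PRE_BLOCK = Pattern.compile(
            "<pre[^>]*>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the raw contents of every <pre> element in the page.
    static List<String> candidateBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = PRE_BLOCK.matcher(html);
        while (matcher.find()) {
            blocks.add(matcher.group(1).trim());
        }
        return blocks;
    }

    // Fraction of non-blank lines that end like a Java statement or block.
    static double codeScore(String block) {
        int nonBlank = 0;
        int codeLike = 0;
        for (String line : block.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            nonBlank++;
            if (trimmed.endsWith(";") || trimmed.endsWith("{") || trimmed.endsWith("}")) {
                codeLike++;
            }
        }
        return nonBlank == 0 ? 0.0 : (double) codeLike / nonBlank;
    }

    // Assumed threshold: at least half of the lines must look like code.
    static boolean looksLikeCode(String block) {
        return codeScore(block) >= 0.5;
    }
}
```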
Assieme Operation
Finding the code referenced in snippets (preprocessing)

public class WebServer implements WebServerMBean
{ ... }
...
WebServer ws = new WebServer(...);
MBeanServer server =
 ManagementFactory.getPlatformMBeanServer();
server.registerMBean
 (ws, new ObjectName
  ("myapp:type=webserver,name=Port 8080"));

Language Parser (using the library index) yields:
• referenced libraries
• referenced types
• referenced methods, fields
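
A minimal sketch of the lookup step, assuming a regular expression is swapped in for the language parser and the library index is reduced to a map from simple type names to fully qualified names. The class and method names are hypothetical. Unlike a real parser, this approach cannot tell a type apart from a capitalized word in a string literal or comment, so it only reports names that are present in the index.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: resolves capitalized identifiers in a snippet
// against a simplified library index.
final class ReferenceResolverSketch {
    // Whole identifiers that start with an upper-case letter.
    private static final Pattern TYPE_NAME =
            Pattern.compile("\\b[A-Z][A-Za-z0-9_]*\\b");

    // index maps simple type names to fully qualified names.
    // Returns the subset of index entries that the snippet mentions.
    static Map<String, String> resolve(String snippet, Map<String, String> index) {
        Map<String, String> referenced = new TreeMap<>();
        Matcher matcher = TYPE_NAME.matcher(snippet);
        while (matcher.find()) {
            String qualifiedName = index.get(matcher.group());
            if (qualifiedName != null) {
                referenced.put(matcher.group(), qualifiedName);
            }
        }
        return referenced;
    }
}
```

For the snippet on this slide, an index containing `MBeanServer`, `ManagementFactory`, and `ObjectName` would resolve all three, while `WebServer` would stay unresolved because it is defined in the snippet itself.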
Assieme Operation
Handling Queries (query time)

Query → TF/IDF scoring → Cluster based on ref. code → add links to libraries

Weights for terms in text, url, title, used packages, types, methods, …
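
The slides give no formula for the scoring step. As a sketch under stated assumptions, the code below sums tf × idf over the fields of a page (text, title, used types, and so on) and multiplies each field's contribution by a per-field weight. The idf form log(N / df), the field names, and the weights are assumptions chosen for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: field-weighted TF/IDF score for one page.
final class FieldWeightedScorerSketch {

    // pageFields:   field name -> terms found in that field of the page
    // fieldWeights: field name -> weight (fields without a weight count as 1.0)
    // docFreq:      term -> number of indexed pages containing the term
    static double score(List<String> queryTerms,
                        Map<String, List<String>> pageFields,
                        Map<String, Double> fieldWeights,
                        Map<String, Integer> docFreq,
                        int totalDocs) {
        double score = 0.0;
        for (String term : queryTerms) {
            int df = docFreq.getOrDefault(term, 0);
            if (df == 0) {
                continue; // term never seen in the collection
            }
            double idf = Math.log((double) totalDocs / df);
            for (Map.Entry<String, List<String>> field : pageFields.entrySet()) {
                double weight = fieldWeights.getOrDefault(field.getKey(), 1.0);
                int tf = Collections.frequency(field.getValue(), term);
                score += weight * tf * idf;
            }
        }
        return score;
    }
}
```

With a higher weight on the title field than on body text, a page whose title contains a query term scores above a page that mentions the term only in its body.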
        Conclusion & Outlook
• To make code search work well, we need to
  exploit both structure on web pages and
  structure in code.
• Next, we want to better visualize and
  support library version information.
• Also, we are interested in linking
  troubleshooting information on the web,
  such as error messages, to code.

								