Research Methods talk, December 8, 2005.
Finding and Organizing CS Research Material
Owen Kaser
Dept. of Computer Science and Applied Statistics, UNB Saint John
Do as I say, not as I do...
(Assume you have an undergrad’s ability to find CS books in a paper library; let’s go beyond that...)
Overview
% Part I: Finding
% Part II: Organizing
Good Academic Citizenship
% Diligently search. Avoid “independently discovering” something that you
could have easily found with a search. It’s sloppy scholarship and may look like plagiarism. Yes, it can always happen. Act to reduce the probability!
% Diligently record and organize. Otherwise, uncited and unconscious borrowing of ideas is likely. Avoid unintended plagiarism.
Part I: Finding CS Research Materials
Issues for “Findables”:
% kind (media: electronic vs paper; forum: journal, conference, blog...)
% reliability
% hairiness
% currency
% authorship
Kinds of Materials : Articles
% Survey papers (e.g., in ACM Computing Surveys)
% Journal articles
% Conference/Workshop/Symposium/Colloquium proceedings
% White papers, Technical Reports, Student projects (!) etc.
% Emails from peers/researchers
% Discussion forum postings, web log (blog) entries, web pages
Kinds of Materials : Other
% Data (test inputs, benchmarks)
% Bibliographies, annotated or not.
% Programs: freedom as in beer, or as in speech? Or not at all?
Legal issues:
% access might require nondisclosure agreement (NDA)
% implications of GPL. Other open-source licenses.
Reliability of Journal Articles
Peer-review system for journals; high-quality journals are thus pretty reliable. But never trust blindly.
Danger: I can start Acta Owensis.
Your supervisor will know the quality journals in your area.
Hints: look at the publisher/sponsor. IEEE, ACM should be safe. Many commercial publishers (Springer, Elsevier) have a quality reputation to protect. (But they still publish a spectrum.)
A paper is typically revised (improved) after the review(s).
Quality issues: clarity, novelty, correctness.
Reliability of Conference Articles
% degree-of-refereedness of a given conference ∈ [0, careful].
% Reviewing typically rushed: focus on semi-clarity and novelty. Correct?
% Clear quality differences between conferences. Acceptance rate: rough guide ⇒ typically mentioned in proceedings. Also look for IEEE, ACM... as sponsor/publisher.
% “Publish or perish” makes good authors publish (some) bad papers.
Grey Literature
There is much value in unrefereed forms. Use discretion.
% Tech. Reports (of CS depts/research labs). Self-published. Preprints or “valuable but not publishable”. (Hint: make them as a form of IP protection.) Pre-1990, paper only and you might pay a few $.
% Emailed/photocopied preprints (informal “old boys” networking)
% “White papers”: spectrum from marketing drivel to “could be TR”
% Patents are obfuscated. Examiners historically lack CS background.
Other Unrefereed Communication
% Mail/email from an author you contacted
% Postings on blogs, newsgroups, web fora
% Information on general web sites
% Wikipedia etc.: a special case
Warning: Fools, cranks, liars, plagiarists, and biased people love soap boxes.
Currency of Material
CS moves fast, but careful review/revision/publication takes time.
% journal articles: gestation period 1 to 5+ years (useless?)
% conference article: several months to a year old
% grey literature: as soon as an idea is written up roughly, it can be disseminated!
Quality vs currency tradeoff.
In CS, refereed conferences are very important.
Authorship
Authors have reputations (if good, they want to protect them).
Authors usually stick to a few areas per decade
⇒ If you liked Dr X’s recent paper, you should look at her past papers and pay attention to her upcoming work.
Researchers usually have websites with their TRs, abstracts, publication lists etc. Increasing use of blogs.
NSERC: list of funded Cdn researchers and their project summaries.
Using Research Articles
Most articles you find will be only slightly important. You need to know “related work”, sorta.
Read many article abstracts,
Browse quite a few article introductions, related work sections,
Read some article introductions,
Read a few completely,
Study a very few completely.
Put most effort on understanding papers from high-quality fora.
Other Materials: Data
For specific research areas, test data sets are often available. How else can you compare your XYZ technique against the others? Third-party source is best; otherwise, someone may have preselected data favouring them. You should not preselect.
Other Materials: Bibliographies
If you can find a recent, comprehensive bibliography in your area, you are lucky. Annotated? You’re doubly lucky. Note: don’t trust annotations.
Other Materials: Software
You may need tools to use. You may need something to modify. Latter case: you need open source [or good contacts and an NDA]. SourceForge is a big open-source repository. Also, Savannah. Also, by subarea: HPC has “netlib”, Algorithms has Skiena’s site, etc.
Finding: Published works
Topic categories can help you narrow a journal search. In CS, ACM has a system of categories. Used in their Computing Reviews and Guide to Computing Literature. Adopted by other publishers. http://www.acm.org/class/ Browse it to see where your topics will fit.
Finding Journal Articles
Identify relevant journals for your area. Get added to their email table-of-content service, or subscribe to their RSS feed. Most journals publish an annual end-of-volume index in the last issue of a given volume. Typically by author and by category. Useful in library. Tedious, to search volume by volume. “Sooo pre-Internet”
Finding Published Articles
There are “index services” such as INSPEC, ISI Web of Knowledge that you’ve probably been told about already. Also ACM Guide to Computing Literature. Go through our library server.
Keyword Searches
To search by keyword, you must know the words. Problem: You have an idea and make your own name for some concept. How do you know it has not been invented before, with a different name? In the future, there will be ontologies for each specialized area. But not yet. Guess-and-google? Read “related work” sections of related work?
Full Text Search
% CiteSeer (has many preprints, tech reports, etc.) The best.
% Google Scholar
% Google (too much junk?)
% ACM Digital library, IEEE etc...
Searching Bibliographies/ CSB
In CS, BibTeX format is the de facto standard. Often a keen expert in some area will begin to build a bibliography in a narrow field. On an obscure private server. Perhaps expert is less keen in a few years. Disadvantage+advantage: coverage is spotty. Combine together all these little bibliographies: useful “warehouse”. Includes technical reports lists of many big CS depts.
http://liinwww.ira.uka.de/bibliography/index.html wonderful.
DBLP
DBLP: originally bibliography for Data Bases / Logic Programming. In recent years, they’ve broadened to all CS. Very useful to get a chronological list of papers for an author. Focus is on publications, not grey literature. Highly recommended.
arXiv
arXiv (arxiv.org) is popular for math and physics. Only 1k CS articles vs 100k physics. Contrast to CiteSeer (citeseer.ist.psu.edu), which has “zillions” of CS articles, grey and preprint. Not recommended for CS.
Survey Articles
Survey articles briefly describe how the various subfields in a field fit together. Also cite the main papers in each subfield. Good survey: great way to start in a new area. “ACM Computing Surveys” periodical. Also, typically the chapters in “Handbook of XYZ”. Recent surveys don’t exist for all areas.
Chasing Citations
Papers are linked by citations. Found a good paper?
⇒ papers they cited: basic background, earlier work
⇒ papers citing them: followup work.
Every good paper stumbled on should lead you to a cluster of interesting papers (citation cluster). CiteSeer is great for this. (Also CiteSeer allows author homepage search.)
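
Citation chasing can be pictured as a short walk over the citation graph, in both directions. The following is only a sketch, assuming you already hold a small table of who cites whom (built by hand or exported from some index); the function name and the paper labels are made up for illustration and are not part of CiteSeer or any other tool.

```python
from collections import deque


def citation_cluster(start, cites, max_hops=1):
    """Return the papers within max_hops citation links of start.

    cites maps a paper to the list of papers it cites.
    Links are followed both ways: cited papers and citing papers.
    """
    # Build the reverse map: paper -> set of papers citing it.
    cited_by = {}
    for paper, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(paper)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        paper, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        neighbours = set(cites.get(paper, ())) | cited_by.get(paper, set())
        for other in neighbours:
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return seen - {start}


# Hypothetical citation table: B cites A; C and D cite B; E cites D.
example = {"B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}
print(sorted(citation_cluster("B", example)))              # one hop
print(sorted(citation_cluster("B", example, max_hops=2)))  # two hops
```

One hop from the good paper gives its background and its direct followups; a second hop widens the cluster, at the cost of more papers that are only slightly important.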
More on Citations (Controversial)
High impact papers are cited often. Journals/Conferences with many highly cited papers are typically better. Citations help find “influential” people in your chosen area. Look for someone with several highly cited papers (avoid the ‘one lucky paper’ syndrome).
http://citeseer.ist.psu.edu/mostcited.html
Note: not all CS folks. 277 CS papers cite A. Einstein. . . And all D. Johnsons are lumped together.
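
The "several highly cited papers" rule of thumb can be written down as a small filter. This is a sketch under stated assumptions: the input is a hand-made list of (authors, citation count) pairs, the thresholds are arbitrary, and the names are placeholders. It matches authors by name string only, so it inherits the lumping problem noted above.

```python
from collections import Counter


def consistently_cited(papers, min_citations=100, min_papers=2):
    """Return authors with at least min_papers highly cited papers.

    papers is a list of (list_of_author_names, citation_count) pairs.
    A paper counts as highly cited if it has >= min_citations.
    """
    counts = Counter()
    for authors, citations in papers:
        if citations >= min_citations:
            counts.update(set(authors))  # count each author once per paper
    return sorted(a for a, n in counts.items() if n >= min_papers)


# Hypothetical data: author Z has one very lucky paper, X has two solid ones.
sample = [
    (["X"], 500),
    (["X", "Y"], 150),
    (["Z"], 2000),
    (["Y"], 3),
]
print(consistently_cited(sample))
```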
More on CiteSeer
CiteSeer takes submitted articles and tries to parse them. Automatic BibTeX generation: I get co-author “Unb Saint John”. Frequently, CiteSeer messes up BibTeX entries. Authors can update, but often don’t. Good: allows paper download in various formats. Bad: Often using preprint/ TR versions, and “real publication” is available elsewhere. (Copyright issues?)
Organizing Research Materials
% Ensure bibliographic data recorded when you collect the article.
% If promising, record a few notes about article.
% File something (paper or electronically)
Make a Bibliography
I advise: use BibTeX (and hence LaTeX).
Enter every promising paper in BibTeX. Use an unused field to write yourself a note, summarizing the paper’s importance to you. You will be thankful later.
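
As an illustration of the "unused field" idea, here is a minimal sketch that writes a BibTeX entry with a personal note. It assumes the note goes in the "annote" field, which the standard BibTeX styles do not print; the helper name, the key and the entry contents are placeholders invented for the example, not a real reference.

```python
def bibtex_entry(entry_type, key, fields, my_note=None):
    """Return a BibTeX entry as a string.

    my_note is stored in the 'annote' field, which the standard
    bibliography styles ignore, so it never shows up in a paper.
    """
    items = dict(fields)
    if my_note:
        items["annote"] = my_note
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in items.items())
    return f"@{entry_type}{{{key},\n{body}\n}}\n"


# Placeholder entry, for illustration only (not a real paper).
entry = bibtex_entry(
    "article",
    "author2005placeholder",
    {
        "author": "A. Author",
        "title": "A Placeholder Title",
        "journal": "Placeholder Journal",
        "year": "2005",
    },
    my_note="Promising baseline; compare with my approach.",
)
print(entry)
```

Typing the entry by hand in an editor works just as well; the point is only that the note lives next to the bibliographic data.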
Organize Physical Papers
file folders, filing cabinets. Today, many papers have a little footer on their first page, saying where they appeared. If this information is missing, take 20 seconds and scrawl it on the paper. Months later, when you discover the paper and want to cite it, you will thank me for saving you from an annoying search. Better: add all papers you print/copy to your BibTeX file.
Organize Electronic Downloads
Downloaded papers often have computer-generated (useless) names. Maintain subdirectories by topic; rename files. Record bibliographic info. Alternative: keep only a URL (and info to help you)
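
One possible renaming scheme, sketched below: first author, year, and the first few title words, filed under a topic subdirectory. The naming pattern, the "papers" root directory and the function names are assumptions made for this example; any consistent scheme will do.

```python
import re
from pathlib import PurePosixPath


def tidy_filename(first_author, year, title, max_words=4, ext=".pdf"):
    """Build a readable file name such as 'kaser2005-finding-and-....pdf'."""
    words = re.findall(r"[a-z0-9]+", title.lower())[:max_words]
    author = re.sub(r"[^a-z0-9]", "", first_author.lower())
    return f"{author}{year}-{'-'.join(words)}{ext}"


def planned_path(topic, first_author, year, title, root="papers"):
    """Return where the file should go: root/topic/tidy-name (nothing is moved)."""
    return PurePosixPath(root) / topic / tidy_filename(first_author, year, title)


# Example using this talk's own title; 'research-methods' is a made-up topic.
print(planned_path("research-methods", "Kaser", 2005,
                   "Finding and Organizing CS Research Material"))
```

The function only computes the target path; the actual move (and the BibTeX entry) is still up to you.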
Use Modern Tools
Seen the physics research complex with whiteboards even in the tunnels?
When an idea’s hot, record it. Bounce it off someone (supervisor?)
Useful tools
% (older) LaTeX “comments” for running discussions between authors
% Wiki (the tool for Wikipedia)
% CVS (concurrent versions system) and ViewCVS
% Aggregators
UNBSJ Problem
But. . . who will set up and maintain tools? Where’s the backed-up and secure UNBSJ Unix server to host them? Those who don’t know how useful the tools are, will not insist that they become part of the university infrastructure.
Wikis for Research
Wiki: tool for collaborative Web authoring
Wiki’s language is less tedious than HTML
Authorship/viewing rights should be controlled. (Idea thieves and hackers exist.)
Wiki: a Lemur Page Snippet
Zooming in
Disadvantage: determining new content. (Aggregators?)
Older alternative: LaTeX conversations work well with ViewCVS.
CVS: Free Version Control
Your thesis chapters, programs, articles go through versions. They have multiple authors. Version control is for more than S.E. artifacts. Learn and use a modern tool. I switched from RCS to CVS a few years back.
AFAIK, CVS is clunky with Word documents. Great with LaTeX.
ViewCVS: Web Interface
ViewCVS: files in some module
How Did a BibTeX File Change?
Data Generation/Analysis
% Automate the process. You’re CS, so script it! Makes it painless, when
you find a bug, to regenerate and reanalyze.
% Save your data/programs: if you’re challenged a year later, you can
respond. Experiments must be repeatable; you, as scientist, must ensure this.
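
A minimal sketch of a scripted, repeatable experiment follows. The "experiment" here is a stand-in (random numbers and summary statistics), and the function names are invented for the example; the point is the shape: a fixed seed, generation and analysis as separate steps, and a single entry point that regenerates everything.

```python
import random
import statistics


def generate_data(seed, n=1000):
    """Generate synthetic test data; the same seed gives the same data."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def analyze(data):
    """Reduce the raw data to the numbers that would go in the paper."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),
    }


def run_experiment(seed):
    """One command regenerates and reanalyzes everything."""
    return analyze(generate_data(seed))


if __name__ == "__main__":
    # Record the seeds with the results so the run can be repeated later.
    for seed in (1, 2, 3):
        print(seed, run_experiment(seed))
```

Keep the script, the seeds and the outputs under version control together, so a challenge a year later can be answered by rerunning it.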
Good Academic Citizens Give Good References
When you write (article, thesis, . . . ), you should contribute
% good, correct references to the most related work
% references to the most authoritative version available.
Thus, check the references others give. Errors are frequent. Don’t accept X’s word on what Y said.
Get the Authoritative Version
% Authority: TR < Conference < Journal.
% If you have one version, actively search for a better version. Redo this search right before submitting thesis/paper. Things change. Titles change. Main authors don’t.
% But: don’t assume journal version is a superset of the TR version! Referees sometimes force deletions.
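
The ordering TR < Conference < Journal can be applied mechanically when your bibliography holds several versions of the same work. A sketch, assuming each version is tagged by hand with a kind; the tags, the function name and the sample records are invented for the example.

```python
# Higher number = more authoritative, following TR < Conference < Journal.
AUTHORITY = {"techreport": 0, "conference": 1, "journal": 2}


def most_authoritative(versions):
    """Pick the version to cite from several versions of one work.

    versions is a list of dicts, each with a 'kind' key
    ('techreport', 'conference' or 'journal').
    """
    return max(versions, key=lambda v: AUTHORITY[v["kind"]])


# Hypothetical versions of a single piece of work.
versions = [
    {"kind": "techreport", "year": 2003, "title": "Early Title"},
    {"kind": "journal", "year": 2005, "title": "Final Title"},
    {"kind": "conference", "year": 2004, "title": "Middle Title"},
]
print(most_authoritative(versions)["title"])
```

This only chooses what to cite. It does not tell you which version to read: the TR may still contain material the referees had removed.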
Conclusion
% There’s much good, relevant information out there,
plus much that’s good but irrelevant, plus much that’s garbage [or outdated].
% There are good tools and strategies for finding “stuff”,
and organizing it, and working on it with a team. I put some useful links on
http://sjwebserver.unbsj.ca/~owen/gsresources.html