N-gram Topic Models for Bibliometric Analysis
Gideon Mann, David Mimno, and Andrew McCallum


 Can topic models provide better
 measurements of the impact of
 research literature?
            Bibliometrics and
             Scientometrics
Typically analyzes patterns of citations in
research literature

Derek de Solla Price: “Little Science, Big
Science”

Eugene Garfield: Science Citation Index, Journal
Citation Reports
Comparing apples to apples: top
    journals by citations
Biochemistry and molecular biology:
      J. Biol. Chem          405017
      Cell                   136472
      Biochem.-US             96809

Mathematics:
      Lect. Notes Math          6926
      T. Am. Math. Soc          6469
      J. Math. Anal. Appl.      6004

Source: Journal Citation Reports (2004)
   What’s wrong with grouping
          by journal?
• 10 of the 200 most cited papers in CiteSeer
  are unpublished technical reports; 15% of the
  most cited papers are from conference
  proceedings
• Open-access publication is increasing, but venue
  information is often not available
• Hand-entered ISI citation data is noisy
• An article has only one venue, but journals cover
  many topics
A topic model for N-grams
Determine whether the next word will be part of an n-gram based on the
current word and the current hidden topic.

“White house” is a collocation in politics, but may not be one in real
estate.
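
As a rough illustration of the idea above, the sketch below groups words into phrases using a lookup keyed by the current topic and the current word. It is a toy under stated assumptions, not the model from the talk: the probability table `CONTINUE_PROB`, the topic labels, and the fixed `threshold` are all made up for illustration, and a real topic model would learn these probabilities and sample the decision rather than threshold it.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-picked probabilities that the NEXT word continues an
# n-gram, keyed by (current topic, current word). Not learned from data.
CONTINUE_PROB: Dict[Tuple[str, str], float] = {
    ("politics", "white"): 0.9,
    ("real_estate", "white"): 0.1,
}


def group_ngrams(tokens: List[str], topics: List[str],
                 threshold: float = 0.5) -> List[str]:
    """Attach a word to the preceding phrase when the preceding word and
    its topic make a continuation likely (probability above threshold)."""
    phrases: List[str] = []
    for i, word in enumerate(tokens):
        joins_previous = (
            i > 0
            and CONTINUE_PROB.get((topics[i - 1], tokens[i - 1]), 0.0) > threshold
        )
        if joins_previous:
            phrases[-1] += " " + word
        else:
            phrases.append(word)
    return phrases
```

With these invented numbers, `group_ngrams(["white", "house"], ["politics", "politics"])` keeps the two words together as one phrase, while the same words under `"real_estate"` stay separate.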
        Sample n-gram topics
1. Digital Libraries (102): digital, electronic, library,
   metadata, access; “digital libraries”, “digital library”,
   “electronic commerce”, “dublin core”, “cultural heritage”
2. WWW (129): web, site, pages, page, www, sites; “world
   wide web”, “web pages”, “web sites”, “web site”, “world
   wide”
3. Ontologies (186): semantic, ontology, ontologies, rdf,
   semantics, meta; “semantic web”, “description logics”,
   “rdf schema”, “description logic”, “resource description
   framework”
4. Web services (184): web, services, service, xml,
   business; “web services”, “web service”, “markup
   language”, “xml documents”, “xml schema”
     Assigning topics to documents
1. Build a 200-topic n-gram topic model on 300k
   documents
2. Remove stopword or methodological topics (e.g.
   “efficient, fast, speed”)
3. For each document d, if more than 10% of d’s tokens
   are assigned to topic t, and those tokens number more
   than two, assign d to t

   Each topic is now an intellectual “domain” that includes
   some number of documents. We can substitute topic
   for journal in most traditional bibliometric indicators.
   We can also now define several new indicators.
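
The assignment rule in step 3 can be sketched as follows. This is a minimal illustration, assuming each document arrives as a list with one topic id per token; the function name `assign_topics` and the `excluded` parameter (standing in for the stopword and methodological topics removed in step 2) are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List, Set


def assign_topics(token_topics: List[int],
                  excluded: FrozenSet[int] = frozenset(),
                  min_fraction: float = 0.10,
                  min_tokens: int = 2) -> Set[int]:
    """Return the topics a document belongs to.

    token_topics holds one topic id per token. A topic qualifies when it
    covers more than min_fraction of the tokens AND more than min_tokens
    tokens. Assumption: the fraction is taken over all tokens, including
    those assigned to excluded topics.
    """
    total = len(token_topics)
    counts = Counter(token_topics)
    return {
        topic
        for topic, count in counts.items()
        if topic not in excluded
        and count > min_fraction * total
        and count > min_tokens
    }
```

A document can land in several topics or in none, which is what lets a topic stand in for a journal without forcing each paper into a single venue.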
                Impact Factor
Journal Impact Factor: Citations from articles
  published in 2004 to articles in Cell published
  in 2002-3, divided by the number of articles
  published in Cell in 2002-3.

2004 Impact factors from JCR:
      Nature               32.182
      Cell                 28.389
      JMLR                  5.952
      Machine Learning      3.258
Topic Impact Factor
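
Substituting a topic for a journal, the impact factor definition above might be computed as in the sketch below. The data layout is an assumption: citations as `(citing_id, cited_id)` pairs, a `year` lookup per paper, and `members` as the set of papers assigned to the topic (or published in the journal). None of these names come from the talk.

```python
from typing import Dict, Iterable, Set, Tuple


def impact_factor(citations: Iterable[Tuple[str, str]],
                  year: Dict[str, int],
                  members: Set[str],
                  census_year: int) -> float:
    """Citations made in census_year to members published in the two
    preceding years, divided by the number of such members."""
    window = {census_year - 2, census_year - 1}
    recent = {p for p in members if year.get(p) in window}
    if not recent:
        return 0.0  # nothing published in the window
    hits = sum(
        1
        for citing, cited in citations
        if year.get(citing) == census_year and cited in recent
    )
    return hits / len(recent)
```

For the Cell example, `census_year` would be 2004 and the window would be 2002 and 2003.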
      Broad Impact: Diffusion
Journal Diffusion: # of journals citing Cell divided
by the total number of citations to Cell, over a
given time period, times 100

Problem: relatively brittle at low citation counts. If
a topic/journal is cited twice by two different
topics/journals, it will have high diffusion.
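
The brittleness is easy to see in a small sketch. The function below follows the diffusion definition above, assuming the input is a list with one entry per citation naming the citing journal or topic; the name `diffusion` and the input format are illustrative choices.

```python
from typing import List


def diffusion(citing_sources: List[str]) -> float:
    """Number of distinct citing journals/topics divided by the total
    number of citations received, times 100."""
    if not citing_sources:
        return 0.0
    return 100.0 * len(set(citing_sources)) / len(citing_sources)
```

Two citations from two different sources give the maximum score of 100, while 100 citations split across two sources give only 2, even though the second case shows far more impact.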
       Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of citing topics

Diversity is better than diffusion at capturing the broad end of the
impact spectrum: the highest-diffusion topics are identical to the least
frequently cited topics
Topic diversity can also be measured for individual papers.
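
A minimal sketch of the entropy computation, usable for either a topic or a single paper: pass in the topic of every citation received. The logarithm base is not stated in the slides; base 2 (bits) is assumed here, and the function name `diversity` is illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def diversity(citing_topics: Iterable[Hashable]) -> float:
    """Shannon entropy (base 2, an assumption) of the distribution of
    citing topics. Higher values mean citations are spread more evenly
    across more topics."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
```

Unlike diffusion, the score grows only when citations are spread across many topics: citations all from one topic score 0, and citations split evenly over four topics score 2 bits regardless of how many citations there are.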
    Longevity: Cited Half Life
Two views:
• Given a paper, what is the median age of citations to
  that paper?
• What is the median age of citations from current
  literature?
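
Both views reduce to a median over citation ages, as the sketch below shows. It assumes citations are `(citing_id, cited_id)` pairs with a `year` lookup per paper, and it takes the age of a citation to be the citing year minus the cited year. The function names are hypothetical, and this follows the median wording of the slide rather than any particular published half-life formula.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple

Citation = Tuple[str, str]  # (citing_id, cited_id)


def median_citation_age(citations: List[Citation],
                        year: Dict[str, int]) -> Optional[float]:
    """Median of (citing year - cited year); None if there are no citations."""
    ages = [year[citing] - year[cited] for citing, cited in citations]
    return median(ages) if ages else None


def cited_half_life(paper: str, citations: List[Citation],
                    year: Dict[str, int]) -> Optional[float]:
    """View 1: median age of the citations TO a given paper."""
    return median_citation_age(
        [(s, t) for s, t in citations if t == paper], year)


def citing_half_life(current_year: int, citations: List[Citation],
                     year: Dict[str, int]) -> Optional[float]:
    """View 2: median age of the citations FROM papers published in
    current_year."""
    return median_citation_age(
        [(s, t) for s, t in citations if year[s] == current_year], year)
```

To get a per-topic figure, the citation list can be filtered to papers assigned to that topic before calling either function.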
  History: Topical Precedence
Within a topic, what are the earliest papers that received
   more than n citations?

Information Retrieval (138):
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and
    Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness
    Based on the Weak Ordering Action of Retrieval Systems, Cooper
    (1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton
    (1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets,
    Feiten and Gunzel (1982)
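
A list like the one above could be produced with a filter and a sort. The sketch assumes each paper in a topic is a `(title, year, citation_count)` tuple; the function name, the tuple layout, and the cutoff `k` on how many papers to show are illustrative choices, not details from the talk.

```python
from typing import List, Tuple

Paper = Tuple[str, int, int]  # (title, publication year, citation count)


def earliest_cited_papers(papers: List[Paper], n: int,
                          k: int = 5) -> List[Paper]:
    """Return the k earliest papers in a topic that received more than
    n citations, oldest first."""
    qualifying = [p for p in papers if p[2] > n]
    return sorted(qualifying, key=lambda p: p[1])[:k]
```

Raising `n` filters out early papers that were picked up by the topic model but never became influential within the topic.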
Sharing: Topical Transfer

						