N-gram Topic Models for
Bibliometric Analysis
Gideon Mann, David Mimno,
and Andrew McCallum
Can topic models provide better
measurements of the impact of
research literature?
Bibliometrics and
Scientometrics
Typically analyzes patterns of citations in
research literature
Derek de Solla Price: “Little Science, Big
Science”
Eugene Garfield: Science Citation Index, Journal
Citation Reports
Comparing apples to apples: top
journals by citations
Biochemistry and molecular biology:
  J. Biol. Chem            405017
  Cell                     136472
  Biochem.-US               96809
Mathematics:
  Lect. Notes Math           6926
  T. Am. Math. Soc           6469
  J. Math. Anal. Appl.       6004
Source: Journal Citation Reports (2004)
What’s wrong with grouping
by journal?
• 10 of the 200 most cited papers in CiteSeer
are unpublished technical reports; 15% of the
most cited papers are from conference
proceedings
• Open-access publication is increasing, but venue
information is often not available
• Hand-entered ISI citation data is noisy
• An article has only one venue, but journals cover
many topics
A topic model for N-grams
Determine whether the
next word will be part of
an n-gram based on the
current word and the
current hidden topic.
“White house” is a
collocation in politics, but
may not be one in real
estate.
Sample n-gram topics
1. Digital Libraries (102): digital, electronic, library,
metadata, access; “digital libraries”, “digital library”,
“electronic commerce”, “dublin core”, “cultural heritage”
2. WWW (129): web, site, pages, page, www, sites; “world
wide web”, “web pages”, “web sites”, “web site”, “world
wide”
3. Ontologies (186): semantic, ontology, ontologies, rdf,
semantics, meta; “semantic web”, “description logics”,
“rdf schema”, “description logic”, “resource description
framework”
4. Web services (184): web, services, service, xml,
business; “web services”, “web service”, “markup
language”, “xml documents”, “xml schema”
Assigning topics to documents
1. Build a 200 topic n-gram topic model on 300k
documents
2. Remove stopword or methodological topics (e.g.
“efficient, fast, speed”)
3. For each document d, if more than 10% of d’s tokens
are assigned to topic t, and that comprises more than
two tokens, assign d to t
Each topic is now an intellectual “domain” that includes
some number of documents. We can substitute topic
for journal in most traditional bibliometric indicators.
We can also now define several new indicators.
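The assignment rule in step 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the input format (one topic id per token), and the choice to count excluded-topic tokens in the denominator are all assumptions.

```python
from collections import Counter

def assign_topics(token_topics, excluded=frozenset(),
                  min_fraction=0.10, min_tokens=2):
    """Return the set of topics a document is assigned to.

    token_topics: one topic id per token of the document (assumed input format).
    excluded: topic ids dropped as stopword or methodological topics.
    A topic qualifies if it covers more than min_fraction of the tokens
    AND more than min_tokens tokens.
    """
    total = len(token_topics)
    if total == 0:
        return set()
    counts = Counter(token_topics)
    return {
        topic
        for topic, count in counts.items()
        if topic not in excluded
        and count / total > min_fraction
        and count > min_tokens
    }

# Hypothetical 20-token document; topic 3 stands in for a removed topic.
doc = [5, 5, 5, 5, 7, 7, 7, 9, 9, 12, 12, 12, 12, 12, 3, 3, 3, 3, 3, 3]
assigned = assign_topics(doc, excluded={3})
```

A document can land in several topics at once, which is what lets a topic stand in for a journal without forcing each article into a single venue.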
Impact Factor
Journal Impact Factor: Citations from articles
published in 2004 to articles in Cell published
in 2002-3, divided by the number of articles
published in Cell in 2002-3.
2004 Impact factors from JCR:
  Nature              32.182
  Cell                28.389
  JMLR                 5.952
  Machine Learning     3.258
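The same ratio can be computed for a topic by treating the topic as the "journal". The sketch below assumes a hypothetical data layout (a list of citing/cited pairs, a publication-year map, and a map from document to its groups); none of these names come from the talk.

```python
def impact_factor(citations, pub_year, groups_of, group, year):
    """Impact factor of `group` (a journal or a topic) for `year`.

    citations: list of (citing_id, cited_id) pairs.
    pub_year:  dict mapping document id -> publication year.
    groups_of: dict mapping document id -> set of groups it belongs to.
    """
    window = {year - 1, year - 2}
    # Articles in the group published in the two preceding years.
    targets = {
        doc for doc, y in pub_year.items()
        if y in window and group in groups_of.get(doc, ())
    }
    if not targets:
        return 0.0
    # Citations made in `year` that point at those articles.
    cites = sum(
        1 for citing, cited in citations
        if pub_year.get(citing) == year and cited in targets
    )
    return cites / len(targets)

# Hypothetical toy data, for illustration only.
pub_year = {"a": 2002, "b": 2003, "c": 2004, "d": 2004, "e": 2001}
groups_of = {"a": {"Cell"}, "b": {"Cell"}, "c": {"Nature"},
             "d": {"Cell"}, "e": {"Cell"}}
citations = [("c", "a"), ("d", "a"), ("d", "b"), ("c", "e"), ("b", "a")]
cell_if = impact_factor(citations, pub_year, groups_of, "Cell", 2004)
```

Because `groups_of` maps to a set, a document assigned to several topics contributes to each topic's impact factor.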
Topic Impact Factor
Broad Impact: Diffusion
Journal Diffusion: # of journals citing Cell divided
by the total number of citations to Cell, over a
given time period, times 100
Problem: relatively brittle at low citation counts. If
a topic/journal is cited only twice, by two different
topics/journals, it will have the maximum diffusion (100).
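A small sketch of the diffusion ratio makes the brittleness concrete. The input format (one citing group label per citation received) is an assumption for illustration.

```python
def diffusion(citing_groups):
    """Diffusion: distinct citing groups per 100 citations received.

    citing_groups: one entry per citation, naming the journal or topic
    of the citing document (assumed input format).
    """
    if not citing_groups:
        return 0.0
    return 100.0 * len(set(citing_groups)) / len(citing_groups)

# Two citations from two different groups: the maximum possible score.
rarely_cited = diffusion(["A", "B"])
# One hundred citations from two groups: a much lower score.
heavily_cited = diffusion(["A"] * 90 + ["B"] * 10)
```

The rarely cited group outscores the heavily cited one even though both are cited by exactly two groups, which is the low-count problem described above.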
Broad Impact: Diversity
Topic Diversity: Entropy of the distribution of citing topics
Better than diffusion at capturing the broad end of the
impact spectrum: the highest-diffusion topics are simply
the least frequently cited topics
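Topic diversity can be sketched as the Shannon entropy of the citing-topic distribution. The base-2 logarithm and the input format (one citing topic label per citation) are assumptions; the talk does not specify either.

```python
import math
from collections import Counter

def topic_diversity(citing_topics):
    """Entropy (in bits) of the distribution of citing topics.

    citing_topics: one topic label per citation received (assumed format).
    """
    counts = Counter(citing_topics)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over topics of p * log2(1 / p)
    return sum(
        (c / total) * math.log2(total / c) for c in counts.values()
    )

even_two = topic_diversity(["A", "B"])
even_four = topic_diversity(["A", "B", "C", "D"])
skewed = topic_diversity(["A"] * 90 + ["B"] * 10)
```

Unlike diffusion, entropy does not reward a topic merely for having few citations: it grows with the number of citing topics and with how evenly citations are spread across them.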
Topic diversity can also be measured for papers:
Longevity: Cited Half Life
Two views:
• Given a paper, what is the median age of citations to
that paper?
• What is the median age of citations from current
literature?
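Both views reduce to a median over citation ages. A minimal sketch, assuming publication years are the only date information available (function names are illustrative):

```python
from statistics import median

def cited_half_life(paper_year, citing_years):
    """View 1: median age of the citations a paper has received.

    citing_years: publication years of the documents citing the paper.
    """
    return median(y - paper_year for y in citing_years)

def citing_half_life(current_year, cited_years):
    """View 2: median age of the references made by current literature.

    cited_years: publication years of the documents being cited.
    """
    return median(current_year - y for y in cited_years)

received = cited_half_life(1998, [1999, 2000, 2004])
made = citing_half_life(2004, [2003, 2001, 1990, 2000])
```

Both functions raise `statistics.StatisticsError` on an empty list, so papers with no citations need to be filtered out first.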
History: Topical Precedence
Within a topic, what are the earliest papers that received
more than n citations?
Information Retrieval (138):
On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and
Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness
Based on the Weak Ordering Action of Retrieval Systems, Cooper
(1968)
Relevance feedback in information retrieval, Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness, Salton
(1971)
New experiments in relevance feedback, Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets,
Feiten and Gunzel (1982)
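A list like the one above can be produced by filtering a topic's papers on a citation threshold and sorting by year. The record layout and the cutoff `k` are assumptions, and the sample records are placeholders rather than real data.

```python
def topical_precedence(papers, topic, n, k=5):
    """Earliest k papers in `topic` with more than n citations.

    papers: list of dicts with 'title', 'year', 'topics' (a set),
            and 'citations' keys (assumed record layout).
    """
    qualifying = [
        p for p in papers
        if topic in p["topics"] and p["citations"] > n
    ]
    return sorted(qualifying, key=lambda p: p["year"])[:k]

# Placeholder records, for illustration only.
papers = [
    {"title": "paper A", "year": 1971, "topics": {138}, "citations": 40},
    {"title": "paper B", "year": 1960, "topics": {138}, "citations": 12},
    {"title": "paper C", "year": 1955, "topics": {138}, "citations": 3},
    {"title": "paper D", "year": 1950, "topics": {129}, "citations": 90},
]
earliest = [p["title"] for p in topical_precedence(papers, 138, n=10)]
```

Raising `n` trades recall for confidence: a higher threshold drops early but obscure papers, and it also filters out documents that only appear early because of a wrong date or topic assignment.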
Sharing: Topical Transfer
Get documents about "