					The h-index, h-core citation rate and the bibliometric profile of the Scopus database

Final pre-print for Peter Jacso Savvy Searching Online Information Review Volume 35, No 3.

     The h-index has been used to evaluate the research productivity and impact (as manifested by the
number of publications and the number of citations received) at many levels of aggregations for various
targets. It seems appropriate to examine the bibliometric characteristics of the largest multidisciplinary
databases that are the most widely used for measuring research productivity and impact. In this part
preliminary findings are presented about the Scopus database. It is to be complemented and contrasted
by the bibliometric profile of the Web of Science (WoS) database.

      WoS is to be made available in a new version in mid-2011 that will eliminate the 100,000-record
limit for search results. This is an essential prerequisite when using megadatabases with 40-50 million
records for gauging the gauges. While the h-index is considered to be robust (Vanclay, 2007; Henzinger
et al, 2010), i.e. not sensitive to missing records for documents, issues and even volumes of journals,
calculating the h-index and the citation rate for broad subject areas, such as the arts and humanities
domain of knowledge, may produce an unrealistically high h-index and absurdly high citations/item rates
because of non-obvious content and software deficiencies.


     Nearly 500 papers have been published since the seminal article (Hirsch, 2005) about the h-index,
the new bibliometric indicator which promised the opportunity to measure and compare through a
single indicator the life-time research productivity and impact of individuals. It is becoming more widely
used for creating league lists of prominent researchers in various disciplinary areas and in making
decisions on the tenure and promotion applications of mere mortals in academia. Later it was extended to
research groups and used at various levels of aggregation, ranging from the departmental to the
college/university and even the country levels.

     In the field of library and information science alone (and in related fields such as computer
science), there are several examples of calculating the h-index for individuals and
groups of faculty members, departments and institutions (Cronin and Meho, 2006; Prathap, 2006;
Oppenheim, 2007; Meho and Rogers, 2008; Franceschet, 2010; Jacso, 2008e; Levitt and Thelwall, 2009;
Jacso, 2010a; Lazaridis, 2010; Li et al, 2010; Norris and Oppenheim, 2010).

     The h-index and its variants have formed, directly or indirectly, an integral part of the nationwide
research assessment of departments of colleges and universities in Australia and the United Kingdom
(Butler, 2008; Moed, 2008). Country-level metrics about research performance may become an
important element in the standard national measures, but databases have significant limitations in this
regard (Jacso, 2009a and 2009b).

     Many other countries are studying the feasibility of carrying out similar exercises to inform
decisions. The use of the h-index was also extended to measuring journals’ performance (Braun et al, 2006;
Bar-Ilan, 2010), to scientific topics and compounds (Banks, 2006), and to an entire disciplinary area, such as
psychology (Garcia-Perez, 2010).

     Beyond the academic publishing sphere, the h-index and the other metrics are used intensively also
for quotidian but highly practical purposes, informing decisions in library collection management,
especially in justifying the subscription and cancellation of the most expensive academic journals and
other serial publications.

     The h-index for the same individual, department, college, country and journal may be very different
even when created from the same database. There are various reasons for the differences. One that is
hard to measure is the skill of the searchers doing the bibliometric search and their persistence in
discovering all the variants of the

         a) names of researchers, departments, colleges, journals and countries, including the
            different abbreviations and punctuations,

         b) incorrect names produced by the authors themselves in their reference lists, and/or
            the data entry operators, and/or the very low-IQ parsing software in the case of Google
            Scholar (Jacso, 2009e).

      There have been far fewer publications about the systematic and comprehensive exploration of
content limitations of the databases that can severely distort the results of routine searches as well as
bibliometric searches, but they are of special importance for librarians and other information
professionals because they focus on the LIS field (Meho and Yang, 2007; Meho and Rogers, 2008; Li et
al, 2010; Norris and Oppenheim, 2010).

     I have been concerned for a long time about the significant shortcomings in traditional databases
(Jacso, 1997), and have dedicated an entire series of papers to discussing the feasibility and reliability of using
various reference enhanced databases directly or with the help of third party utilities (Jacso, 2008a-e;
2009d; 2010a). The lack of specific, quantified information about the cited reference enhanced subset
of databases is of particular concern (Jacso, 2007a-b). Knowing about these limitations is especially
important to contrast reality with the PR claims of database producers (Jacso, 2009c).

    The other reason for the remarkable discordance of the h-index scores calculated from different
databases is the significant difference in the types, breadth and consistency of coverage of source
documents. At least this type of information is searchable, comparable and reproducible in Scopus and
soon in Web of Science in a fairly reliable way on the entire population of 40-50 million records
(depending on the version of WoS licensed).

     Google Scholar may have the broadest source base, and offers an excellent option for topical
searching by virtue of full text searching of several million primary documents, but for bibliometric
purposes its hits count and citation counts are as reliable as the numbers in Bernie Madoff’s profit

     Its massive mishandling of other metadata elements still represents a metadata mega mess, even
after the fixing of some of the errors in millions of records as reported earlier (Jacso, 2008b; Nunberg, 2009): it
creates phantom authors, and turns real authors into ghost authors and bibliometrically lost authors
(Jacso, 2009e).

     The new version (Version 5) of Web of Science is reported to eliminate the limit of 100,000
records on the result set. Scopus has had no limit on the size of the result set, and has always reported the
total number of hits (but shows “only” the first 2,000 records that matched the query). At the time of
the test this display limit was not a problem for determining the h-index and the citation rate/item of the entire
Scopus database.

The bibliometric profile of Scopus

     Elsevier launched this service at the end of 2004, and has been enhancing its content and software
continuously. It also has filled many of the gaps in coverage that I criticized earlier (Jacso, 2008c). It still
claims that it is the largest citation and abstract database. Only half of this statement is true, as it is
indeed the largest database in terms of the number of abstracts. It has abstracts for more than 69% of its
44.5 million records. The time span of its coverage goes back to 1823. About 47% of the master records
are for publications published before 1996, and 53% for those published from that year onward.

     When it comes to citations, my test result showed that 18.7 million records had one or more cited
references, representing 42% of the entire database content. Except for 27,300 records (predominantly
for papers published in psychology, library and information science and technology, and
multidisciplinary journals before 1996), all of the records enhanced by cited references in Scopus are for
documents published after 1995. To its credit Scopus makes this limitation clear by the warning that
“Scopus does not have complete citation information for articles published before 1996”. It is to be understood that
records for pre-1996 publications do reflect citations received from 1996 onward.

     The ratio of cited reference enhanced records increased slightly year by year, from 70% in
1996 to 88% in 2009. There was a small decline in this ratio to 86.5% in 2010. There are only 317,000
records for documents published in 2011 as of mid-March, so it is too early to say if the ratio of cited
reference enhanced records will bounce back to 88% or fall further.

     From the disciplinary perspective, the ratio of records enhanced by cited references is shown in
Figure 2 for the non-science subject areas. It must be taken with more than a grain of salt. Scopus
classifies the journals and other serial sources into 27 broad subject areas by assigning its journals to 21
science disciplines (ranging from Agricultural and Biological Sciences to Veterinary Sciences), 4 social
science disciplines, a single Arts & Humanities (A&H) category, and/or a Multidisciplinary category. For
comparison, Web of Science assigns its sources to about 150 broad subject areas. It has 17 categories
within the broad category of A&H alone.

     The distribution of records by the broad subject areas can be searched in Scopus using the
four-character code of the subject areas, such as SUBJAREA(psyc) for psychology, SUBJAREA(arts) for A&H,

         Figure 1. Cited reference enhanced records in the Scopus database from 1996 onward

     A journal or a single primary document, of course, may be assigned to more than one subject area.
This is natural when a journal or book series consistently includes documents covering multiple broad
subject areas, as Psychology of Music does. Both WoS and Scopus assign journals to more than one
major area. However, Scopus overdoes this, and it distorts the h-index for the broad subject areas
significantly, as will be demonstrated later for the A&H subject area.

     It is also to be noted that there is an UNDEFINED subset (which cannot be searched directly through
a four-character code, but can be computed by excluding all records that have one or more of the 27
other categories assigned). More than 650,000 records have no subject area
code(s) assigned. This further limits the validity of the h-index and other related metrics for broad
subject areas.

Figure 2. Cited reference enhanced records (in thousands) in the non-science subject areas and their
combinations in Scopus

Key citation metrics for Scopus

      The h-index of the pre-1996 subset of records for the documents published before 1996 is 1,451,
i.e. there are records for 1,451 documents in that subset that were cited more than 1,450 times. This
implies that the total number of citations must be at least 2,105,401 (1,451*1,451). Actually, the total
number of citations received by these 1,451 papers (called the h-core, representing the number of items
that contribute to the h-index) is 4,416,488, producing an average citation rate of 3,044 citations per
item in the h-core of the pre-1996 subset of the entire Scopus database.
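
     The relation between the h-index, the lower bound it implies for the number of citations, and the
citation rate of the h-core can be sketched in a few lines of Python. This is only a minimal illustration,
not the procedure used for the tests reported here; the function names and the toy citation counts are
hypothetical.

```python
def h_index(citations):
    """Return the largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def h_core_profile(citations):
    """Return (h, lower bound h*h, citations received by the h-core, citations per h-core item)."""
    ranked = sorted(citations, reverse=True)
    h = h_index(ranked)
    core_total = sum(ranked[:h])
    rate = core_total / h if h else 0.0
    return h, h * h, core_total, rate


# Toy data: hypothetical citation counts for eight documents.
toy = [25, 12, 9, 6, 5, 3, 1, 0]
profile = h_core_profile(toy)
print(profile)
```

     In the toy set the h-index is 5, so the h-core must have received at least 25 citations; it actually
received 57, which is the same kind of gap between the lower bound and the actual total that is reported above.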

                    Figure 3. The h-index of the pre-1996 subset of records in Scopus

    For the subset providing records for 23,455,354 documents published after 1995, the h-index is
1,339, so the total number of citations must be at least 1,792,921 (1,339*1,339). In reality, the total
number of citations received by these papers is 3,903,157, yielding a citation rate of 2,915 citations per
document in the h-core.

     In spite of the higher number of publications in the post-1995 period this is a somewhat lower h-
index and citation rate than for the pre-1996 subset. This is realistic because papers published in the
most recent years may not have reached even their peak year(s) of citations (to be) received, let alone
their entire time span of getting cited - which varies from discipline to discipline. In contrast, the papers
published before 1996 had a minimum of 16 years to attract citations, whereas a paper published in
2000 had only a 12 year time span as of this writing.

                Figure 4. The h-index of the subset of records for post-1995 publications

     For the entire Scopus database of 44.5 million records the h-index is 1,757. This implies that there
are at least 3,087,049 (1,757*1,757) citations received by records forming the h-core of the entire
Scopus database. Actually, the documents in the h-core received 5,922,946 citations from sources
covered by Scopus, yielding a citation rate per item of 3,371.
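
     The arithmetic behind the three sets of figures can be re-checked with a short Python sketch. The
h-index values and the h-core citation totals are taken from the text; the variable and function names
are hypothetical, and rounding the rate to the nearest whole number is an assumption about how the
reported rates were obtained.

```python
# (subset label, h-index, citations received by the h-core) as reported in the text
reported = [
    ("pre-1996", 1451, 4416488),
    ("post-1995", 1339, 3903157),
    ("entire database", 1757, 5922946),
]


def summarize(h, core_citations):
    """Return the minimum citations implied by h and the rounded citations per h-core item."""
    lower_bound = h * h
    rate = round(core_citations / h)
    return lower_bound, rate


summary = {label: summarize(h, cites) for label, h, cites in reported}
for label, (lower_bound, rate) in summary.items():
    print(f"{label}: at least {lower_bound:,} citations; {rate:,} per h-core item")
```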

                           Figure 5. The h-index of the entire Scopus database

     The h-index, of course, will increase constantly. This simple method of determining the h-index of
the entire database by jumping to the neighborhood of the 1,750th record in the list sorted by decreasing
order of citedness will work until the h-index reaches 2,000, the maximum number of records that can
be displayed in Scopus. However, Scopus offers a much more generous option: downloading 20,000
records within a few hours after the search. The users are informed by e-mail when the set becomes
available, and they have to start a new session, or have an active session, to download the set. It is
worth the wait because the data can be imported into a spreadsheet to determine the h-index by
scrolling down the result list sorted by decreasing order of the number of citations received.
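
     Instead of scrolling in a spreadsheet, the same determination can be sketched in Python on a
downloaded file. This is a minimal sketch under stated assumptions: the export is taken to be a CSV file
with a citation count column named "Cited by" (an assumption about the export layout; adjust it to the
actual header), and the function name and the sample rows are hypothetical.

```python
import csv
import io


def h_index_from_export(csv_text, column="Cited by"):
    """Compute the h-index from CSV text; `column` is the assumed citation count header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = []
    for row in reader:
        value = (row.get(column) or "").strip()
        # An empty cell is treated as zero citations.
        counts.append(int(value) if value else 0)
    counts.sort(reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites < rank:
            break
        h = rank
    return h


# Hypothetical sample rows standing in for a downloaded result set.
sample = """Title,Year,Cited by
Paper A,1998,10
Paper B,2001,4
Paper C,2003,
Paper D,1999,3
Paper E,2005,7
"""
print(h_index_from_export(sample))
```

     Sorting inside the function means the file does not have to be exported in decreasing order of
citations, although the downloaded set must still contain at least the h most cited records.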

     This excellent option allows the users to calculate other bibliometric indicators, such as the
average number of citations received by the 20,000 most cited documents, the citations/document rate for the
set of records that make up the h-core, or the g-index (Egghe, 2006).

     While the h-index ignores all the citations received by a document beyond the number needed to
belong to the h-core, the g-index does take into account the total number of citations received by those
items in the h-core. As Egghe defines it, the g-index is “the highest number g of papers that together
received g² citations”.
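
     A minimal Python sketch of the g-index follows. It reads the definition as "at least g squared
citations" for the g most cited papers and caps g at the number of papers in the set; both are
assumptions, and the function name and toy data are hypothetical.

```python
from itertools import accumulate


def g_index(citations):
    """Return the largest g such that the g most cited items together have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    g = 0
    for rank, running_total in enumerate(accumulate(ranked), start=1):
        if running_total >= rank * rank:
            g = rank
    return g


# Toy data: hypothetical citation counts for eight documents.
toy = [25, 12, 9, 6, 5, 3, 1, 0]
print(g_index(toy))
```

     For this toy set the g-index is 7, while only five documents have five or more citations each, which
illustrates how the highly cited items at the top of the list lift the g-index above the h-index.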

The h-index and the h-core subset’s citation rate for the A&H subject area of Scopus

     Determining the number of documents, their h-index and other bibliometric indicators by subject
areas must be done with great care and skepticism because journals and conference proceedings, and
thus articles, reviews, and conference papers can be assigned to multiple subject areas and counted for
each. For example, as of mid-March 2011 Scopus reported having nearly 900,000 records for
documents in the A&H subject area. This was seemingly phenomenal progress in the coverage of this
subject area.

     Less than two years earlier, when Scopus announced the coverage of this subject area, it reported
having records for 333,400 documents in Arts & Humanities. It also claimed, and still claims in its PR
materials, that “Scopus is the most holistic cross-disciplinary database with the broadest coverage in Science,
Technology, Medicine, Social Sciences and soon also in the Arts & Humanities”. (WoS had more than 3 million
items at that time for A&H; now it has very close to 4 million, and an h-index of 190).

     I strongly disagreed with this very misleading claim (Jacso, 2009c). This “broadest coverage” exists
only in the mind of the copy writer, who must have come with a strong background in late night
commercials, and in supermarket ads about miracle tools for improving abdominal muscles (abs) in one

     Beyond this disappointing abs ads culture in a highly intellectual context, nearly tripling the
coverage of the A&H subject area in less than two years is incredible, and so are the h-index of 312, and
average citation rate of 2,550 per item in the h-core subset for this subject area. For the naïve users this
may mean that Scopus has far better coverage of primary sources with far stronger impact and citation
rates in the A&H subject area than WoS.

     The non-naïve users know that this subject domain has by far the lowest journal citation rates of all
disciplines, simply because its citation culture is based much more on books than on journals. Actually,
the A&H segment in both Scopus and WoS has the pot belly symptom. The question is how Scopus can
come up with this much better h-index for the A&H subject area.

     Looking at the first few dozen records in the result set sorted by decreasing order of citations will
give the clue (Figure 6). The query SUBJAREA(arts) produces 875,438 hits. It is quite clear from the very
top of the result list that these items are not really related to the A&H subject areas. Some articles in
journals of psychology, pharmacology, neuroscience and engineering certainly can relate to A&H, but not
to the tune of hundreds of thousands, or tens of thousands, as is clear from the redesigned sidebar
cluster in the recent release of the software platform of Scopus. Especially odd are the 399,580 hits from
journals of pharmacology, toxicology, and pharmaceutics that are reported by Scopus to be a subset of
the A&H set retrieved.

                    Figure 6. Excerpt of the list of top cited papers attributed to A&H

     Even on a smaller scale, the assignment of too many subject area codes to journals and documents
can significantly inflate the citation metrics. This symptom is present in the full result set when
searching for the A&H subject category, and this massively misrepresents its metrics. Out of the top
hundred results, 81 are papers published in Psychological Review or Psychological Bulletin. The count is
152 in the top cited 200 subset, and 259 in the 312 documents that make up the h-core subset of the
A&H subject area in Scopus.
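
     The share of such journals in the top of a ranked result list can be checked with a short Python
sketch. The function name and the five-item ranked list are hypothetical; only the three counts in the
second part are taken from the text.

```python
def share_in_top(ranked_sources, target_sources, top_n):
    """Count how many of the top_n ranked records come from the target sources."""
    top = ranked_sources[:top_n]
    hits = sum(1 for source in top if source in target_sources)
    share = hits / len(top) if top else 0.0
    return hits, share


# Hypothetical list of source titles, most cited record first.
ranked = [
    "Psychological Review",
    "Psychological Bulletin",
    "Psychology of Music",
    "Psychological Review",
    "Philosophical Psychology",
]
targets = {"Psychological Review", "Psychological Bulletin"}
print(share_in_top(ranked, targets, 4))

# Counts reported in the text (subset size -> records from the two journals), as percentages.
reported = {100: 81, 200: 152, 312: 259}
percentages = {n: round(100 * k / n, 1) for n, k in reported.items()}
print(percentages)
```

     Expressed this way, the two journals account for roughly three quarters or more of each of the three
top cited subsets.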

     Most of these are not related to any sub-disciplines of A&H, as opposed to publications in some
other bi-disciplinary or multidisciplinary journals, that indeed belong to two or more major disciplinary
areas such as the Psychology of Music mentioned earlier, Pastoral Psychology, Applied Psycholinguistics,
Journal of Theoretical and Philosophical Psychology, Theory and Psychology, Political Psychology,
Philosophical Psychology, International Journal of Psychology of Religion, Journal of Humanistic
Psychology, and the Journal of Psychology and Theology.

     For regular searches it is unlikely that users would search by one of the major subject areas alone. It
is more likely that the smarter ones will use the subject area codes to refine a search term, such as
depression AND SUBJAREA(psyc), when results for the unqualified search term show up from journals of
earth and planetary sciences, materials science, and chemical engineering.

     Knowing the bibliometric features of databases, their own h-index and related metrics versus those
of the alternative tools can be very useful for computing a variety of research performance indicators.
However, in our rush to metricize everything, we need to learn much more about our tools before we
can rest assured that our gauges gauge correctly, or at least with transparent limitations.

     For example, for a college or university that provides education only in the A&H subject areas
(Anthropology, Linguistics, Religion, Ethnic Studies, Philosophy), the dilemma is whether to license the Scopus
database, which comes in one version with good coverage of science disciplines but poor coverage of the
A&H domain, or to license just the A&H subset of WoS. Learning the bibliometric profile of the tools used to
measure the research performance of researchers, departments, universities and journals can help to
make better informed decisions, and to discover the limitations of the measuring tools.


References

Banks, M.G. (2006), "An extension of the Hirsch index: indexing scientific topics and compounds", Scientometrics,
Vol. 69 No. 1, pp. 161-8.
Bar-Ilan, J. (2010), "Rankings of information and library science journals by JIF and by h-type indices", Journal of
Informetrics, Vol. 4 No. 2, pp. 141-7. doi:10.1016/j.joi.2009.11.006.
Braun, T., Glänzel, W. and Schubert, A. (2006), "A Hirsch-type index for journals", Scientometrics, Vol. 69 No. 1,
Butler, L. (2008), "Using a balanced approach to bibliometrics: quantitative performance measures in the
Australian Research Quality Framework", Ethics in Science and Environmental Politics (ESEP), Vol. 8 No. 1, pp. 83-
Cronin, B. and Meho, L. (2006), "Using the h-index to rank influential information scientists", Journal of the
American Society for Information Science and Technology, Vol. 57 No. 9, pp. 1275-8.
Egghe, L. (2010), "The hirsch-index and related impact measures", Annual Review of Information Science and
Technology, Vol. 44, pp. 65-114. Cronin, ed. Medford, NJ: Information Today, Inc.
Egghe, L. (2006), "An improvement of the h-index: the g-index", ISSI Newsletter, Vol. 2 No. 1, pp. 8-9.
Franceschet, M. (2010), "A comparison of bibliometric indicators for computer science scholars and journals on
Web of Science and Google Scholar", Scientometrics, Vol. 83 No. 1, pp. 243-58.
Garcia-Perez, M.A. (2010), "Accuracy and Completeness of Publication and Citation Records in the Web of Science,
PsycINFO, and Google Scholar: A Case Study for the Computation of h Indices in Psychology", Journal of the
American Society for Information Science and Technology, Vol. 61 No. 10, pp. 2070-85.
Henzinger, M., Sunol, J. and Weber, I. (2010), "The stability of the h-index", Scientometrics, Vol. 84 No. 2, pp. 465-
79. doi:10.1007/s11192-009-0098-7.
Hirsch, J.E. (2005), "An index to quantify an individual's scientific research output", Proceedings of the National
Academy of Sciences, Vol. 102 No. 46, pp. 16569-72.
Jacsó, P. (2006), "Deflated, inflated and phantom citation counts", Online Information Review, Vol. 30 No. 3,
Jacsó, P. (1997), "Content evaluation of databases", Annual Review of Information Science and Technology, Vol.
32, pp. 231-67.
Jacsó, P. (2007a), "How Big Is a Database versus How Is a Database Big", Online Information Review, Vol. 31 No. 4,
pp. 533-6.
Jacsó, P. (2007b), "The dimensions of cited reference enhanced database subsets", Online Information Review, Vol.
31 No. 5, pp. 694-705.
Jacsó, P. (2008a), "The Plausibility of Computing the H-index of Scholarly Productivity and Impact Using Reference
Enhanced Databases", Online Information Review, Vol. 32 No. 2, pp. 266-83.
Jacsó, P. (2008b), "The Pros and Cons of Computing the H-index Using Google Scholar", Online Information Review,
Vol. 32 No. 3, pp. 437-52.
Jacsó, P. (2008c), "The Pros and Cons of Computing the H-index Using Scopus", Online Information Review, Vol. 32
No. 4, pp. 524-35.
Jacsó, P. (2008d), "The Pros and Cons of Computing the H-index Using Web of Science", Online Information Review,
Vol. 32 No. 5, pp. 673-88.
Jacsó, P. (2008e), "Testing the calculation of a realistic h-index in Google Scholar, Scopus and Web of Science for
F.W. Lancaster", Library Trends, Vol. 56 No. 4, pp. 784-815.
Jacsó, P. (2009a), "Errors of Omission and their Implication for Computing Scientometric Measures in Evaluating the
Publishing Productivity and Impact of Countries", Online Information Review, Vol. 33 No. 2, pp. 376-85.
Jacsó, P. (2009b), "The H-index for Countries in Web of Science and Scopus", Online Information Review, Vol. 33 No.
4, pp. 831-7.
Jacsó, P. (2009c), "Database Source Coverage: Hypes, Vital Signs and Reality Checks", Online Information Review,
Vol. 33 No. 5, pp. 997-1007.
Jacsó, P. (2009d), "Calculating the H-index and Other Bibliometric and Scientometric Indicators from Google Scholar
with the Publish or Perish Software", Online Information Review, Vol. 33 No. 6, pp. 1189-200.
Jacsó, P. (2009e), "Google Scholar’s Ghost Authors and Lost Authors", Library Journal, Vol. 134 No. 18, Nov 1, 2009,
pp. 26-27.
Jacsó, P. (2010a), "Metadata mega mess in Google Scholar", Online Information Review, Vol. 34 No. 1, pp. 175-191.
Jacsó, P. (2010b), "Pragmatic issues in calculating and comparing the quantity and quality of research through
rating and ranking of researchers based on peer reviews and bibliometric indicators from Web of Science, Scopus
and Google Scholar", Online Information Review, Vol. 34 No. 6, pp. 972-82.
Lazaridis, T. (2010), "Ranking university departments using the mean h-index", Scientometrics, Vol. 82 No. 2, pp.
211-6. doi:10.1007/s11192-009-0048-4.
Levitt, J.M. and Thelwall, M. (2009), "The most highly cited Library and Information Science articles:
Interdisciplinarity, first authors and citation patterns", Scientometrics, Vol. 78 No. 1, pp. 45-67.
Li, J.A., Sanderson, M., Willett, P., Norris, M. and Oppenheim C. (2010), "Ranking of library and information science
researchers: Comparison of data sources for correlating citation data, and expert judgments", Journal of
Informetrics, Vol. 4 No. 4, pp. 554-63. doi:10.1016/j.joi.2010.06.005.
Meho, L.I. and Yang, K. (2007), "Impact of data sources on citation counts and rankings of LIS faculty: Web of
Science versus Scopus and Google Scholar", Journal of the American Society for Information Science and
Technology, Vol. 58 No. 13, pp. 2105-25. Available at:
Meho, L.I. and Rogers, Y. (2008), "Citation counting, citation ranking, and h-index of human-computer interaction
researchers: A comparison of Scopus and Web of Science", Journal of the American Society for Information Science
and Technology, Vol. 59 No. 11, pp. 1711-26. doi:10.1002/asi.20874.
Moed, H.F. (2008), "UK Research Assessment Exercises: Informed judgments on research quality or quantity?",
Scientometrics, Vol. 74 No. 1, pp. 153-61.
Norris, M. and Oppenheim, C. (2010), "Peer review and the h-index: two studies", Journal of Informetrics, Vol. 4
No. 3, pp. 221-32.
Nunberg, G. (2009), "Google’s Book Search: a disaster for scholars", The Chronicle of Higher Education, web
edition available at:
Oppenheim, C. (2007), "Using the h-index to rank influential British researchers in information science and
librarianship", Journal of the American Society for Information Science and Technology, Vol. 58 No. 21, pp. 297-
Prathap, G. (2006), "Hirsch-type indices for ranking institutions' scientific research output", Current Science, Vol.
91 No. 11, pp. 1439.
Vanclay, J.K. (2007), "On the robustness of the h-index", Journal of the American Society for Information Science
and Technology, Vol. 58 No. 10, pp. 1547-50.
