					                    Internet-Based Research in the Social Science of Religion
                                    William Sims Bainbridge

        For a decade, social scientists have been aware that much religion-oriented
communication takes place on Internet (Hadden and Cowan 2000). During that time, the amount
of activity online has increased greatly, and the forms of Internet usage have diversified
seemingly without end. It is also true that scientists have discovered new ways to extract data
from websites or other Internet-based systems, even when they are not explicitly religious, that
can benefit researchers interested in religion. No longer is the task merely studying the
innovative ways people can use Internet for religious purposes. It is now also possible to use
Internet-derived data to develop and test general theories of religious behavior that apply offline
as well as online.
        This paper will describe Internet-based research methods that are cutting-edge, meet
reasonable tests of validity and reliability, and are sufficiently practical that students can use
them for graduate papers and dissertations at the same time that their professors are preparing
professional publications based on them. The emphasis will be on quantitative methods, but
some qualitative methods will also be mentioned, in part to place the quantitative techniques in a
wider methodological context, as well as to identify directions in which innovations might be
developed. At the outset, we can identify seven general principles:

       1. Internet-based research can employ traditional techniques of social-science research,
       and can adapt those methods in fresh ways.

       2. Entirely new valid methodological approaches can also be developed, sometimes with
       only the most tenuous or metaphoric relations to earlier methods.

       3. To maximize both innovativeness and efficiency, collaborations between social
       scientists and computer scientists are often necessary.

       4. Even when working collaboratively with computer scientists, a social scientist needs to
       develop a significant expertise managing Internet data, including even some
       programming knowledge, but this is actually not difficult to achieve.

       5. Working with existing data collected from Internet, or with new data collected by an
       innovative online system, will require the social scientist to pay more attention to issues
       of data management than is common in more traditional contexts.

       6. The best results will come from studies that carefully but aggressively address
       methodological and theoretical issues together, realizing that the most important
       challenges and opportunities require deep thinking about both, and that insights from one
       can inform the other.

       7. Internet-related technologies and their social applications are in constant flux, so
       researchers should be looking for new possibilities, and the examples offered here are
       meant to inspire rather than constrain scientific creativity.

         Collaborations between social scientists and computer or information scientists will
require both sides to gain appreciation of the other's point of view. Social scientists in particular
will need to realize that many of the very best computer scientists conceptualize science very
differently, particularly without the same kind of dedication to theory and zeal in comparing
competing theoretical positions that social scientists love. One example will suffice,
an excellent recent computer science article about religion and information technology: "Re-
Placing Faith: Reconsidering the Secular-Religious Use Divide in the United States and Kenya"
by Susan P. Wyche, Paul M. Aoki, and Rebecca E. Grinter.
         Before we even consider the topic, it is important to note that this is a conference paper,
given at CHI 2008 in Florence, Italy. Conferences play an almost totally different role in
computer science from the role they play in social science, and CHI is the most prestigious and
influential scientific gathering on the relationships between human beings and information
technology. It is the annual conference of SIGCHI, the special interest group on human factors
in computing of the Association for Computing Machinery. Giving a paper at CHI is like getting
one published in Social Forces for a sociologist, but the publication is immediate, rather than
waiting a year or two as with social science paper journals. A social scientist who wants to
collaborate with computer scientists will need to adapt to the rough and rapid, but still seriously
reviewed, publication system in computer science.
         Another characteristic of this article that requires some adjustment on the part of social
scientists is that it seems to have a very practical focus, rather than being motivated by the desire
to test abstract theory. Noting the continuing and perhaps increasing significance of religion, and
the possibility that secular populations make greater use of information technology, the
researchers have carried out a series of studies to understand how information technologies could
be better designed to serve the distinctive needs of highly religious people, indeed to serve some
of their religious needs (Wyche et al. 2006, 2009a, 2009b). For example, in this study the
researchers discovered that religious people often want to remember points that were made in an
especially inspirational Sunday church sermon, and so they developed a note-taking system
using mobile phone technology to help them accomplish this in a versatile, convenient, and
cost-effective manner.
         A third characteristic of the study is that the investment in varied aspects of the
methodology has a very different balance from what we would expect to see in a professional
social scientific study. The research team collected data in both Atlanta, Georgia, and Nairobi,
Kenya, at great effort, but did so through somewhat unstructured interviews and ethnographic
observation with small numbers of individuals. This is standard in the field of human-computer
interaction research. The goal is to understand in depth what can be learned from people who act
as key native informants and who invest much of their own effort in the study, but without any
concern over what fraction of the general population these people represent. Their function is to
inspire innovation among the computer scientists, who design new technology through a sort of
collaboration with their research subjects.
         In the case of this fine study, the result is a contribution not only to knowledge, but even
more importantly to the existing store of design ideas from which technologists may draw, and a
contribution to the people of faith who will use future information technology designed to serve
religious purposes. For computer scientists, theory tends to mean one of two things. First of all,
it refers to mathematical theory typically concerning methods of calculating algorithms. The
criterion of good theory by this definition is that it guides calculations that are both swift and
accurate. Second, theory in the human-centered computing area really refers to design principles
to guide the creation of new technologies to serve specified human needs. In this case, the
computer scientists draw intelligently upon some social science of religion concepts, and they
accomplish good ethnography of Kenyan religious and community culture, but in the service of
future technologies to benefit religious people, rather than to frame abstract theories about
         One more feature of this study deserves mention as background for the present paper,
namely that it studies information technology broader than the term "Internet" would cover. The
people in Atlanta used Internet, but those in Nairobi used cellphones and text messaging over the
phones. Technically, Internet refers to a data communication network that uses the TCP/IP
protocol, but much of what you can access through Internet is not really native to it and may
originally use other technologies. The World Wide Web is a subset of the billions of files
reachable over Internet, those formatted with the Hypertext Markup Language (HTML), and
within the Web there are many files belonging to the Deep Web that cannot be accessed by
search engines because they are behind password protection or other barriers. Just as the Web is
a subset of Internet, Internet is a subset of The Net, which comprises all forms of electronic
communication. Already barriers are breaking down between traditional electronic media, and
the distinctions between radio and podcasts, television and YouTube, telephone and Skype are
historical anachronisms. Thus, while this paper will emphasize data that can indeed be accessed
over the current Internet, the reader should be alert to the fact that realities and definitions are
changing rapidly, and all modes of electronic communication are currently converging.
         Here we shall emphasize the usual social-scientific concerns with theory and methods,
more than technological results, but remain mindful of the somewhat different priorities of the
computer scientists who provide us with the needed technologies. We shall consider different
kinds of Internet-based research under six rough headings, arranged from work that is most
similar to traditional quantitative social-scientific methodologies, to work that is least similar but
still connects directly to the kinds of theories that social scientists have addressed for many
years. We begin with online questionnaires, which draw upon a century of survey research
traditions, then turn to recommender systems, which are very new but similar in many respects to
questionnaires. Geographic data analysis also has a century-old tradition in the social sciences,
but new sources of georeferenced data can be found online today. Although everybody is
familiar with search engines like Google and Alta Vista, they can be used in a number of ways to
collect data that can be analyzed in several ways, and more advanced natural language
processing methods naturally build on familiar features of search engines. New areas where old
theories can be applied include cultures inside virtual worlds.

1. Online Questionnaires

        Computers have been used to administer questionnaires for many years, but mass
administration online directly to respondents waited until the World Wide Web gained popularity
in the mid-1990s. Perhaps the most important traditional application before then was in
computer-assisted telephone interviewing. As pioneered by the US Census in 1790 and perhaps
rather earlier around the year 0 when Caesar Augustus sent agents to count the population of the
Roman Empire so it could be taxed, interviewers had long asked standardized questions verbally,
writing down the responses themselves rather than requiring the respondent to do it. I have seen
rough estimates that perhaps ten percent of the adults in the Roman Empire could read and write,
so most could not have filled out a paper questionnaire, but we should be mindful of the fact that
some people in modern societies cannot do so either, and each technology excludes at least some
potential respondents.
        Using a computer to do telephone interviewing has several advantages, some of which
transfer to online questionnaires. The interviewer reads the questions from the screen, and enters
the response with a single keypress or mouse click, or in some cases typing in the word or phrase
the respondent speaks. The computer automatically moves to the next question, saving the effort
of manually turning a page, and it can jump to contingent questions that might confuse the
interviewer and would often confuse respondents if the questionnaire were on paper. A common
example is questions about religious affiliation. Are you Catholic, Protestant, Jewish, Other, or
None? People who selected "Protestant" are then often asked to define exactly which Protestant
denomination they belong to, something one would not bother asking a Roman Catholic.
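
The contingent-question logic just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item identifiers, the `attendance` follow-up item, and the `administer` helper are hypothetical and do not come from any particular interviewing package.

```python
# Hypothetical item identifiers and wording, for illustration only.
QUESTIONS = {
    "affiliation": {
        "text": "Are you Catholic, Protestant, Jewish, Other, or None?",
        # Only the "Protestant" answer leads to the contingent item.
        "next": lambda answer: "denomination" if answer == "Protestant" else "attendance",
    },
    "denomination": {
        "text": "Which Protestant denomination do you belong to?",
        "next": lambda answer: "attendance",
    },
    "attendance": {
        "text": "How often do you attend religious services?",
        "next": lambda answer: None,  # end of this short instrument
    },
}


def administer(answers, start="affiliation"):
    """Return the ordered list of item ids a respondent is actually asked.

    `answers` maps item id -> the response that respondent would give.
    """
    asked = []
    current = start
    while current is not None:
        asked.append(current)
        # Each item decides which item comes next, given the answer.
        current = QUESTIONS[current]["next"](answers.get(current))
    return asked
```

A Catholic respondent would see only the affiliation and attendance items, while a Protestant respondent would also be routed through the denomination item, without either the interviewer or the respondent having to follow printed skip instructions.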
        Among the greatest advantages of computer-assisted interviewing is that it skips the often
laborious process of entering responses from a paper questionnaire into a computer for analysis.
It should be recalled that while computer-administration may be relatively new, computer
analysis is quite old, arguably dating from Hollerith's work on the 1900 census and even earlier
(Bainbridge 2004c).
        As Donald Dillman (2002) has noted, Internet-based questionnaire surveys are now one
of the important research options, and their disadvantages are somewhat reduced by the
increasing difficulty of getting good samples for telephone surveys. The chief issue for Internet-
based questionnaires is that their data will not be representative of the population as a whole,
both because many people – especially in some subgroups – will not have Internet access, and
because many people will refuse to answer a questionnaire online when invited to do so.
        However, I would argue that conceptualizing online questionnaires in terms of traditional
survey research is too limiting. As I understand the term, survey is not synonymous with
questionnaire. Rather it refers to an attempt to collect new data that are representative of the
population of interest. Conceivably, a survey could be done without asking any questions, for
example visiting a random sample of rural homes to visually determine what fraction of them
had indoor flush toilets. For at least two reasons, sociologists and political scientists had gotten
in the habit of assuming that every proper questionnaire needed to be administered to a random
sample.
        The first reason was descriptive. If the goal is to describe a population, then a census is
methodologically the best approach, but cost concerns often rule that out. A simple random
sample, if it is large enough, should accurately represent the population. Furthermore, if
nonresponse bias is also random, then it is possible to use statistical techniques to estimate the
sampling errors. Unfortunately, nonresponse biases are not random, and increasing fractions of
the population refuse to be surveyed, or simply cannot easily be located. Face-to-face
administration tends to get the highest response rate, but is exceedingly costly. Thus, a national
questionnaire like the General Social Survey will use a cluster sampling technique, to minimize
interviewer travel costs, and tends to advise against uncritical application of tests of statistical
significance which assume simple random samples. In his textbooks in research methodology,
Earl Babbie (2004) has been advising students that tests of statistical significance are not really
appropriate in sociological research, a controversial point but one that clearly highlights the
        Descriptive accuracy primarily serves journalistic, political, and policy purposes, rather
than scientific ones concerned with discovering and testing general theories. Polling research
earned the social sciences much prestige in the wider world, by offering insights and advice that
were credibly based on rigorous, scientific methodology. Journalists want to be able to say what
is happening to "people" in their society, or to the society in general. Politicians want to know
what the electorate thinks about the issues of the day, so they require a random sample of voters
– or of those mythical beasts, the "likely voters." Policy makers similarly need to know what is
happening to "the American family" or "the average citizen."
        When the General Social Survey was launched back in 1972, it was an expression of the
Social Indicators Movement that hoped to use the GSS to monitor conditions in the United States
so that policy makers could adjust government regulations and programs for maximum benefit.
Using questionnaire surveys as social indicators to guide government policy assumes a lot about
the way governments and particular political parties actually function, and for most of the years
since the birth of the GSS, sociological surveys were simply not a significant part of US
government decision-making.
        The second reason why representative samples are preferable is related to the fact that
social statistics tend to assume simple random samples, but goes a bit deeper than that.
Hopefully simple random samples minimize the possibility that the correlation between two
variables is the spurious result of other variables, or that the lack of a correlation results from a
real relationship that is masked by some unmeasured suppressor variable. This is a debatable
point, but in practice I suggest that many social scientists take this idea for granted without even
noticing it. Consider a random sample of the United States. Typically, as in the case of the GSS,
the sample leaves out "institutionalized" populations, children, Americans living abroad (or in
the armed forces), and perhaps the underclass and undocumented immigrants. But even if you
could get a true random sample of Americans, you would not have a random sample of human
beings. Americans are 5 percent of the world, and the current world is perhaps 5 percent of all
the humans who have ever lived. Thus there is a huge selection bias, and crucially for this point,
that bias may correlate with variables of interest.
        Rather than relying upon a random sample to limit spuriousness and suppression, which it
may not really do very well anyway, a better choice is replication. There are really two
functionally related ways to accomplish this. External replication means giving a questionnaire
to members of very different groups, to see if the results carry over from one to another. Internal
replication is accomplished when the use of subsamples or statistical techniques controlling for
additional variables accomplishes the same thing within a single dataset. Under favorable
conditions, both of these can be accomplished with online questionnaires, if effort is invested to
get a very large and diverse set of respondents, affording many opportunities for internal
replication, and if one is prepared to replicate key findings by some other method. Perhaps the
most famous pre-Internet example of external questionnaire replication is the Glock and Stark
(1966) study that initially surveyed Northern California church members in a very limited
geographic area, then subsequently replicated key findings with a national sample.
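
Internal replication can be illustrated with a minimal sketch, assuming the data are a list of records that each carry a subsample label, a yes/no predictor, and a numeric outcome. The field names and the helper functions `effect_by_subsample` and `replicates` are hypothetical, and a difference in means stands in for whatever statistic a real study would use.

```python
from statistics import mean


def effect_by_subsample(records, subsample_key, predictor_key, outcome_key):
    """Difference in mean outcome (predictor true minus false) per subsample.

    Assumes every subsample contains at least one case in each predictor
    category; otherwise statistics.mean raises StatisticsError.
    """
    by_subsample = {}
    for record in records:
        by_subsample.setdefault(record[subsample_key], []).append(record)

    effects = {}
    for name, rows in by_subsample.items():
        high = [r[outcome_key] for r in rows if r[predictor_key]]
        low = [r[outcome_key] for r in rows if not r[predictor_key]]
        effects[name] = mean(high) - mean(low)
    return effects


def replicates(effects):
    """True when the effect has the same nonzero sign in every subsample."""
    values = list(effects.values())
    return all(v > 0 for v in values) or all(v < 0 for v in values)
```

Agreement in sign across subsamples is a deliberately crude criterion; a fuller analysis would also compare the magnitudes of the effects and control for additional variables.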
        While there were good reasons for giving high priority to sampling with pre-Internet
questionnaires, this inescapably gave lower priorities to other values, notably item quality and
topic coverage. In the 1950s and 1960s, much more effort was invested in item-creation than
today, especially in development of multi-item and often multi-factor measurement scales. An
expensive national survey often cannot afford to include many items on a single topic, and the
ones that are included need to be intelligible to everybody. Thus, they are written to a "lowest
common denominator" standard, rather than reflecting the complexity and nuance of theoretical
debates in the social sciences of religion.
        If the aim is to study a small subgroup of the population, such as atheists, then one will
need either a huge sample, or a carefully targeted one, each of which might be achieved over
Internet (Bainbridge 2005). Cost considerations and the fact that the average person has no
opinion on many of the topics of interest to social science also militate against research on a
wide range of topics that are relevant only to subgroups within the population. This is especially
worrisome when the research concerns social and cultural change, because many new
phenomena will be unknown to the majority of respondents in a random sample of the general
population. Online questionnaires can address these issues in a number of ways, beginning with
where the items come from in the first place.
        At a first approximation, the material for questionnaire items can come from two very
different sources: (1) existing theory expressed in the publications of social scientists, or (2) the
experiences, beliefs, and behavior of the non-scientists we wish to study. In general, I do not
favor survey researchers writing items out of their own imaginations, as they sit in their
academic armchairs, but I advocate going through a serious process of discovery beyond the
boundaries of their own personal experience. My favorite classic example of items derived from
existing theory is the Mach Scale developed out of the works of Italian political theorist Niccolò
Machiavelli (Christie and Geis 1970). A number of statements were derived directly from
Machiavelli's publications, then augmented with a few others that expressed ideas that were in
his works but not stated so simply. A large collection of these items was administered to
college students, in a lengthy iterative process, and then statistical techniques were used to
develop a high-reliability 20-item scale, containing a couple of subscales. This Mach Scale was
then used in a wide variety of studies with different populations, which had the effect of
determining its generalizability beyond the original student population.
        Classical scale-construction work like this in personality and social psychology inspired
me to launch an Internet-based project in 1997, called the Question Factory. I posted a number
of online questionnaires consisting of open-ended items, asking people to express their views on
some topic. One asked, "Imagine the future and try to predict how the world will change over
the next century. Think about everyday life as well as major changes in society, culture, and
technology." After successful preliminary work with The Question Factory, this item was
included in the pioneering web-based questionnaire, Survey2000, organized by sociologist James
Witte and sponsored by the National Geographic Society (Witte et al. 2000). Approximately
20,000 respondents gave thoughtful written responses to this item, from which I was able to cull
2,000 distinct predictions, 100 of them about religion (Bainbridge 2003, 2004b, 2004d).
        A much more recent example, not directly about religion though it easily could have been, is
part of a doctoral dissertation about World of Warcraft (WoW) by a British student named Jane
Barnett (Barnett et al. in press). The focus was how people conceptualized anger and the
behaviors that made them angry, in this online virtual world. Barnett began, using online forums
and email rather than an online open-ended questionnaire, by eliciting examples of in-WoW
scenarios that had made 33 thoughtful respondents angry, and she edited and combined these to
produce a battery of 93 provisional items. Hundreds of other respondents rated them in terms of
how angry these behaviors would make them feel, and an interactive process employing factor
analysis and scale reliability measures reduced them to a 28-item scale with four subscales. One
finding that might be relevant to the social science of religion is that people become angry at
other people's negative behavior, regardless of whether that behavior was intended to harm. This
reminds us that the moral codes promulgated by religions may not directly relate to the cognitive
and emotional processes that determine people's senses of anger or appreciation.
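
The scale-reduction step can be sketched with one widely used reliability measure, Cronbach's alpha. This is an assumption for illustration: the source says only that "scale reliability measures" were used, and the function names and data layout here are hypothetical. With k items, alpha equals k/(k-1) times one minus the ratio of summed item variances to the variance of the total score.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k ratings."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # one tuple of ratings per item
    item_variance_sum = sum(variance(column) for column in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


def alpha_if_deleted(item_scores):
    """Alpha recomputed with each item removed in turn (needs k >= 3)."""
    k = len(item_scores[0])
    results = []
    for drop in range(k):
        reduced = [[v for j, v in enumerate(row) if j != drop] for row in item_scores]
        results.append(cronbach_alpha(reduced))
    return results
```

An item whose removal raises alpha noticeably is a candidate for dropping, which is one way a battery of provisional items can be pared down to a shorter scale over several rounds.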
        Once one has questionnaire items, one needs respondents. One of the factors that made
Survey2000 a success was the fact it was sponsored by the National Geographic Society, and the
NGS publicized the questionnaire on its website and in its main magazine. About 50,000 people
completed the questionnaire, most in the United States and Canada, but with at least 100
respondents from each of 33 other nations.
        A year later, the NGS helped publicize Survey2001, which actually consisted of separate
questionnaires for adults and children, and the adult questionnaire was administered online in
four languages. Readers of National Geographic magazine have diverse interests, but they are
probably far more aware of environmental and global issues than the average person. Thus many
of the topic areas were salient for most respondents, even though they were not a random sample.
Many items were organized in topical modules, and each respondent was given one at random.
After completing it, the respondent was given the choice of doing another one, also selected by
the computer at random. Again, this process trades representativeness of the sample against
salience of the items for the respondent, but analysis of the data showed great diversity of
opinion among respondents to any module. Given the very large number of respondents overall,
each module obtained many responses, and the article on the New Age I published in Journal for
the Scientific Study of Religion (Bainbridge 2004) was based on fully 3,909 English-speaking
respondents to the module I included in Survey2001.
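
Random assignment of topical modules is simple to program. The sketch below is illustrative only: the module names are hypothetical, and it assumes that a respondent who volunteers for another module is never given one already completed, a detail the description above does not specify.

```python
import random

# Hypothetical module names, for illustration only.
MODULES = ["new_age", "environment", "community", "technology"]


def next_module(completed, rng):
    """Pick a module at random from those the respondent has not yet done.

    Returns None when every module has been completed.
    """
    remaining = [m for m in MODULES if m not in completed]
    if not remaining:
        return None
    return rng.choice(remaining)
```

Passing in a `random.Random` instance rather than using the module-level functions makes the assignment reproducible from a seed, which helps when the assignment procedure later has to be documented or audited.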
        Teen-age respondents to the youth questionnaire in Survey2001 were recruited in two
very different ways. First, many were recruited off the National Geographic website. Second,
others filled out the questionnaire as a school assignment connected with Geography Awareness
Week. Teachers were recruited so that two classes did the questionnaire in each US state and
province of Canada. The fact that these two methods obtained very different kinds of
respondents permitted internal replication, and in one study I compared gender correlations with
1,191 respondents in each group (Bainbridge 2002).
        Inviting respondents is not the same thing as motivating them, and motivational factors
will vary depending on the nature of the population and the topic of the research. A study by
Dmitri Williams and his collaborators (Huh and Williams in press) is a marvelous example of
how motivation and salience can combine with opportunities to collect additional data online to
supplement a questionnaire. His study is part of a massive effort focused on the virtual world (or
online multiplayer role-playing game) EverQuest II. The Sony company, which created
EverQuest II, provided access to the raw data on its computer servers, documenting millions of
social and economic interactions between the avatars of the users. A random sample of players
was then sent an invitation to complete an online questionnaire, and offered a highly valuable
virtual object as payment, achieving a very high response rate. The questionnaire included a
well-developed battery of items about motivations for being in EverQuest II, as well as objective
questions about the respondent such as his or her gender. It was then possible to connect the
questionnaire responses to the characteristics and behavior of the avatars, for example comparing
the gender of the person and his or her avatar, and comparing the degree of aggressiveness across
both the real and virtual genders.
        Another study that shows how online methodological innovations can achieve scientific
gains was done in Japan and published in American Journal of Political Science (Horiuchi et al.
2007). This study combined a questionnaire with a randomized assignment experiment, and
employed analytical innovations as well. One of the issues in the 2004 election to the upper
house of the Japanese legislature was pension reform. Three questionnaires were used at
different stages in the process: respondent screening, pre-election attitudes, and post-election
attitudes. The sample was randomly assigned to one of three groups: (1) those asked to visit the
website of one of the two main political parties, (2) those asked to visit the websites of both parties,
and (3) those not asked to visit any website and not given the pre-election questionnaire. Of
course, the main comparisons concerned responses to the post-election questionnaire. Random
assignment to the treatment groups and the control group is of course a traditional method used
by experimentalists to get around biases introduced by non-random samples of respondents.
This study underscores the tremendous possibilities for methodological innovation, building on
traditional methods, which Internet offers.
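
Random assignment to two treatment groups and a control group can be sketched as follows. The group labels and the `assign` function are hypothetical, and the sketch assumes groups of nearly equal size, which may not match the allocation used in the Japanese study.

```python
import random

# Hypothetical labels for the two treatments and the control condition.
GROUPS = ["one_party_site", "both_party_sites", "control"]


def assign(respondent_ids, seed=0):
    """Randomly assign respondents to groups of nearly equal size."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    ids = list(respondent_ids)
    rng.shuffle(ids)
    # Deal the shuffled ids out round-robin, so group sizes differ by at most one.
    return {rid: GROUPS[i % len(GROUPS)] for i, rid in enumerate(ids)}
```

Because the assignment depends only on the shuffle and not on any respondent characteristic, differences among the groups on the outcome questionnaire cannot be attributed to self-selection into a condition, even when the sample as a whole is not representative.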

2. Recommender Systems

        A vast amount of information about modern culture lies latent in the databases of
commercial websites in what are usually called recommender systems (Resnick and Varian 1997;
Basu et al. 1998) but also sometimes referred to as collaborative filtering systems (Goldberg et
al. 1992; Canny 2002). With the growth of online merchandising, websites have invested heavily
in recommender systems of many kinds that advertise to a user products the merchant thinks that
particular individual might want to buy. A vast scientific literature now exists concerning
recommender systems, but essentially all of it is oriented toward making predictions of customer
preferences, rather than exploring how these systems could be used as social science research
tools (Herlocker et al. 2004). The most obvious way to use recommender systems to do social
science research on religion is to examine what religious movies or books cluster together, based
on preference correlations across large numbers of cases, employing statistical techniques almost
identical to the ones we have been using for decades with questionnaire data.
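
A preference correlation between two films can be computed over the raters they have in common. The sketch below is a minimal illustration using the ordinary Pearson coefficient; the function name and the dictionary layout (rater ID mapped to rating) are assumptions, not the format of any particular site.

```python
from math import sqrt


def item_correlation(ratings_a, ratings_b):
    """Pearson correlation between two films over raters who rated both.

    Each argument maps rater id -> rating. Returns None when fewer than two
    raters overlap or when either film's shared ratings do not vary.
    """
    shared = sorted(set(ratings_a) & set(ratings_b))
    if len(shared) < 2:
        return None
    xs = [ratings_a[r] for r in shared]
    ys = [ratings_b[r] for r in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)
```

Computing this coefficient for every pair of films yields a correlation matrix that can be handed to the same factor-analysis or cluster-analysis routines long used with questionnaire items.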
        In some cases, such as the one for the Netflix movie rental company, the system actually
uses a simple questionnaire. People who rented movies are invited to rate them on the website,
using a five-step scale. Then the system uses statistical methods to predict which other movies
the individual might want to rent, based both on that individual's expressed preferences, and the
preferences of other people whose preference patterns are similar. The Internet Movie DataBase
is not a rental company, but it also encourages people to rate movies, using a ten-point
preference scale. We will use some data from these two sources to illustrate typical research
procedures, admittedly on a much smaller scale than a real research project would use.
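
The general idea of predicting a rating from people with similar preference patterns can be sketched as below. This is not the algorithm used by NetFlix or the Internet Movie DataBase; the similarity measure (one over one plus the mean absolute rating difference on shared films) and both function names are assumptions chosen for simplicity.

```python
def similarity(ratings_a, ratings_b):
    """Similarity in (0, 1] based on films both people rated; 0.0 if none."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    mean_abs_diff = sum(abs(ratings_a[m] - ratings_b[m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + mean_abs_diff)


def predict(ratings, user, movie):
    """Similarity-weighted average of other people's ratings of `movie`.

    `ratings` maps user id -> {movie: rating}. Returns None when nobody
    comparable has rated the movie.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        weight = similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[movie]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None
```

In this scheme a rater whose past ratings match the target person's exactly counts five times as much as one whose ratings differ by four scale points on average, so the prediction leans toward the tastes of like-minded raters.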
        The Internet Movie DataBase has a category called "based on the Bible," including 10
theatrical-release films that were rated on a scale from 1 to 10 by at least 1,000 persons.1 Of
these, seven are also in the NetFlix database, and listed here in Figure 1. The IMDB data are
available for anyone to see on its website, whereas the NetFlix figures come from analysis of the
raw data, which were distributed to anyone who wished to register as a contestant in the first
NetFlix contest, designed to see if anyone could create a better algorithm for predicting people's
preferences. The contest data consisted of 17,770 separate text files representing an equal
number of movies, and some effort was required to get these data in shape for analyzing.

              Figure 1: Seven Bible-Related Movies in Two Recommender Systems

                                          IMDB   IMDB   NetFlix NetFlix
 Film                                   Raters   Mean    Raters    Mean
 The Ten Commandments (1956)            18,481    7.9    20,910     3.9
 The Last Temptation of Christ (1988)   18,628    7.5    12,739     3.4
 The Prince of Egypt (1998)             21,568    6.8    16,664     3.7
 Jonah: A VeggieTales Movie (2002)       1,585    6.4     7,775     3.6
 The Greatest Story Ever Told (1965)     2,976    6.3     3,180     3.6
 The Bible: In the Beginning… (1966)     1,179    5.7       955     3.3
 Left Behind (2000)                      3,816    4.6     4,646     3.3

         A quick look at the work preparing the NetFlix data can illustrate the need for data
management skills on the part of researchers. Each of the text files contained a long series of
short lines, each one representing the response by one person. Here are the first five lines of the
file for The Ten Commandments:


         The first number is an ID code representing the respondent; this is crucial, because it
allows the researcher to combine the data for different films rated by the same person. The total
number of respondents in the dataset is 400,000, but the ID numbers go considerably higher, one
of the little details of which the researcher needs to be aware when preparing to assemble the
dataset. The second, one-digit number, between the two commas, is the actual preference rating
for that respondent and film, a number from 1 (did not like) to 5 (liked very much). The last part
of each line is the date on which the person rated the film. The file for The Ten Commandments
has fully 20,910 such lines of data.
         Simply put, there are two ways to combine the necessary datafiles: (1) do it manually,
using whatever standard tools one is already familiar with, or (2) write a computer program
specially designed for the particular project. I use both methods, and generally find that I need to
do a little manual work before I really understand what features need to be coded into a program
that will do the "heavy lifting" for me.
         For example, using an ordinary word processor and spreadsheet, I manually combined the
data for the first three very popular films: The Ten Commandments, The Last Temptation of
Christ, and The Prince of Egypt. The first two films are live-action epics depicting portions of
the Old Testament and New Testament, respectively. The Prince of Egypt is a cartoon remake of
The Ten Commandments, even adopting the same debatable assumption that the pharaoh Moses
dealt with was Ramses the Great. The two movies about Moses treat the subject reverently,
whereas The Last Temptation of Christ was a very controversial film, based on a controversial
novel by Nikos Kazantzakis, as its Wikipedia page explains: "Like the novel, the film depicts the
life of Jesus Christ, and its central thesis is that Jesus, while free from sin, was still subject to
every form of temptation that humans face, including fear, doubt, depression, reluctance and lust.
This results in the book and film depicting Christ being tempted by imagining himself engaged
in sexual activities, a notion that has caused outrage from some Christians."2
         Thus these three films nicely illustrate ways in which works of popular culture may differ
along various dimensions. The word processor was used to replace the commas with tabs, so
that the data would automatically go into the correct columns when loaded in the spreadsheet.
Then a good deal of manipulation – the equivalent of programming by putting IF-THEN
statements into spreadsheet cells and doing several sortings – was required to get the data in
shape for analysis both in the spreadsheet itself and after transfer to the SPSS statistical analysis
software. For larger numbers of films, one would want to invest the effort to write a program
that could combine hundreds of files automatically.
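         A program of the kind just described might begin along the following lines. This is only a
minimal sketch, assuming each file holds lines of the form "ID,rating,date" as described above;
the function names, the sample lines, and the date format are hypothetical and are not taken from
the contest data.

```python
import csv
import io
from collections import defaultdict


def read_ratings(lines):
    """Parse 'respondent_id,rating,date' lines into {respondent_id: rating}."""
    ratings = {}
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines (e.g. a header line)
        respondent_id, rating, _date = row
        ratings[int(respondent_id)] = int(rating)
    return ratings


def combine(films):
    """Merge per-film dicts into {respondent_id: {film: rating}}."""
    combined = defaultdict(dict)
    for film, ratings in films.items():
        for respondent_id, rating in ratings.items():
            combined[respondent_id][film] = rating
    return dict(combined)


# Made-up sample lines standing in for two film files (not real contest data).
film_a = io.StringIO("101,4,2005-01-02\n205,5,2005-03-09\n999,3,2004-11-30\n")
film_b = io.StringIO("205,2,2005-03-10\n404,1,2005-06-01\n")

table = combine({"film_a": read_ratings(film_a), "film_b": read_ratings(film_b)})

# Respondents who rated both films are the cases available for a correlation.
both = sorted(r for r, row in table.items() if len(row) == 2)
```

Keying everything on the respondent ID, rather than on line position, means it does not matter
that the ID numbers run higher than the number of respondents. In real use one would pass open
file objects to read_ratings in place of the in-memory samples.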
         Of the total 42,572 respondents, 35,617 rated only one of these three movies, 6,169 rated
two, and 786 rated all three. This suggests researchers will need to deal with challenges of
missing data, but that whenever Internet provides very large numbers of cases for statistical
analysis, a sufficient number will connect any two variables. For the 4,240 people who rated
both movies about Moses, the films correlated significantly (r = 0.33). Just 2,634 people rated
both Ten Commandments and Last Temptation of Christ, and the correlation was only 0.02. A
total of 1,653 rated Last Temptation of Christ and Prince of Egypt, with a preference correlation
of only 0.05. A recent publication, using a slightly different subset of the NetFlix data, found a
solid positive correlation (0.31) between Ten Commandments and the reverent 2004 film, The
Passion of the Christ (Bainbridge 2007b).
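         The pairwise correlations above use only the people who rated both films in a pair. A
minimal sketch of that calculation follows, assuming each film's ratings are held in a dictionary
keyed by respondent ID; the ratings shown are made up, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_r(ratings_a, ratings_b):
    """Correlate two films using only respondents who rated both."""
    shared = sorted(set(ratings_a) & set(ratings_b))
    r = pearson([ratings_a[i] for i in shared], [ratings_b[i] for i in shared])
    return r, len(shared)


# Made-up ratings keyed by respondent ID.
film_a = {1: 5, 2: 4, 3: 2, 4: 1, 5: 3}
film_b = {2: 5, 3: 2, 4: 3, 6: 4}

r, n = pairwise_r(film_a, film_b)  # n is the number of shared raters
```

Reporting the number of shared raters alongside each coefficient matters, because it differs
from pair to pair. The sketch does not guard against a film whose shared ratings are all
identical, in which case the correlation is undefined.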
         The fact that many people rated both Moses films, but fewer rated either of them with the
controversial film about Jesus, suggests that there is a second way to code preference data – not
in terms of which scale rating was given, but whether a film was rated at all. I recoded the
ratings so that 1 represented any rating and 0 represented no rating. This analysis produced three
negative correlations, suggesting that the three films had significantly different audiences. The
two Moses films had a moderate negative correlation (-0.23), and the two live action films had a
somewhat larger one (-0.37). But there was a huge negative correlation between Prince of Egypt
and Last Temptation of Christ (-0.60), probably because the former is a cartoon feature which
families may have watched with their children, whereas the latter is decidedly an adult film.
         This recoding eliminated the very concept of missing data, so the correlations were based
on fully 42,572 cases. Although these correlations were calculated in a reasonable manner, quite
suitable for comparison purposes, it should be pointed out that the calculation did not include any
of the roughly 357,000 people in the dataset who did not rate any of the three films, something
one might need to consider doing for different research purposes.
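         The recoding can be sketched in a few lines, because the correlation between two 0/1
variables depends only on four counts. The example below is an illustration under stated
assumptions: the respondent IDs are made up, and the function name is hypothetical. It computes
the coefficient twice, once over people who rated at least one film and once over a larger
pool that also includes people who rated neither.

```python
from math import sqrt


def phi_from_sets(raters_a, raters_b, everyone):
    """Correlation between two 0/1 'rated at all' indicators over `everyone`."""
    n = len(everyone)
    n_a = len(raters_a & everyone)
    n_b = len(raters_b & everyone)
    n_ab = len(raters_a & raters_b & everyone)
    # Pearson r for two binary variables, written in terms of counts.
    return (n * n_ab - n_a * n_b) / sqrt(n_a * (n - n_a) * n_b * (n - n_b))


# Made-up respondent IDs.
raters_a = {1, 2, 3, 4}
raters_b = {3, 4, 5, 6}
rated_any = raters_a | raters_b            # only people who rated at least one film
whole_dataset = rated_any | {7, 8, 9, 10}  # adds people who rated neither

phi_restricted = phi_from_sets(raters_a, raters_b, rated_any)
phi_full = phi_from_sets(raters_a, raters_b, whole_dataset)
```

In this toy example the coefficient is negative over the restricted pool and positive over the
full pool, which shows how strongly the choice of cases can shape such a statistic.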
         Researchers who want to make use of recommender systems to chart cultural trends
should realize that people's preferences for cultural products like movies are only partly
determined by their ostensible topics. Also important for films are the featured actor, the year
the film was made, and what might be called the mood, style, or emotional tone of the picture.
An excellent example is what results when the 1959 movie Ben-Hur is entered into MovieLens, a
motion picture recommender system created for research purposes by GroupLens Research at the
University of Minnesota.3 The ten most similar movies, as reflected in correlations between
people's preferences, are:

       Ben-Hur: A Tale of the Christ (1925)
       Spartacus (1960)
       Ten Commandments, The (1956)
       Great Escape, The (1963)
       Patton (1970)
       Bridge on the River Kwai, The (1957)
       Seven Days in May (1964)
       Longest Day, The (1962)
       Fail-Safe (1964)
       Magnificent Seven, The (1960)

        The first of these is the silent film based on the same novel as the 1959 movie. Like
Ben-Hur, Spartacus depicted the Roman Empire and was released just the year after it; however,
the ideological content of Spartacus was not Judeo-Christian but class politics. Ten Commandments,
like Ben-Hur, was oriented toward the Bible and starred the same actor, Charlton Heston. The
other films date from roughly the same period as the target film, concern human conflict, and
tend either to have noble main characters or at least to raise issues about nobility of character.
One could say these are all serious action pictures with strong plots, either set in historical
settings, or in the case of the Cold War related movies, Seven Days in May and Fail-Safe,
historical from today's perspective. All have famous main actors. Thus, the religious dimension
of Ben-Hur is only one of the factors that makes it correlate with other films in people's
expressed preferences.
        Movies are a convenient example, but many kinds of products are covered by
recommender systems, and others include items with religious significance. One online
bookseller bases its recommender system on actual book-buying behavior, rather than
preferences expressed on a questionnaire scale. The bookseller's internal data would be
excellent for research purposes, but what is available online is not very detailed and is useful
chiefly for examples. On July 21, 2009, its website categorized 1,865 items in a general
Religion and Spirituality category, with these three heading the best seller ranking:

       The Family: The Secret Fundamentalism at the Heart of American Power by Jeff Sharlet
       The Secret by Rhonda Byrne
       The Biology of Belief: Unleashing the Power of Consciousness, Matter, & Miracles by
              Bruce H. Lipton

         According to its web page, customers who bought The Family also bought
Crazy for God: How I Grew Up as One of the Elect, Helped Found the Religious Right, and
Lived to Take All (or Almost All) of It Back by Frank Schaeffer and four secular books that were
critical of contemporary American culture. Apparently, one popular current theme is conspiracy
theories of American politics, some of which involve religion.
         Customers who bought The Secret also bought three related products by the same author,
plus Law of Attraction: The Science of Attracting More of What You Want and Less of What You
Don't by Michael J. Losier and You Can Heal Your Life by Louise Hay which carries the motto,
"What we think about ourselves becomes the truth for us..." Customers who bought The Biology
of Belief also bought two self-control inspirational books by Dr. Wayne W. Dyer, Excuses
Begone! and No Excuses!, and two mind control books by Lynne McTaggart, The Intention
Experiment: Using Your Thoughts to Change Your Life and the World and The Field Updated
Ed: The Quest for the Secret Force of the Universe. They also bought The Divine Matrix:
Bridging Time, Space, Miracles, and Belief by Gregg Braden. These examples remind one of
The Power of Positive Thinking by Dr. Norman Vincent Peale, and customers who bought that
classic book also bought classic self-help books by Dale Carnegie. Thus, a second popular
category of "Religion and Spirituality" books covers self-control books that vary in the extent to
which they employ religious rather than psychological or pseudoscientific metaphors. The
bookseller does carry many conventionally religious books, but these examples show
how a recommender system can be used to explore ongoing developments in the surrounding
culture that relate to religion without necessarily corresponding with traditional definitions.

3. Geographic Data Analysis

         This approach applies traditional quantitative methods of social ecology to new kinds of
data already available on the Web but little exploited so far. Social scientists have long
compared geographically-based religion-related variables to develop and test theories. Perhaps
the most familiar classic work is Emile Durkheim's 1897 book Suicide, which compared rates of
self-murder between Protestant and Catholic areas of Europe. Less familiar, but at least
available in English, was Henry Morselli's 1882 book on the same topic, which was the source of
many of Durkheim's numbers but less ambitious theoretically. However, the real classic in this
tradition is almost totally unknown, Adolph Heinrich Gotthilf Wagner's 1864 book, Die
Gesetzmässigkeit in den Scheinbar Willkürlichen Menschlichen Handlungen vom Standpunkte
der Statistik, which has never been translated. In my view Wagner's book is by far the most
admirable of the three, not merely for being earlier, but precisely because it is more cautious than
Durkheim in asserting theoretical explanations and does not, like Durkheim, leave out statistics
that inconveniently contradict the theory.
         Given the century and a half tradition of geographic statistics on religion, what Internet
chiefly contributes is access to a large number of new measures, or more convenient access to
data that have been available before. In the early 1980s, I counted classified telephone book
listings for astrologers and new religious movements in both the United States and Canada (Stark
and Bainbridge 1985). While some effort is required to assign them to the correct geographic
units, the chief challenge thirty years ago was finding the phone books in the first place. I
located many in my university library, others in a city's public library, and in a few cases I hired
a student to call information operators in small cities and ask them politely to check their own
local phonebook. For a study of the 22 metropolitan statistical areas in Canada, I actually
obtained my own personal collection of all the paper phonebooks.
         Online telephone directories greatly simplify this work, although they do not remove all
the hand labor. First, one must compare online telephone directories to identify the most
complete one. Typically, one must then work manually state by state in the US, entering the
desired search term or scanning all the listings for churches, because it is hard to write a
computer webcrawler program to do this automatically. For a recent tabulation of astrologers by
state, I found that the most accurate method was to paste each page of astrology listings into a
word processing document, then edit it with a combination of manual labor and search-and-
replace commands, before porting the text into a spreadsheet (Bainbridge 2007a: 117, 254).
Then more work was required to format the data, often simply because different listings had
different numbers of lines of data, and to find duplicate listings that needed to be removed.
Some sense of the magnitude of this work is reflected in the final total of unique listings, which
was 3,859, and three work days were required to prepare the data manually for computer
analysis.
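         The duplicate-removal step can be partly automated once the listings are in tabular
form. The following is only a sketch, assuming each listing has already been reduced to a name,
a phone number, and a state; the field names and sample listings are hypothetical placeholders,
and matching on name plus phone number is one possible rule rather than the one used for the
tabulation described above.

```python
from collections import Counter


def normalize(listing):
    """Build a comparison key from a listing's name and phone number."""
    name = " ".join(listing["name"].lower().split())
    phone = "".join(ch for ch in listing["phone"] if ch.isdigit())
    return name, phone


def unique_by_state(listings):
    """Drop duplicate listings, then count the survivors per state."""
    seen = set()
    counts = Counter()
    for listing in listings:
        key = normalize(listing)
        if key in seen:
            continue  # same name and phone already counted
        seen.add(key)
        counts[listing["state"]] += 1
    return counts


# Made-up listings; names and numbers are placeholders.
listings = [
    {"name": "Star Readings", "phone": "(555) 010-0001", "state": "MA"},
    {"name": "star  readings", "phone": "555-010-0001", "state": "MA"},
    {"name": "Moon Charts", "phone": "555-010-0002", "state": "VT"},
]

counts = unique_by_state(listings)
```

A rule this strict will miss duplicates that differ in spelling, so some manual checking
would still be needed.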
       Often, a religious denomination or movement lists its centers, clergy, or even members
on a website, and these lists may be used in the same manner to generate geographic rates. Figure 2 shows
five measures I developed from such websites.

                         Figure 2: New Religion Indicators per 100,000

                                                                 Yoga          Yoga
Geographic              Scientology      TM          3HO         Serve        Alliance
Regions of the US        Websites       Centers    Teachers     Teachers      Teachers
New England               2.12           0.22        0.48         6.29          8.17
Middle Atlantic           1.55           0.03        0.22         2.06          6.04
East North Central        1.19           0.04        0.08         0.82          3.20
West North Central        1.30           0.07        0.11         0.74          2.18
South Atlantic            4.01           0.05        0.19         1.14          4.44
East South Central        0.37           0.02        0.03         0.47          1.18
West South Central        0.88           0.03        0.19         0.52          2.11
Mountain                  2.98           0.07        0.69         1.33          6.17
Pacific                   9.60           0.10        0.45         0.87          4.13
USA                       3.26           0.06        0.25         1.30          4.10

        In 1998, the Church of Scientology launched 15,693 personal web pages in 11 languages
for members in 45 nations. Of the total, 8,762 or 55.8 percent were residents of the United States,
and they are tabulated by the nine divisions of the nation in Figure 2. The remaining columns
tabulate data for four Asian-oriented religious or spiritual movements, beginning with rates
based on 178 Transcendental Meditation centers in the United States in 2006. In the same year,
the website of the International Kundalini Yoga Teachers Association, the successor to the
Healthy-Happy-Holy Organization (3HO) of Yogi Bhajan, listed 747 3HO yoga teachers. A
website called Yogaserve listed 3,847 teachers of yoga in the US who have chosen to register,
and the website of the Yoga Alliance listed fully 12,166 teachers.
        Such data are very useful to test or develop theories about the socio-cultural
environments that are hospitable for new religious movements (Stark and Bainbridge 1985). In
general, western areas of the United States have high rates of geographic migration, low rates of
membership in conventional religious organizations, and probably as a consequence have high
rates of new religious movements. However, in Figure 2 as in earlier data, New England has
somewhat high rates, despite having church-member rates comparable to other eastern regions.
Among the theories that could be tested about why this is the case are three: (1) new religious
movements are attracted by the high density of elite educational institutions, (2) for historical
reasons New England is weak in religious sects that would provide an alternative to
mainstream denominations, or (3) something about the socially conscious (e.g., liberal) culture of
New England favors such movements. Like some earlier data, the table also suggests that the South Atlantic region may
be increasingly open to some kinds of spiritual movements, possibly in retirement communities
in Florida, or secular communities in Florida and around Atlanta and the District of Columbia.
Of course, data on any one new religious movement may reflect its own unique regional history,
and the geographic location of its headquarters, so the availability of data about numerous groups
over Internet is a great benefit for researchers.
         For the kinds of things counted in the above table, it makes perfect sense to use the total
populations of the geographic area to produce rates. In some cases, one might want to use some
subset of the population, such as adults or elderly people, as the divisor. In other cases, one
might need to use a completely different kind of variable for the divisor in a rate. For example,
one might divide the number of churches belonging to one denomination by the number of
churches belonging to all denominations. The first column of the table is based on websites
belonging to the Church of Scientology, but established for individual members, so population is
a good divisor. However, for rates with other kinds of websites in the numerator, one might need
websites in the denominator as well.
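         The arithmetic behind such rates is simple, but the choice of divisor changes what the
number means. The sketch below uses one hypothetical helper function and entirely illustrative
figures for an imaginary region; none of the numbers come from Figure 2.

```python
def rate_per(count, base, per=100_000):
    """Return `count` per `per` units of `base`."""
    if base <= 0:
        raise ValueError("base must be positive")
    return count / base * per


# Hypothetical region: all numbers are illustrative, not taken from Figure 2.
teachers = 120
population = 4_000_000
adults = 3_000_000
denomination_churches = 45
all_churches = 900

per_capita = rate_per(teachers, population)  # per 100,000 residents
per_adult = rate_per(teachers, adults)       # per 100,000 adults
# A different kind of divisor: one denomination's share of all churches, in percent.
church_share = rate_per(denomination_churches, all_churches, per=100)
```

The same count gives a rate of 3 per 100,000 residents but 4 per 100,000 adults, so any table
of rates should state which divisor was used.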
         For example, one could compare all the web pages hosted by the governments of US
states, to see what fraction of them in each state contained a religion-related word like "church."
At one time, one could get decent geographically-based counts from searching websites in each
of the fifty US state domains, because originally the .us domain was limited to governments.
Thus, one could enter "church" into Google, restricting the search to the Massachusetts state
domain, to get all the Massachusetts government web pages registered in the .us domain that had
the word "church" on them. More recently, the .us domain was opened up, so that citizens and
non-governmental organizations can use these domains, and the implications for social science
are not yet clear.
         When basing rates on ratios of websites, one should be alert to the possibility that
relationships will be non-linear, because for example very small-population states may need
pages covering a wide range of topics, almost as wide as for large-population states. The basic
lesson is that one must become familiar with one's data, and think carefully about what social
process produced the cases, in order to know what the statistics actually measure.
         Some researchers may want to invest in developing cooperative relationships with
corporations that have access to geographically-based data through their online business. For
example, Google offers businesses a complex service called Google Analytics, which can
produce maps and tables of the numbers of people accessing a given web page from different
geographic locations.4 In many cases, a company's website provides geographical data but in an
inconvenient form, and thus working with the company to obtain the data directly could be much
more efficient. For example, I just entered the word "Christ" into the eBay website and
discovered 9,397 items for sale whose descriptions contained the word "Christ." For each, I
could manually look at the advertisement to see geographically where the item was, but doing so
for all of them would be exceedingly tedious.

4. Search Engines

        Among the most heavily used online services – and one of the most useful for social
scientists in often unexpected ways – are search engines like Google. Although some details of
each search engine are kept secret by the company offering it, they are based on principles from
the cognitive and social sciences, as well as on computer science. Thus, social scientists of
religion would do well to learn as much as they can about their research potential, and this
section of the current essay can only scratch the surface. A good starting point for readers who
want to learn more is the classic book Finding Out About by Richard Belew (2000).
        When the World Wide Web was launched in the early 1990s, creators of web pages were
encouraged to put keywords describing the page in a hidden area of the HTML code that could
be searched but would not be visible to the average user. Unfortunately, people very quickly
gamed the system, putting popular but irrelevant terms in the code. In addition, as the Web grew
– now with over a trillion pages – it became impossible to search it in real time. Commercial
search engines index the Web by sending crawler programs out across it looking for new pages.
They categorize web pages in terms of the words in the part of the code visible to users, but for
many searches the number of pages containing the search term is enormous. I just this moment
searched for "God," and Google gave me 469,000,000 web pages on which to find Him!
         One response, exemplified by the Alta Vista search engine, was to allow the user to do
the Boolean searches preferred by librarians. Currently, Alta Vista allows the user to fill in any
of four different text fields: all of these words, this exact phrase, any of these words, and none of
these words. When I just now searched for "God," Alta Vista estimated it could find
1,450,000,000 pages containing this word. When I told it to search for "God" but only on pages
that did not contain "Jesus" or "Christ," the estimated number of hits declined to 1,160,000,000.
Clearly, this is still too large a number of pages for me to visit in this life. Therefore, modern
search engines need to augment the traditional search for keywords with some method for
prioritizing the pages. As it happens, the first hit Alta Vista gave me in this more restrictive
search was a Wikipedia web page listing names of God in Judaism, clearly a very appropriate
page given my search terms. Google's solution to the prioritization problem was PageRank, an
algorithm based on links between web pages, measuring what fraction of other relevant pages
link to the page in question, thus a measure of its popularity for people interested in the topic of
the search (Brin and Page 1998).
         Most users of search engines seem unaware of the special ways in which they can be
used, both the different ways in which searches can be framed, and the potential uses of the
results of a search. An example of how both kinds of awareness can be useful to the researcher
is the possibility of exploiting the ability of several search engines to limit searches to specified
Internet domains. Googling "God site:edu" gives you 4,350,000 pages that contain the word
"God" which are in the ".edu" domain reserved primarily for US educational institutions.
Googling "God site:gov" gives you the 826,000 US government pages that refer to God.
Restricting the same search to the immense website of the National Institutes of Health gives
you the 8,470 pages there that mention God. Given that different Internet domains represent
different provinces of culture and society, comparing across domains can be useful for social
scientists.
         When I did the research for Figure 3 in 2006 (Bainbridge 2007a: 153, 257), Google
estimated that 173,000,000 pages contained the word "God." Of these, 11,900,000 were in the
.edu domain, and 82,200,000 were in the .com domain. The ratio of these two numbers
(.edu/.com) is 0.145 or 14.5 percent. This is a measure of how educational versus how
commercial the concept is, but only if compared with the ratios for other terms. Similarly, the
ratio of .gov to .net pages, 9.4 percent, is a measure of how governmentally official the concept
is. Note that the word "church" has higher ratios, reflecting the fact that churches are important
educational and civic institutions, as well as religious ones. In contrast, words relating to
agnosticism and atheism are relatively rare in official institutions of modern society, despite all
the debates about secularization.
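         The ratio calculation is easy to reproduce. The sketch below recomputes both ratios
for three of the words, using the page counts (in thousands) reported in Figure 3; the function
and variable names are my own illustrative choices.

```python
def ratio_percent(numerator, denominator):
    """Ratio of two page counts, as a percentage rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


# Page counts in thousands, copied from Figure 3.
figure3 = {
    "Bible": {"edu": 4460, "com": 39400, "gov": 199, "net": 3210},
    "church": {"edu": 23100, "com": 65600, "gov": 2320, "net": 6420},
    "God": {"edu": 11900, "com": 82200, "gov": 792, "net": 8460},
}

# For each word: (.edu/.com ratio, .gov/.net ratio), both in percent.
ratios = {
    word: (ratio_percent(c["edu"], c["com"]), ratio_percent(c["gov"], c["net"]))
    for word, c in figure3.items()
}
```

Because the search engine reports estimates rather than exact counts, small differences
between ratios should not be over-interpreted.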
         Another useful search trick is to seek the web pages that link to a particular other web
page. For example, a Google search for pages linking to the home page of The Association of
Religion Data Archives, a prominent online digital library, returns 728 hits, including a list of
religion-related websites on the website of Paul Brians of Washington State University.

                 Figure 3: Google Estimated Frequency of Words on Web Pages

                 Pages Containing the Word (thousands)                   Ratios
 Word        All domains     .edu     .com    .gov    .net    .edu/.com    .gov/.net
 agnostic          5,040      140    2,640      14     230         5.3%         6.3%
 atheism           4,970      109    2,750       1     431         4.0%         0.1%
 atheist           7,660      118    4,740       1     393         2.5%         0.2%
 Bible            68,400    4,460   39,400     199   3,210        11.3%         6.2%
 church          160,000   23,100   65,600   2,320   6,420        35.2%        36.1%
 God             173,000   11,900   82,200     792   8,460        14.5%         9.4%

        Paul's page offers a good example of how sophisticated users most efficiently use web
pages. A naive user would laboriously click on each link and look at the page it leads to. A
researcher on religion websites should probably do something quite different, opening the source
code from the browser (View/Source in Internet Explorer and View/Page Source in Firefox).
This immediately lets the user see the page description text which Google displays, and Paul's
rather responsible list of hidden keywords – although "cool sites" is debatable:

       <META NAME="DESCRIPTION" CONTENT="Paul Brians' list of outstanding
           Websites relating to the world's religions.">
       <META NAME="KEYWORDS" CONTENT="religion Christianity Islam Judaism
           Jewish Buddhism, Hinduism Catholic Protestant cool sites">

Later in the HTML code the user would see the actual links, the first eight of which are:

       <A HREF="">Academic Info on
       <A HREF="">Esoterica: The Journal of Esoteric
       <A href="" target="new">Internet Sacred Text
       <A HREF=""></A> Religion statistics
       <A HREF="">Review of Biblical Literature</A>
       <A HREF="">3E
              Encyclopedia</A> Information about various bodies of mystical and religious
       <A HREF="">Religion</A>
       <A HREF="">American Religion Data Archive</A>

        Note that each of these gives the link itself plus words that appear on Paul's page where
the user would click to go to the site. A sophisticated user would copy the whole section of the
source code into a word processor, search and replace every "<" or ">" or quotation mark with a
tab, then dump the result into a spreadsheet, which if set correctly will immediately activate the
links just as if they were on Paul's web page. But now the user can save the information, access
it conveniently later on, and add other information about the sites if desired.
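         The same extraction can be done with a short program in place of search-and-replace.
The sketch below uses the HTML parser in the Python standard library to collect each link and
its visible text, then writes them as tab-separated lines that a spreadsheet can open. The class
name is hypothetical, and the example.org addresses are stand-ins, not the real links on Paul's
page.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def _flush(self):
        # Store the link being read, if any, then reset the buffers.
        if self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
        self._href, self._text = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._flush()  # tolerate a missing </a> before the next link
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def close(self):
        super().close()
        self._flush()  # keep a final link that was never closed


# Placeholder markup; example.org addresses are stand-ins, not the real links.
page = (
    '<A HREF="http://example.org/one">First  site</A> about religion\n'
    '<a href="http://example.org/two" target="new">Second site\n'
    '<a href="http://example.org/three">Third</a>'
)

parser = LinkCollector()
parser.feed(page)
parser.close()

out = io.StringIO()
csv.writer(out, delimiter="\t").writerows(parser.links)
```

The parser treats tag and attribute names case-insensitively, so it handles both the
upper-case and lower-case styles seen in the listing above, and the flush on each new link is
there because hand-written pages often omit closing tags.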
        Search engines not only allow one to map the relationships between websites (or the
topics they represent) in a kind of conceptual space; they can also chart changes over time. For
example, Google offers a trend analysis – or rather a pair of analyses, one based on Google
searches by individuals, and one based on how frequently a given topic has appeared in Google
news stories.6 For example, I entered the word "Scientology" into Google Trends, and got the
graph shown in Figure 4.

               Figure 4: The Result of Entering "Scientology" into Google Trends

Note that Google adds flags to prominent news stories that might explain some of the peaks in
the search popularity graphs. In this case, there were six, and they link to stories as follows:

       A. Stars turn out for Cruise's Scientology wedding
     – Nov 18 2006
       B. Clearwater, Fla.: Scientology stronghold
              Boston Herald – Sep 23 2007
       C. Germany to ban Scientology
              TransWorldNews (press release) – Dec 7 2007
       D. Cruise Scientology Video Surfaces Online
     – Jan 17 2008
       E. Scientology helped Cruise overcome dyslexia
              Frontline – Jan 5 2009
       F. French court tries Church of Scientology
              WOOD-TV – May 25 2009

         In 2005, when the search data peak, Google does not call out a news article, but one can
get a sense of what was happening by entering "Scientology 2005" into the main Google search
portal. The highest ranked site that came up when I did this on June 12, 2009, was an English-
language version of a German government report critical of Scientology,7 which begins with a
link to a news article at the English-language site of the German news magazine, Der Spiegel,
"Germany Prepares to Ban Scientology."8 It appears that public interest in Scientology is
aroused whenever the mass media raise controversies about it, or when a Scientologist celebrity
like Tom Cruise gets unusual publicity.
         If one were doing a research project over a significant period of time, it would be
possible to access and save websites periodically, and then analyze changes. This might work
especially well when a short series of events or a heated online argument made people update the
sites frequently. For a longer-term historical perspective, one may turn to the Wayback Machine
of the Internet Archive.9 One enters a website URL, such as that of the Church of Scientology,
and the Wayback Machine offers historical versions of it. Wayback archived the Scientology
website three times in 1996, beginning November 14, and 29 times in 2008. The peak year was
2005, in which it did so 355 times. Interestingly, the Wayback Machine does not include
historical pages from the most prominent anti-Scientology website, and only this generic message
appears: "Siteowners might have also requested that their sites be excluded from the Wayback
Machine. When this has occurred, you will see a 'blocked site error' message." For a researcher
studying conflict around new religious movements, this constitutes data as much as it does
missing data.
         In earlier research (Bainbridge 2007b), I compared a pro-Scientology website with an
anti-Scientology website, plus three sites about The Family (Children of God), two of them
opposed to that group, by analyzing links to pairs of websites. Entering a combined link query
for two addresses into the now-defunct MSN search engine produced all the websites that
linked to both Scientology's official site and the most prominent anti-Scientology site. That
does not seem to work for Bing, MSN's successor, and Google returns the sites that link to either
of the pair, when what we need is only those that link to both. However, Alta Vista still permits
this kind of search. Completing similar double link searches for all possible pairs of a set of
religion-related sites would provide data to map their degrees of similarity, without relying on
the words contained on them, even though most uses of
search engines are based on keyword searches. Already, researchers have begun directly
examining the links on religion-oriented web pages as a way of charting the topography of faith
(Scheitle 2005).
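         Once the pages linking to each site have been collected, the pairwise comparison can be
sketched as follows. This is an illustration under stated assumptions: the site labels and page
identifiers are hypothetical, and the Jaccard overlap (shared linking pages divided by all
linking pages) is my own choice of similarity measure here, not the measure used in the
research cited above.

```python
from itertools import combinations


def colink_similarity(inlinks):
    """Jaccard overlap of the pages linking to each pair of target sites."""
    scores = {}
    for a, b in combinations(sorted(inlinks), 2):
        both = inlinks[a] & inlinks[b]    # pages linking to both sites
        either = inlinks[a] | inlinks[b]  # pages linking to at least one
        scores[(a, b)] = len(both) / len(either) if either else 0.0
    return scores


# Hypothetical target sites and the pages found linking to each.
inlinks = {
    "site_pro": {"p1", "p2", "p3", "p4"},
    "site_anti": {"p3", "p4", "p5"},
    "site_other": {"p6"},
}

scores = colink_similarity(inlinks)
```

The resulting table of pairwise scores could then be fed to any standard clustering or
scaling routine to map the sites relative to one another.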

5. Natural Language Processing

       For more than four decades (Stone et al. 1966), social scientists have used computers to
analyze written texts, but the recent explosive development of new approaches and online
sources of written materials have greatly expanded the opportunities for this kind of work. In
computer science today, natural language processing (NLP) refers to a major research area and
to numerous software tools for collecting, analyzing, and transforming written text and
recordings of spoken language, and even for automatically analyzing human gestures with
computer vision techniques (Martin 2004). The best-developed and most useful current methods of value to
social scientists of religion involve traditional written text.
         All search engines make some use of the text on websites, but they generally just look for
the words entered by the user, and augment this information with non-textual information such
as the number of incoming links to each web page from other web pages having similar textual
content. One example that goes much further is Clusty, a meta-search-engine that
sends the user's query to several independent databases, combines the results, then clusters them
in terms of the words that distinguish them from each other. For example, entering the word
"Bible" into Clusty returned 230 websites from seven sources, with somewhat overlapping
results:

       Ask - Top 82 results retrieved out of 18,740,000 in 0.091 seconds.
       Gigablast - No results retrieved in 1.039 seconds.
       Live - Top 82 results retrieved out of 83,900,000 in 0.55 seconds.
       NY Times - No results retrieved in 0.321 seconds.
       Open Directory - Top 82 results retrieved out of 8,510 in 0.325 seconds.
       Sponsored Listings - Top 4 results retrieved out of 4 in 0.317 seconds.
       Yahoo! News - Top 10 results retrieved out of 25 in 0.772 seconds.

         The system clustered these sites into ten major categories: Bible Study (33 sites), King
James (32), Bible Search (16), New Testament (15), Audio (16), Ministry, Church (14), Free
Bible (14), Netbible (10), Pictures (10), and (7). While there is nothing
surprising in this list, it does suggest how people use the web to explore the Bible. We see that
for English-speaking users, the King James version still stands out from all the rest. The Audio
sites let the user listen to recordings of people reading chapters, whereas the Pictures category
includes sites that offer images depicting Bible stories, and Free Bible identifies sites that offer
free Bibles, free software to read Bibles on the user's PDA, or in other ways combine the words
"free" and "Bible." Indeed, statistical analysis of the copresence of pairs of words on websites is
one of the main tools used for clustering them (Cilibrasi and Vitanyi 2007). Bible Search sites
take the process one step further, by facilitating searching the Bible for desired quotations or
topics, and the last category is simply the most prominent of these sites,
which is especially good for comparing passages across translations and languages.
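
         One copresence measure of this kind is the normalized distance defined by Cilibrasi and
Vitanyi (2007), which is computed from page counts returned by a search engine. The Python
sketch below shows the formula only; the function and argument names are illustrative, and
nothing here implies that Clusty itself uses this particular measure.

```python
import math

def normalized_distance(fx, fy, fxy, n):
    """Normalized co-occurrence distance between two words, from page counts.

    fx, fy: number of pages containing each word
    fxy:    number of pages containing both words
    n:      total number of pages indexed
    Smaller values mean the two words are more closely associated.
    """
    if fxy == 0:
        return math.inf  # the words never appear on the same page
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

Two words that always appear together receive a distance of zero, and the distance grows as
the pages containing both become rare relative to the pages containing either one.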
         Clusty, and other systems like it as they develop in the near future, can be used to explore
the society's orientations toward a very wide range of religious topics, both technical and
esoteric, as well as commonplace. For example, entering "christology" classifies 169 websites
by keywords thus: Doctrine (21 sites), God (19), Bible (15), Definition (11), History (11), Trinity
(11), Bibliography, University (7), Pictures (7), Course, Catholic (6), and Incarnation (6).
Entering the name of a Canadian new religious movement called "Raelians" clusters 188 sites:
Cloning (47 sites), Cult (23), UFO (15), Claude Vorilhon (12), Aliens (14), Blog (13), Intelligent
Design (5), Love (7), Media (8), and UFOland, Raelians Target Las Vegas (5). In fact, the
Raelians were founded by Claude Vorilhon, believe that aliens have brought the truth in UFOs,
stress love, and seek to clone human beings, all topics identified by Clusty automatically. The
last category refers to five sites that report the group's latest activity, and one of them says: "The
Raelian Movement is announcing plans to build a UFOland in Las Vegas where visitors can
attend a Happiness Academy and see a full-size replica of a UFO."10
        Serious research of this kind would want to use large corpora of data, which might or
might not be available over Internet, with a professional and well-defined software system for
clustering texts on the basis of the words in them. In fact, many different NLP text analysis
systems have been made available over the past decade, incorporating a range of algorithms, so
the choices are rather daunting for an unassisted social scientist (e.g. Landauer et al. 1998). In
addition, some of the best systems lack conventional user interfaces and require customization
before being used on a particular project, so collaboration with a specialist in NLP may be
necessary.
        Years ago, developers of computer technology to handle language were optimistic they
could duplicate the nuances of human communication by building grammatical rules and
narrative structures into their programs. An example concerning religion was the remarkable
1980 paper, "A Formal Grammar of Expressiveness for Sacred Legends" (Dreizin 1980), which
asserted: "The best way for a researcher to present his knowledge of folklore is to demonstrate
the ability to construct at least rough approximations of folk stories." This means that a
researcher who really understands religious parables and stories should be able to write a
computer program to generate realistic ones automatically. NLP researchers have backed off
from this level of hubris in recent years, yet it remains an interesting goal for the future. Perhaps
unfortunately, the success of purely statistical approaches to analysis of text, often using the "bag
of words" approach that totally ignores grammar and narrative structure, has tended to
overshadow more sophisticated approaches.
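
        What the "bag of words" approach discards can be seen in a short Python sketch. The two
example sentences are invented for illustration, and cosine similarity is only one of several
measures that could be applied to the word counts.

```python
from collections import Counter
import math
import re

def bag_of_words(text):
    """Count lowercase word tokens, discarding word order and grammar."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented sentences with the same words but opposite meanings.
s1 = bag_of_words("the priest blessed the pilgrim")
s2 = bag_of_words("the pilgrim blessed the priest")
print(s1 == s2, cosine_similarity(s1, s2))
```

Because the two sentences contain exactly the same words, their bags are identical and the
method cannot tell who blessed whom, which is precisely the grammatical and narrative
information that the more sophisticated approaches try to retain.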
        Two more recent papers use religious examples to illustrate the possibility for research at
an intermediary level of linguistic sophistication, seeing analogies or correspondences in the
clustering of words generated by different religions. Tony Veale (2003: 137) pointed out the
value of thinking – and computing – in terms of analogies:

       Whereas a conventional thesaurus is indexed on a single probe word, analogical queries
       require both a source and a target term, to permit a mapping between two domains to be
       constructed. Thus, instead of a simple query "church" or "bible," one can pose much
       more specific queries like "Muslim church" (mosque), "Hindu bible" (the Vedas), "Celtic
       Ares" (Morrigan) or "Jewish German" (Yiddish). Semantic precision thus takes on a very
       different complexion when analogy is involved: though "mosque" and "synagogue" are
       not even near-synonyms, one can say that each forms a perfect correspondence with the
       other in the analogy of a "Muslim synagogue." Thus, one should differentiate between
       semantic precision (the basis of synonymy), and analogical precision (the basis of
       analogy and metaphor).

        Veale's empirical analyses identified analogies across the deities in ancient Greek,
Roman, Hindu, Norse, and Celtic religions, which of course may reflect their common Indo-
European cultural roots. A comparable study by Marx et al. (2002), clustering in terms of the
copresence of common keywords, identified themes addressed by both Buddhism and
Christianity, in such areas as scripture and theology, sin and suffering, characteristics of the
religion's founder, philosophical concepts, and customs and rituals. A second analysis, again
based just on statistical analysis of word usage, compared Buddhism with Islam.

6. Virtual Worlds

        A very large number of new kinds of communication over Internet express the
personalities or public personae of millions of people, which often include religious elements.
Early research is looking at social networking sites such as Facebook and MySpace, and the
recent Twitter fad is concentrating research efforts on the broader category of text messaging.
Perhaps the richest of these online social environments, although not yet the most popular, is
virtual worlds (Bainbridge 2007c). While definitions have not yet stabilized, these are generally
taken to be online computerized environments, visually similar to the real world, in which each
individual is represented by an avatar, and avatars can interact in complex and somewhat
creative ways. Note that avatar itself is an originally religious term.
        Roughly speaking, there are two kinds of virtual worlds, those that came from a tradition
of computer games and those that did not, although leading examples of both kinds are so
complex that the word "game" no longer really applies. Second Life is the best-known non-game
virtual world, and World of Warcraft is the best-known one marketed as a game. Some, such as
Entropia Universe, fall between the two categories. All of the leading ones provide wide scope
for social interaction, allow users to create (or at least assemble) their own virtual social groups
and objects, and have some educational potential.
        Much of the best research to date has been qualitative, often from a humanities or "games
studies" perspective, but quantitative studies have begun, such as the EverQuest II research
described above. I begin here with qualitative descriptions, partly to set the stage for quantitative
methods, but also because I see a major role for qualitative research in this area. As a new
cultural form, virtual worlds need to be studied as innovations in their own right, each with its
own distinctive characteristics. In addition, they often depict innovative, exotic, ancient, or
fantasy religions, whose theology and symbolism demand intensive qualitative analysis.
        Second Life is a tool-rich online environment in which users can create their own objects,
including full-scale architecture, and then experience and manipulate them through their avatars.
For example, Vassar College has created a large island in Second Life, duplicating part of its
actual campus but including a full-scale replica of the Sistine Chapel complete with copies of
Michelangelo's artwork across the arched ceiling. The implications for the social science of
religion are suggested in the agreement users' avatars are required to accept upon entry: "Visiting
the Sistine Chapel creates a deeply moving experience for many people for a variety of reasons,
including religious, artistic and educational. To preserve this same experience for those visiting
the Sistine Chapel in Second Life, we expect all visitors to conduct themselves here as they
would in real life: with respect for the environment as well as for those visiting the
        Figure 5 shows a much larger Second Life replica of a religious architectural site, the
Basilica of Saint Francis of Assisi. An avatar can walk through this huge assembly of buildings,
sit on a pew in the chapel to pray, and wander through authentic hallways and stairs. At a vast
Islamic site, Al Andalus Alhambra, one may take an actual 23-minute magic carpet ride over a
small city, dominated by a huge mosque. A somewhat more modest replica of Salzburg
Cathedral is hemmed in by a virtual business district, where one may purchase a variety of
virtual goods, as is also the case for a memorial for the World Trade Center where the twin
towers are as translucent as the proverbial ghosts.

                 Figure 5: The Basilica of Saint Francis of Assisi in Second Life

         It is my impression that many examples of religious architecture in Second Life were
educational or commercial design projects, perhaps with some cultural intent but not intended or
used for real religious services. However, a vast number of small groups meet regularly in
Second Life, some of them religious or spiritual. The Anglican Cathedral of Second Life on
Epiphany Island holds regular worship services, and regular meditations are held on Osho Island
which belongs to what used to be called the Rajneesh Movement. Clearly, interviews or
participant observation are appropriate ethnographic methods for studying these online religious
or spiritual groups. As Figure 5 demonstrates, photography is an appropriate method for
documenting virtual architecture, and it can also be used to document group activities as the next
figure shows.
         Figure 6 records a remarkable moment in May 2008, on a virtual mountaintop in World
of Warcraft, where participants in a major scientific conference are sharing what might truthfully
be called an ecstatic religious experience. I organized the conference, in collaboration with the
magazine Science, to show that a gamelike virtual world could be an environment for legitimate
scientific and scholarly communication. About 200 people attended, with as many as 120 in
attendance at each of three plenary sessions. Their avatars were together in virtual space but
their physical bodies were strewn from Australia through North America and Europe to Russia.
One result was a book of essays by many authors about a variety of social dimensions of virtual
worlds (Bainbridge in press). The mountaintop, near Crossroads in the Barrens on the Kalimdor
continent, holds a virtual memorial to Michel Koiter, a young artist who worked on World of
Warcraft, but who died in 2004 just months before it was released.11 The angel standing at the
peak of the hill is the same form as the ones that resurrect temporarily "dead" avatars at
graveyards in this virtual world (Klastrup 2006). The conference participants marched up the
hill, prayed or meditated briefly, then danced in joyous celebration of Koiter's creativity.

                 Figure 6: A Quasi-Religious Celebration in World of Warcraft

         Figure 7 was taken in the Temple of Mitra in the Aquilonian capital, Tarantia, in the
gamelike world, Age of Conan. The woman in the center is my character, Atlantea, who is just
in the process of casting a protective spell over herself. Above her head is a representation of her
spiritual essence, the naked upper body of a woman with the head of a snake. She is a Tempest
of Set, that is, a priestess of the serpent god Set who rules the weather and creates storms. The
man in the foreground is a priest of Mitra, the sun god to whom this temple is devoted. He is not
the avatar of another player, although avatars of users who are Priests of Mitra do often visit this
temple to receive or complete quest assignments. Rather, he is a non-player character (NPC),
operated by simple artificial intelligence (AI) programming. Already at the low levels of
intelligence of these AIs, "interviewing" them can be a valuable method for learning about the
culture. Three other human figures in the scene are statues. The largest of these represents the
former king of Aquilonia, Numedides, who was deposed and slaughtered by the Cimmerian
barbarian warrior, Conan, and now serves as a saint for the followers of Mitra.

                         Figure 7: The Temple of Mitra in Age of Conan

         This is a good point to dispel the myth that online virtual worlds are games for teenagers.
While many players of the games are young, the median age seems to be around 30, and we
would expect that to increase as the technology works its way through the lifespan. Second Life
attempts to keep teenagers in their own Teen Second Life world, and the main world includes an
extensive "red light district" where all kinds of virtual erotic experiences take place, and even
some users function as prostitutes. Age of Conan includes prostitutes, as well, although they all
appear to be NPCs, and many of the quests concern marital infidelity, although they do not
directly depict it. One must register with a credit card to enter Age of Conan, and swear that one
is an adult. Whereas public virtual sexual intercourse is the most "adult" thing seen in Second
Life, the adult content in Age of Conan tends to consist of severed heads, human entrails, and
gore-splattered landscapes.
         The Temple of Mitra scene also stresses the importance of the culture behind the
gamelike virtual worlds, often called the lore. I know that the statue depicts Numedides only
because the very first Conan story, published by Robert E. Howard in 1932, records Conan's
displeasure about the fact that the priests had sainted his predecessor.12 Other prominent virtual
worlds, such as The Matrix Online and Lord of the Rings Online, are similarly based on existing
cultural properties, but many, notably World of Warcraft, are entirely new. The emerging
Warcraft culture has been the focus of many recent novels, but there are also extensive online
digital libraries devoted to it. Most prominent are WoWWiki, which currently boasts fully
76,554 articles about the Warcraft Universe,13 and Wowhead, which among other things offers
detailed descriptions of fully 8,098 quests that can be undertaken in World of Warcraft.14 The
most popular virtual worlds have wikis, user forums with tens of thousands of posts, and myriads
of websites established by the thousands of guilds, clans, corporations or other user groups that
have been created around them. Research on virtual worlds, therefore, can take advantage of a
range of Internet resources that are actually outside these worlds but oriented toward them.
         The religious culture inside the theme-oriented or game-like worlds varies considerably.
The Matrix Online depicted a virtual city of the future that was frozen in the year 1999, so some
of the neighborhoods incidentally possessed churches, but they were not prominent in the action.
In contrast, the vast Amarr Empire in EVE Online is a theocracy that uses religion to dominate
other peoples and to justify harsh treatment of slaves because pain supposedly promotes their
spiritual development. World of Warcraft depicts a wide range of religions – both positively and
negatively – and many of the users' characters are priests, druids, or shamans. At the Cathedral
of the Light in Stormwind city, human characters can learn about a religion that lacks a god but
has an ethic promoting tenacity, respect and compassion. At the Temple of the Moon in
Darnassus city, they may learn about the loving moon goddess Elune and the need to protect
nature from technology. Most of the many religions of NPCs, to which users cannot belong, are
depicted negatively. Among the most extensive and interesting of these is the Scarlet Crusade,
which is devoted to destroying the Undead who linger between life and death, in the belief that
both life and death are good, but their mixture is an abomination.
         The three main competing religions in Age of Conan illustrate the ways in which modern
fantasies of ancient religions continue to fascinate people who are influenced by but not devout
adherents of Christianity. Computer games often depict religion, but it is almost never the
conventional kind – typically Asian, ancient, or fantasy – and players tend to be less religious
than the average (Bainbridge and Bainbridge 2007). Mitra was an actual Indo-Iranian deity,
although the architecture in Aquilonia follows Greek and Roman styles. Set was an actual
ancient Egyptian deity, and the Stygian nation (across the river Styx) that worships Set has many
of the qualities of ancient Egypt. The third chief god, Crom, belongs to the Cimmerian (i.e.
Celtic) barbarians, of whom Conan himself was the most prominent example. Age of Conan
presents Mitra as a god of ethics and hope who did not demand absolute loyalty. In contrast, Set
was an exclusive god who offered his followers unusual powers in return for strict loyalty. Crom
was an aloof creator god in a brutal society, who wanted nothing to do with humans after they
had been born and was contemptuous of any coward who would stoop so low as to pray for help.
People who spend much time in Age of Conan will come to take these concepts for granted, and
it is hard to say, at this early stage in research and in the development of online cultures, what the
consequences will be for their general attitude toward religion.
         For most purposes, a key research method in virtual worlds will be participant
observation, which immediately raises issues of the researcher's self presentation. In Second
Life, I use two avatars: (1) Bainbridge Thespian who makes things and participates in
conferences but does not do ethnography, and (2) Interviewer Wilber who does ethnography and
whose public information clearly announces that he is doing research. Clear ethical guidelines
for ethnography in virtual worlds do not yet exist, but many researchers believe that the real-
world anonymity of users provides significant protection to them in most current virtual worlds.
When the aim is to document the culture built into a theme-oriented virtual world like Age of
Conan, it is often necessary to create multiple characters, each with its own distinctive
characteristics to permit studying one dimension of the world. In this case, I created three
characters: the Tempest of Set depicted in Figure 7, a Priest of Mitra, and a Bear Shaman for the
barbarian Cimmerian culture. Complete ethnographic documentation of any of these worlds is a
major endeavor. For example, my book about World of Warcraft is based on running twenty-
two characters a total of 2,300 hours (Bainbridge in press).
         To this point in my ethnographic work inside more than a dozen virtual worlds, I have
taken about 40,000 "screenshot" pictures. These pictures are automatically saved in a particular
location on the computer's hard disk for each virtual world, automatically have the date and time
attached, and can easily be annotated, sorted into subfolders, and edited as desired. Indeed, the
first step in doing such work is figuring out how to take screenshots in the particular world,
which can be as simple as learning which key to press in most worlds or as complicated as
running separate software simultaneously as is required for Entropia Universe. Most of the time,
the edges of the screen contain much information displayed by the user interface, and most of my
screenshots include it, removing the interface only for special pictures such as the three included
here. Whether for data collection or to produce publishable pictures, much effort is often
required to take good screenshots, because one often needs to go through a complex series of
actions over minutes or even hours to get in the right position, and occasionally orchestrate
events so that the desired scene will play out.
         Screenshots are the only way in some worlds to record the text chat – conventionally in
the lower left corner of the screen – through which users primarily communicate. In the case of
Second Life, one may run word processing software simultaneously, and conveniently paste
interesting text from the chat directly into a text document. In World of Warcraft, entering
"/chatlog" into the text chat automatically saves the session as a text file. Text chat
remains the main medium of communication, but all major virtual worlds today include voice
communications, and some users prefer separate voice software, notably TeamSpeak,
Ventrilo, and Skype. Naturally, any speech that can be heard through headphones can be
recorded and transcribed later in the conventional manner. Second Life includes a module to
make video, and some participants in the 2008 World of Warcraft conference used separate
software to take sound videos of the events.
         To conclude this paper, I will illustrate the possibilities for quantitative research in these
worlds, using World of Warcraft. Many of the new online systems have the ability to collect
data, and the most advanced virtual worlds have both in-built search engines and the option to
extend the functionality of the software by writing mod (modification) or add-on software.
World of Warcraft allows users to write programs in a popular scripting language, Lua, so long
as the programs do not confer an unfair competitive advantage on computationally sophisticated
players. An extensive international modding community has grown up, consisting of amateurs
who write programs that run in conjunction with World of Warcraft, and who share and improve
their code (Kow and Nardi in press). Some of these programs are very useful for researchers.
         Researchers interested in a particular online communication system should explore its
capabilities, looking for opportunities to collect data in unexpected ways. For example, World of
Warcraft incorporates a number of tools to help players find others to team up with. I just
logged in as Tarkas, my Orc warrior, and imagined he was about to go on a quest that required a
healer who could protect him as he attacks enemies, and perhaps even resurrect him if he is
"killed." The best healers and resurrectors are priests. He entered "/who priest" into the text chat
system, and immediately a list appeared on the screen of 16 priests who were online at the
moment, using the same Internet server as Tarkas and belonging to the same faction within the
game. The output listed their names, their experience levels, their races, and their locations
within the virtual world, as well as the guild affiliations that are their most significant social
group memberships. Game researchers centered at the Palo Alto Research Center wrote an add-
on program using this /who feature to take automatic censuses of tens of thousands of characters
online repeatedly over a period of months (Ducheneaut et al. 2006, 2007). They were especially
interested in the changing status over time of both individual characters and guilds, and social
interactions were central to their research.
        Much data about virtual worlds can be obtained outside them, for example in the
extensive discussion forums in which players report their experiences and share advice. In
addition to WoWWiki and Wowhead, a digital library called the Armory displays several pages
of information for each of the millions of characters who have reached level 10 (out of 80) in the
experience ladder all characters must ascend. For some of my research, I used auxiliary software
called CensusPlus to draw samples of thousands of characters, all those that were online during a
particular day selected for sampling in the particular realms of World of Warcraft in which I had
characters. I then manually looked up subsamples of these characters in the Armory, saved their
pages as XML files, and wrote a computer program to parse those thousands of files and format
them for a spreadsheet, from which they were ported into a statistical analysis program. Here I
will offer a simpler example.
        Two of my characters belonged to one of the largest user guilds in all of World of
Warcraft, the Alea Iacta Est guild that was created in conjunction with a popular weekly podcast
devoted to this virtual world, The Instance. When I accessed the Armory, it offered extensive
data about fully 4,632 AIE members. It would be possible but very difficult to write a crawler
program that would automatically download all their data – difficult because of the complexity
of decisions about what data to enter where on the page to get back the desired information.
However, I discovered that the main page for the guild had some limited information about all
the members hidden in the XML source code. For example, here are the lines for my two
characters, Catullus the level 80 Blood Elf priest and Annihila the level 70 Undead death knight:

       <character achPoints="820" classId="5" genderId="0" level="80" name="Catullus"
              raceId="10" rank="6" url="r=Earthen+Ring&amp;n=Catullus"/>
       <character achPoints="250" classId="6" genderId="1" level="70" name="Annihila"
              raceId="5" rank="6" url="r=Earthen+Ring&amp;n=Annihila"/>

         One can quickly infer that classId="5" refers to a priest, and classId="6" to a death
knight. Gender 0 is male, and 1 is female. Race 10 is Blood Elf and race 5 is Undead. It was a
simple matter to copy the 4,632 lines of code into a word processor and make it search for every
quotation mark and replace it with ^t which inserts a tab – a total of 74,112 replacements but they
took just a few seconds. Saved as a plain text file, this mass of data could be opened directly into
a spreadsheet, where a few useless columns could quickly be deleted, making it a dataset ready
for statistical analysis. As Figure 8 shows, female characters are much more likely to be priests
than male characters are, 14.0 percent versus 7.6 percent, a finding replicated again and again in
World of Warcraft datasets.
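
         The same conversion can be done without a word processor. The following Python sketch
is an illustration rather than the procedure used for this paper: it wraps the two sample lines in a
hypothetical <members> root element (the real page structure is not shown here), reads the
attributes with the standard library XML parser, and writes tab-separated rows. It also checks the
count of replacements reported above, since each sample line contains 16 quotation marks.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The two sample lines from the text, wrapped in a hypothetical root element
# so that together they form a well-formed XML document.
source = """<members>
<character achPoints="820" classId="5" genderId="0" level="80" name="Catullus"
       raceId="10" rank="6" url="r=Earthen+Ring&amp;n=Catullus"/>
<character achPoints="250" classId="6" genderId="1" level="70" name="Annihila"
       raceId="5" rank="6" url="r=Earthen+Ring&amp;n=Annihila"/>
</members>"""

# Columns to keep; the url attribute is dropped as one of the "useless columns".
FIELDS = ["name", "genderId", "classId", "raceId", "level", "achPoints", "rank"]

def characters_to_rows(xml_text):
    """Return one dict of selected attributes per <character> element."""
    root = ET.fromstring(xml_text)
    return [{f: el.get(f) for f in FIELDS} for el in root.iter("character")]

rows = characters_to_rows(source)

# Write tab-separated text that a spreadsheet can open directly.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())

# Arithmetic check on the manual method: quotation marks per line times 4,632 lines.
quotes_per_line = source.count('"') // len(rows)
print(quotes_per_line * 4632)
```

Parsing the attributes by name has the advantage that the columns arrive already labeled, so
no cleanup of leftover attribute names is needed in the spreadsheet.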
         Note that female characters in AIE actually have earned more achievement points and
slightly more experience levels on average than male characters, so these virtual women are
certainly not half-hearted wimps. Ranks in guilds range from the guildmaster who is rank 1
down to new members who have ranks of 6 or more in AIE, and female characters are slightly
more likely than males to be guild officers. If this were the report of a research study, rather than
a methodological paper, we would immediately analyze the female nurturant role in the wider
culture, the statistically greater interest of females in religion despite their often lower status in
church hierarchies, and consider how those real-world factors may be reflected in the greater
likelihood of female World of Warcraft characters to be priests. But for present purposes it is
enough to point out that at relatively little effort we were able to assemble a dataset suitable for
statistical analysis in the light of theories relevant to the social science of religion.

                Figure 8: Gender Comparison of 4,632 Members of Alea Iacta Est

                                      Male        Female
Percent Priests                       7.6%         14.0%
Percent Death Knights                13.7%         10.2%
Percent Warriors                      9.4%          2.8%
Mean Achievement Points              533.5         591.4
Mean Experience Level                 48.7          49.7
Percent Guild Rank <6                 2.6%          4.0%
Cases                                 3321          1311
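
         A comparison like the first row of Figure 8 can be computed directly from rows of this
kind. The Python sketch below is illustrative only: the six records are hypothetical and are not
drawn from the AIE data, and the code relies on the inferences stated in the text that classId 5
denotes a priest, genderId 0 is male, and genderId 1 is female.

```python
from collections import defaultdict

PRIEST = "5"  # classId inferred in the text to mean priest

def percent_priests_by_gender(rows):
    """Return {genderId: percentage of that gender's characters who are priests}."""
    totals = defaultdict(int)
    priests = defaultdict(int)
    for row in rows:
        totals[row["genderId"]] += 1
        if row["classId"] == PRIEST:
            priests[row["genderId"]] += 1
    return {g: 100.0 * priests[g] / totals[g] for g in totals}

# Hypothetical records; classId "1" stands in for any other class.
rows = [
    {"genderId": "0", "classId": "5"},
    {"genderId": "0", "classId": "6"},
    {"genderId": "0", "classId": "1"},
    {"genderId": "0", "classId": "1"},
    {"genderId": "1", "classId": "5"},
    {"genderId": "1", "classId": "6"},
]

result = percent_priests_by_gender(rows)
print(result)
```

With real data, the same tally extends naturally to the other classes and to the means of
achievement points and experience levels.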


        Several of the methods described above allow researchers to do conventional social
science in a new setting. However, we also see the potential for transforming areas of social
science in profound ways. For example, using recommender systems and search engines to
cluster religious phenomena and map them conceptually is a form of twenty-first century cultural
anthropology. Internet offers direct access to much of modern culture. While one can perform
ethnography in these cultures, as I have done in a dozen virtual worlds, one may also do
quantitative studies of the dynamic structure of cultures and subcultures. Empirical studies will
have implications for theory. Given the importance of the concept of culture in studies of
religion, and the ability to examine how social interaction intertwines with cultural evolution,
these research methods could be exceedingly valuable in the future of social science of religion.


