ljmodeling

Shared by: Sam Chase
A Matter of Life or Death: Modeling Blog Mortality

Gina Venolia
Microsoft Research
One Microsoft Way
Redmond, WA 98007 USA
ginav@microsoft.com

ABSTRACT
This paper presents a simple model of blog life and death and fits it to raw data from the LiveJournal blog host site. The model reproduces much of the behavior observed in the raw data. It allows other quantities to be computed, such as the average rate of posting per blog, the proportion of dead blogs, and the “half life” of a population of blogs.

Author Keywords
Web log, weblog, blog, modeling.

ACM Classification Keywords
H5.3 Group and Organization Interfaces: Asynchronous interaction.

INTRODUCTION
A web log, or more popularly blog, is a website that is something like a diary. It is made up primarily of a reverse chronological list of entries, known as posts. Blogs are a relatively new Internet phenomenon, growing in popularity since 2000. Blogs are only now starting to be the subject of academic study [2, 3].

There are several services and tools that can be used for blogging [4]. One popular blog host is LiveJournal (www.livejournal.com), henceforth referred to as LJ. LJ has publicly released aggregate account statistics including the number of new blogs and new posts by day. This raw data provides an opportunity for data analysis to get a better understanding of the dynamics of blogs. In particular this paper presents a simple model for blog mortality that closely fits the raw data.

RAW DATA

Days (n)    Number of blogs updated in n days ending 11-Mar-03
1           218,190
7           521,376
30          759,434

Table 1. Raw data from LiveJournal.

The bulk of the data is the number of new blogs and new posts per day from http://www.livejournal.com/stats/stats.txt, summarized in Figure 1. While intermittent raw data extends back as far as 1997, the analysis is done only on data since 1-Mar-99, when the data began to stabilize. The graphs show activity from 1-Jan-00, when activity started to take off.
The last reported post data is on 11-Mar-03; the last reported new blog data is 31-Oct-03.

This paper uses three data tables from the LJ data, a subset of the available data set. The first is a brief table from http://www.livejournal.com/stats.bml that gives the number of accounts updated in a window of days, reproduced in Table 1.

Figure 1. Raw number of new blogs and new posts per day from LiveJournal data set (rescaled to fit on a common axis). (Series: Blogs per Day (1,000's), Posts per Day (100,000's); x-axis: 1-Jan-00 through 1-Jan-04.)

Figure 1 shows a noticeable weekly fluctuation. Averaging using a sliding window of seven data points produces smoother data sets, shown in Figure 2. (In all the subsequent analysis these smoothed values for the daily posts are used, while the raw daily new blog values are used for reasons that will be clear later on.)

Figure 2 shows an anomaly in the number of new blogs in the middle of 2001. Apparently the LJ service was experiencing severe growing pains during this time. In response to the exponential growth and subsequent failure, the LJ team changed from an open registration model to a more restrictive one. This period is blocked out of subsequent analysis as noted below.

(Figure 2 legend: Smoothed Blogs per Day (1,000's), Smoothed Posts per Day (100,000's).)

Let d (for decay) be 1 - m, where m is the probability that a blog dies in any given day; d is then the probability that a blog will live through the night. So in any given day the number of active blogs is the number of active blogs for the previous day multiplied by d plus the number of blogs created that day. Figure 4 shows the estimated number of active blogs under varying values of d. The next section is devoted to coming up with a value for d that matches the observed data. (Decay rates below 99% have been examined but are not shown because the resulting models don’t come remotely close to matching the observed behavior.)
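The smoothing and the active-blog recurrence described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names, the trailing (rather than centered) seven-point window, and the starting count of zero active blogs are all assumptions, since the paper does not specify them.

```python
def moving_average(values, window=7):
    """Trailing moving average (an assumed form of the seven-point smoothing).

    Early entries, where fewer than `window` points exist, average what is available.
    """
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the point that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def estimate_active(new_blogs, d):
    """Apply active[t] = d * active[t-1] + new_blogs[t], starting from zero (assumed)."""
    active = 0.0
    series = []
    for created in new_blogs:
        active = active * d + created
        series.append(active)
    return series


if __name__ == "__main__":
    # Hypothetical daily counts of new blogs, for illustration only.
    new_blogs = [100, 50, 0, 25]
    for d in (0.99, 0.998, 1.0):
        print(d, estimate_active(new_blogs, d))
```

With d = 1.0 no blog ever dies, so the estimate reduces to the cumulative number of blogs created.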
Figure 2. Number of new blogs and new posts smoothed using a moving average of seven entries to remove weekly fluctuations.

The weekly periodicity is summarized in Figure 3. The lows on Friday and Saturday suggest that blogging is being done in the evenings, and is preempted by social activities.

Figure 3. Blog and post creation show a weekly cycle. (Series: New Blogs, Posts; x-axis: Mon through Sun.)

MODELING BLOG MORTALITY
In trying to answer even the most basic questions about blogging behavior from this data, such as the average number of posts per blog per day, one immediately wishes to have an estimate of the number of blogs that are being actively maintained by their owners. To make an estimate, the notion of active needs a clear definition. A very simplistic model for blog life and death might follow these rules:

• When a blog is created, it is active, and may receive posts.
• At some point the blog may die after which it is forever inactive, and does not receive further posts.
• A blog has some probability m (for mortality) of dying in any given day.

Figure 4. The number of estimated active blogs at various decay rates. Note that the estimation function is a form of smoothing filter, thus the number of new blogs per day can be used directly, without smoothing for weekly fluctuation. (Curves: decay rates 99.00% through 100.00% in steps of 0.10%; x-axis: 1-Jan-00 through 1-Jan-04.)

FINDING THE DECAY RATE
This section describes three means of estimating the decay rate. The estimated number of active blogs can be divided into the number of posts per day, as shown in Figure 5.

Figure 5. Posts per day per estimated active blog at various decay rates. (Curves: decay rates 99.00% through 100.00% in steps of 0.10%.)
Let us assume that the rate of posting to any given blog is constant. The mean posts per day per blog can be computed. (For this and all subsequent analysis only data 1-Jul-00 through 1-Jul-01 and 1-Jan-02 through the end of the set are used, avoiding anomalies in early 2000 and mid-2001.) The flatness of the posts per day per blog as measured by its standard deviation will be used as the first (and least precise) means of estimating the decay rate. A low value indicates a good fit.

Multiplying the mean posts per day per estimated active blog by the estimated number of active blogs (Figure 4) yields an estimate of the expected number of posts, shown in Figure 6. The fit of the estimated to the actual total posts per day will be used as the second means of estimating the decay rate. The fit is measured using a two-tailed, paired T-test (n = 801). A high value indicates a good fit.

Figure 6. Estimated and actual number of total posts per day at various decay rates. (Curves: Actual, and decay rates 99.00% through 100.00% in steps of 0.10%.)

The fit of the estimated to the actual number of blogs with recent posts will be used as the third and final means of estimating the decay rate. The fit is measured using a two-tailed, paired T-test (n = 3). A high value indicates a good fit.

Figure 8 shows the three measures of fit (flatness of posts per day per blog, fit of estimated to actual total posts per day, and fit of estimated to actual blogs with recent posts) across a range of decay rates. The two peaks of the T-tests give two slightly different estimates of the value of the decay rate for this data set, as shown in Table 2. The standard deviation is very close to its minimum at both decay rates, validating both. For the remainder of the paper, both peak decay rates will be used. Given these decay rates, the mean interval between posts is 3.0 days, differing from other reports of 5.0 [2] and 14 [1].
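The first measure of fit can be illustrated with a small Python sketch. For each candidate decay rate it rebuilds the active-blog estimate, divides posts per day by it, and reports the mean and standard deviation of that ratio. The function names, the use of the population standard deviation, and the synthetic data are assumptions made for illustration; the paper does not say how the deviation was normalised, and the T-test measures are not reproduced here.

```python
from statistics import mean, pstdev


def estimate_active(new_blogs, d):
    """active[t] = d * active[t-1] + new_blogs[t], starting from zero (assumed)."""
    active = 0.0
    series = []
    for created in new_blogs:
        active = active * d + created
        series.append(active)
    return series


def flatness(posts, new_blogs, d):
    """Return (mean, standard deviation) of posts per day per estimated active blog."""
    active = estimate_active(new_blogs, d)
    ratio = [p / a for p, a in zip(posts, active) if a > 0]
    return mean(ratio), pstdev(ratio)


def flattest_decay(posts, new_blogs, candidates):
    """Pick the candidate decay rate whose ratio series has the lowest deviation."""
    return min(candidates, key=lambda d: flatness(posts, new_blogs, d)[1])


if __name__ == "__main__":
    # Synthetic data: blogs that really decay at 0.998 and post 0.3 times per day.
    true_d, rate = 0.998, 0.3
    new_blogs = [1000 + 10 * (t % 7) for t in range(400)]
    posts = [rate * a for a in estimate_active(new_blogs, true_d)]
    print(flattest_decay(posts, new_blogs, [0.990, 0.994, 0.998, 1.000]))
```

The mean of the ratio is the mean posts per day per blog; its reciprocal is the mean interval between posts, which for rates of roughly 0.32 to 0.34 posts per day is about 3 days.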
Figure 8. Three measures of the fit of the model, indicating that the actual decay rate is between 99.80% and 99.83%. (x-axis: Decay Rate, 99.0% through 100.0%.)

Decay rate                                    99.80%     99.82%
Flatness of posts per day per blog            0.2858     0.2841
Fit of est. to actual total posts per day     0.9900     0.0000
Fit of est. to actual blogs w/ recent posts   0.6250     0.9990
Mean Posts per Day                            0.3352     0.3237
Est. Active Blogs at 31-Oct-03                694,979    737,400

Table 2. Various measures of the model for the two best decay rates, i.e. the peaks in Figure 8.

Figure 9 and Figure 10 compare the two best decay rate estimates to the actual data over time. Note the slight difference between the resulting curves. Despite the high value of the T-test for fit of estimated to actual number of blogs with recent posts, Figure 10 shows significant errors in magnitude. The RMS error between the estimated and actual values is 49,115 for d = 99.80% and 45,915 for 99.82%, or over 10% of the magnitude of the values.

The estimated number of active blogs can be taken as a fraction of the total blogs created to date, as shown in Figure 11. As of 31-Oct-03 the percentage of estimated active to total blogs is 49.1% and 52.1% for the two decay rates, respectively, differing from other reports of 34.0% [1].

Figure 7. Estimated and actual number of blogs updated in the n days before 11-Mar-03 at various decay rates. (Axes: Number of Blogs versus Number of Days in Window Ending 11-Mar-03; curves: Actual, and decay rates 99.00% through 100.00% in steps of 0.10%.)

The mean posts per day per blog can be combined with the estimated number of active blogs to estimate the number of blogs updated in any window of consecutive days. The Poisson function p(0, mean_d), i.e. the Poisson probability of zero posts when the mean posts per day per blog at decay rate d is mean_d, gives the probability that a particular blog will not be updated in a given day.
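The window estimate built on this Poisson probability can be sketched as follows. The sketch takes the probability of no post in a day to be exp(-mean), raises it to the number of days in the window, and scales the complement by the number of active blogs. The function name and the active-blog count in the example are hypothetical; the mean of 0.3352 posts per day is the Table 2 value for d = 99.80%.

```python
import math


def blogs_updated_in_window(active_blogs, mean_posts_per_day, days):
    """Estimate how many active blogs received at least one post in `days` days."""
    # Poisson probability of zero posts in one day: p(0, mean) = exp(-mean).
    p_no_post_one_day = math.exp(-mean_posts_per_day)
    # Probability of zero posts on every day of the window.
    p_no_post_window = p_no_post_one_day ** days
    return active_blogs * (1.0 - p_no_post_window)


if __name__ == "__main__":
    active_blogs = 700_000  # hypothetical count, for illustration only
    for days in (1, 7, 30):
        print(days, round(blogs_updated_in_window(active_blogs, 0.3352, days)))
```

Because the no-post probability shrinks geometrically with the window length, the estimate rises quickly toward the total number of active blogs as the window grows.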
By raising this probability to the power of the number of days in the window, subtracting the result from one, and then multiplying by the estimated number of active blogs, the number of blogs that have been updated within the window can be estimated, as shown in Figure 7. The fit of the estimated to the actual number of blogs with recent posts provides the third means of estimating the decay rate.

(Figure 9 legend: Actual, Decay = 99.80%, Decay = 99.82%; y-axis: Posts per Day; x-axis: 1-Jan-00 through 1-Jan-04.)

CONCLUSIONS
This work assumes a simple model of blog life and death, a constant mortality rate over the lifespan of the blog and over calendar time, and a constant rate of posting per blog over the lifespan of the blog and over calendar time. No doubt all of these assumptions are flawed. Yet this simple model accounts for a substantial portion of the variability in observed blogging behavior.

This work only begins to characterize what’s happening in blogs. There are many ways that it could be extended and refined. Most notably the model needs to be extended to explain the prediction errors shown in Figure 10; in particular this may call into question the assumption that the mortality rate is flat over the lifespan of the blog.

Further analysis could be done from this data. For example it may be possible to model the distribution of ages of active blogs over time. A better narrative and statistical understanding of LJ through the start-up phase of 2000-01 would help ground this work. In particular the change in subscription model that was imposed in mid-2001 would provide an interesting point of comparison.

The scope of this work could be extended. While LJ hasn’t published more recent data, it is there to be mined; complete data from 2003 would provide two complete years of data for examining cyclic patterns. This work is based on broad population statistics; the fine-grained blog and post data is extant and could be examined more closely.
This work looks at the behavior for a specific host; there is an opportunity for comparing behavior across and among hosts. This work looks at behavior without regard to the type of blog, while there may be interesting variations by genre.

While dead blogs are plentiful, live blogs appear to be plentiful, increasing in number, active and surprisingly durable.

REFERENCES
1. Henning, J. (2003). The blogging iceberg. Perseus Development Corporation. http://www.perseus.com/blogsurvey/thebloggingiceberg.html
2. Herring, S. C., Scheidt, L. A., Bonus, S., & Wright, E. (2004). Bridging the gap: A genre analysis of weblogs. In Proc. HICSS-37. IEEE Press.
3. Kumar, R., Novak, J., Raghavan, P. and Tomkins, A. (2003). On the bursty evolution of blog space. Proc. Conf. on WWW, p. 568-576. ACM Press.
4. Rubenking, N. Blog tools. PC Magazine, December 30, 2003. http://www.pcmag.com/print_article/0,3048,a=113569,00.asp

Figure 9. Estimated and actual total posts per day at the two best decay rates.

Figure 10. Estimated and actual number of blogs updated during a window of dates at the two best decay rates. (Axes: Number of Blogs versus Number of Days in Window Ending 11-Mar-03; curves: Actual, 99.80%, 99.82%.)

Figure 11. Proportion of active blogs to total blogs. (Curves: Decay = 99.80%, Decay = 99.82%; x-axis: 1-Jan-00 through 1-Jan-04.)

One interesting way to look at the decay rate is to compute the “half life” of a population of blogs, i.e. the time it takes for half of them to die. The two decay rates here suggest that the half life of blogs on LJ is about a year (351 and 392 days, respectively).
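The half life follows directly from the decay rate: it is the number of days n for which d^n = 0.5, so n = ln(0.5) / ln(d). The Python sketch below applies this to the two decay rates as printed in Table 2; the function name is illustrative. With these rounded rates the formula gives roughly 346 and 385 days, somewhat below the 351 and 392 days quoted above, which presumably reflects rounding of the published rates (an assumption, since the unrounded values are not given).

```python
import math


def half_life_days(d):
    """Days until half of a cohort is expected to have died, given daily survival d."""
    if not 0.0 < d < 1.0:
        raise ValueError("d must be strictly between 0 and 1")
    # Solve d ** n == 0.5 for n.
    return math.log(0.5) / math.log(d)


if __name__ == "__main__":
    for d in (0.9980, 0.9982):  # decay rates as printed in Table 2
        print(f"d = {d:.4f}: half life of about {half_life_days(d):.0f} days")
```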
