Submitted for your review: random sampling filters in gawk
Source: http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.awk/2008-01/msg00011.html
• From: "steven.huwig@xxxxxxxxx"
• Date: Wed, 2 Jan 2008 20:44:01 -0800 (PST)
After becoming a little curious about random sampling and how to do it
without knowing the number of records in advance, I found Waterman's
Algorithm R (by way of Knuth's TAOCP Vol. 2), and Vitter's Algorithm Z
at .
The nice features of these algorithms are that they take space
proportional to the sample size, and that they work without scanning
the entire file to determine how many records are present. I think
they are appropriate to be implemented as Unix stdin/stdout pipeline
filters.
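For readers who want the idea without the awk plumbing, here is a minimal Python sketch of the single-pass reservoir approach described above (Algorithm R). The function name, the `rng` parameter, and the error raised on short input are illustrative choices, not part of the posted scripts.

```python
import random

def reservoir_sample(records, n, rng=random):
    """Return n items chosen uniformly from an iterable of unknown length.

    Memory use is proportional to n, and the input is read exactly once.
    """
    pool = []
    for t, rec in enumerate(records, start=1):
        if t <= n:
            pool.append(rec)       # the first n records fill the reservoir
        else:
            m = rng.randrange(t)   # uniform integer in 0 .. t-1
            if m < n:
                pool[m] = rec      # record t is kept with probability n/t
    if len(pool) < n:
        raise ValueError("fewer than n records in input")
    return pool
```

Something like `reservoir_sample(sys.stdin, 10000)` would play the same role as the pipeline filter, assuming `sys` is imported and one record per line.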
I wrote both versions up in gawk and figured I'd post them somewhere
where people can poke holes in them, use them, or change them if they
like.
I know that there are a lot of useless comparisons because I used
patterns instead of explicit loops and getline for flow control, but
it's just so darn readable for me this way. :-)
On the other hand, the variable names are not readable. I took them
directly from the algorithm pseudocode to make checking the
implementation a bit easier.
My implementation preserves the input order when it outputs the random
sample. I think this behavior is more useful in shell, where you work
with things like uniq or join that require the inputs to have some
sort of order.
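The order-preserving output can be sketched in Python as follows, assuming the reservoir is kept as a mapping from record number to record (as the awk array `pool` is). The second function mimics the zero-padding workaround the scripts use, since gawk's asorti compares indices as strings by default. Both function names are hypothetical.

```python
def in_input_order(pool):
    """pool maps record number -> record; emit records in input order."""
    return [pool[k] for k in sorted(pool)]

def in_input_order_padded(pool, total):
    """Same result using string keys, zero-padded so that string order
    agrees with numeric order (total is the number of records read)."""
    width = len(str(total))
    padded = {str(k).zfill(width): v for k, v in pool.items()}
    return [padded[k] for k in sorted(padded)]
```

Without the padding, string keys such as "10" would sort before "2", which is why the scripts pad to the width of NR.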
Here are comparison timings on my system:
scully:~/bin steve$ time bzcat words.bz2 | vitter-sample.awk -v n=10000 > /dev/null
real 0m54.739s
user 0m28.725s
sys 0m10.722s
scully:~/bin steve$ time bzcat words.bz2 | waterman-sample.awk -v n=10000 > /dev/null
real 1m35.501s
user 0m50.242s
sys 0m19.479s
scully:~/bin steve$ bzcat words.bz2 | wc -l
11746850
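A plausible reason for the gap in these timings is that Algorithm R draws a random number for every record, while Vitter's approach draws one per replacement and computes how many records to skip in between. The Python sketch below uses the simpler skip computation from Vitter's paper (Algorithm X, a linear search), not the full Algorithm Z. The function names, the `rng` parameter, and the sentinel are illustrative assumptions.

```python
import random
from itertools import islice

def skip_count(n, t, rng=random):
    """How many records to pass over once t records (t >= n) have been seen."""
    v = rng.random()
    s = 0
    t += 1
    quot = (t - n) / t             # chance that record t is not selected
    while quot > v:
        s += 1
        t += 1
        quot *= (t - n) / t        # chance that the next record is also skipped
    return s

def skip_sample(records, n, rng=random):
    """Reservoir sample of size n that jumps over records instead of testing each."""
    it = iter(records)
    pool = list(islice(it, n))
    if len(pool) < n:
        raise ValueError("fewer than n records in input")
    end = object()                 # sentinel marking exhausted input
    t = n
    while True:
        s = skip_count(n, t, rng)
        rec = next(islice(it, s, None), end)   # discard s records, take the next
        if rec is end:
            return pool
        pool[rng.randrange(n)] = rec           # evict a random reservoir slot
        t += s + 1
```

Each pass through the outer loop consumes `s + 1` records with a single call to `rng.random()`, which is where the savings over the per-record test would come from.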
Comments are welcome.
-- Steve
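# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2
BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    t = n
    srand()
}
# The first n records fill the reservoir.
NR <= n {
    pool[NR] = $0
    places[NR] = NR
    next
}
# Record t is kept with probability n/t and evicts a random slot.
NR > n {
    t++
    M = int(rand()*t) + 1
    if (M <= n)
        READ_NEXT_RECORD(M)
}
END {
    if (NR < n) {
        # The original message text was lost in extraction; this wording is a stand-in.
        print "sample.awk: fewer than n records in input" > "/dev/stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }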
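    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        print newpool[ordered[i]]
}
function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR
}
# Vitter's Algorithm Z for random sampling
# NOTE: this header and the BEGIN block did not survive extraction; they are
# reconstructed by analogy with the Waterman script above.
BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    t = n
    # Assumed value: Vitter's paper suggests a cutover at roughly 22*n records;
    # the setting used in the original script is unknown.
    thresh = 22 * n
    srand()
}
# Algorithm Z proper, used once more than thresh records have been seen.
NR > n && t > thresh {
    if (!W) {
        W = exp(-log(rand())/n)
        term = t - n + 1
    }
    while (1) {
        U = rand()
        X = t*(W - 1.0)
        S = int(X)
        lhs = exp(log(((U*(((t + 1)/term)^2))*(term + S))/(t + X))/n)
        rhs = (((t + X)/(term + S))*term)/t
        if (lhs <= rhs) { W = rhs/lhs; break }
        y = (((U*(t + 1))/term)*(t + S + 1))/(t + X)
        if (n < S) { denom = t; numer_lim = term + S }
        else { denom = t - n + S; numer_lim = t + 1 }
        # inclusive bound, matching "downto" in the pseudocode
        for (numer = t + S; numer >= numer_lim; numer--) {
            y = (y*numer)/denom
            denom--
        }
        W = exp(-log(rand())/n)
        if (exp(log(y)/n) <= (t + X)/t) break
    }
    if (SKIP_RECORDS(S))
        READ_NEXT_RECORD(int(n * rand()))
    else
        exit
    t += S + 1; term += S + 1
    next
}
# Algorithm X: linear search for the skip count while t is still small.
NR > n && t <= thresh {
    V = rand()
    S = 0
    t++
    num = t - n
    quot = num/t
    while (quot > V) {
        S++
        t++
        num++
        quot = (quot*num)/t
    }
    if (SKIP_RECORDS(S))
        READ_NEXT_RECORD(int(n * rand()))
    else
        exit
    next
}
# The first n records fill the reservoir (slots are 0-based in this script).
NR <= n {
    pool[NR] = $0
    places[NR - 1] = NR
    next
}
END {
    if (NR < n) {
        # Stand-in wording; the original message text was lost in extraction.
        print "sample.awk: fewer than n records in input" > "/dev/stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }
    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        print newpool[ordered[i]]
}
# Skip count records with getline; returns false if the input ran out.
# (Body reconstructed; only the final "> 0" test survived extraction.)
function SKIP_RECORDS(count,    r, k) {
    r = 1
    for (k = 0; k < count && r > 0; k++)
        r = getline
    return r > 0
}
function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR
}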