Embed
Email

Submitted for your review random sampling filters in gawk

Document Sample

Shared by: hedongchenchen
Categories
Tags
Stats
views:
0
posted:
11/27/2011
language:
English
pages:
5
Submitted for your review: random sampling filters in gawk



Submitted for your review: random sampling filters

in gawk



Source: http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.awk/2008−01/msg00011.html







• From: "steven.huwig@xxxxxxxxx"

• Date: Wed, 2 Jan 2008 20:44:01 −0800 (PST)



After becoming a little curious about random sampling and how to do it

without knowing the number of records in advance, I found Waterman's

Algorithm R (by way of Knuth's TAOCP Vol. 2), and Vitter's Algorithm Z

at .



The nice features of these algorithms are that they take space

proportional to the sample size, and that they work without scanning

the entire file to determine how many records are present. I think

they are appropriate to be implemented as Unix stdin/stdout pipeline

filters.



I wrote both versions up in gawk and figured I'd post them somewhere

where people can poke holes in them, use them, or change them if they

like.



I know that there are a lot of useless comparisons because I used

patterns instead of explicit loops and getline for flow control, but

it's just so darn readable for me this way. :−)



On the other hand, the variable names are not readable. I took them

directly from the algorithm pseudocode to make checking the

implementation a bit easier.



My implementation preserves the input order when it outputs the random

sample. I think this behavior is more useful in shell, where you work

with things like uniq or join that require the inputs to have some

sort of order.



Here's a comparison timing on my system.



scully:~/bin steve$ time bzcat words.bz2 | vitter−sample.awk −v

n=10000 > /dev/null



real 0m54.739s

user 0m28.725s

sys 0m10.722s

scully:~/bin steve$ time bzcat words.bz2 | waterman−sample.awk −v

n=10000 > /dev/null



Submitted for your review: random sampling filters in gawk 1

Submitted for your review: random sampling filters in gawk





real 1m35.501s

user 0m50.242s

sys 0m19.479s

scully:~/bin steve$ bzcat words.bz2 | wc −l

11746850



Comments are welcome.



−− Steve





# Waterman's Algorithm R for random sampling

# by way of Knuth's The Art of Computer Programming, volume 2



BEGIN {

if (!n) {

print "Usage: sample.awk −v n=[size]"

exit

}

t=n

srand()

}



NR n {

t++

M = int(rand()*t) + 1

if (M "/dev/

stderr"

exit

}

# gawk needs a numeric sort function

# since it doesn't have one, zero−pad and sort alphabetically

pad = length(NR)

for (i in pool) {

new_index = sprintf("%0" pad "d", i)

newpool[new_index] = pool[i]

}



Submitted for your review: random sampling filters in gawk 2

Submitted for your review: random sampling filters in gawk

x = asorti(newpool, ordered)

for (i = 1; i n && t > thresh {

if (!W) {

W = exp(−log(rand())/n)

term = t − n + 1

}

while (1) {

U = rand()

X = t*(W − 1.0)

S = int(X)

lhs = exp(log(((U*(((t + 1)/term)^2))*(term + S))/(t + X))/n)

rhs = (((t + X)/(term + S))*term)/t

if (lhs numer_lim; numer−−) {

y = (y*numer)/denom

denom−−

}

W = exp(−log(rand())/n)

if (exp(log(y)/n) n && t V) {

S++

t++

num++

quot = (quot*num)/t

}

if (SKIP_RECORDS(S))

READ_NEXT_RECORD(int(n * rand()))

else

exit

next

}



NR "/dev/

stderr"

exit

}

# gawk needs a numeric sort function

# since it doesn't have one, zero−pad and sort alphabetically

pad = length(NR)

for (i in pool) {

new_index = sprintf("%0" pad "d", i)

newpool[new_index] = pool[i]

}

x = asorti(newpool, ordered)

for (i = 1; i 0

}



function READ_NEXT_RECORD(idx) {

rec = places[idx]

delete pool[rec]

pool[NR] = $0

places[idx] = NR

}



.









Submitted for your review: random sampling filters in gawk 5



Related docs
Other docs by hedongchenchen
spec_2_
Views: 0  |  Downloads: 0
Life Expectancy Table
Views: 0  |  Downloads: 0
sbda tender document
Views: 0  |  Downloads: 0
Momentum010111
Views: 0  |  Downloads: 0
PVK06_DesignAndCoding
Views: 0  |  Downloads: 0
80R4852 TAD-D
Views: 0  |  Downloads: 0
spring_06
Views: 0  |  Downloads: 0
The 451 Group
Views: 0  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!