Advanced HBase
Navteq Architect Summit, December 2010
Lars George
lars@cloudera.com

About Me
• Software Engineer
• Cloudera Solution Architect
• Formerly CTO of WorldLingo
• Scalable system aficionado
• Working with HBase since end of 2007
• Apache HBase Committer (larsgeorge@apache.org)
• European HBase Ambassador (self-proclaimed)

Outline
• Why HBase?
• MapReduce with HBase
• Integration with Indexing
• Advanced Techniques

Why Hadoop/HBase?
• Datasets are constantly growing and intake soars
  – Yahoo! has >82PB and >25k machines
  – Facebook adds 15TB per day, >36PB raw data, >2200 machines
  – Are you “throwing” data away today?
• Traditional databases are expensive to scale and inherently difficult to distribute
• Commodity hardware is cheap and powerful
  – $1000 buys you 4-8 cores/4GB/1TB
  – 500GB 15k RPM SAS nearly $500
• Need for random access and batch processing
  – Hadoop only supports batch/streaming

History of Hadoop/HBase
• Google solved its scalability problems
  – “The Google File System” published October 2003
    • Hadoop DFS
  – “MapReduce: Simplified Data Processing on Large Clusters” published December 2004
    • Hadoop MapReduce
  – “BigTable: A Distributed Storage System for Structured Data” published November 2006
    • HBase

Hadoop Introduction
• Two main components
  – Hadoop Distributed File System (HDFS)
    • A scalable, fault-tolerant, high performance distributed file system capable of running on commodity hardware
  – Hadoop MapReduce
    • Software framework for distributed computation
• Significant adoption
  – Used in production in hundreds of organizations
  – Primary contributors: Yahoo!, Facebook, Cloudera

HDFS: Hadoop Distributed File System
• Reliably store petabytes of replicated data across thousands of nodes
  – Data divided into 64MB blocks, each block replicated three times
• Master/Slave architecture
  – Master NameNode contains block locations
  – Slave DataNode manages blocks on local file system
• Built on commodity hardware
  – No 15k RPM disks or RAID required (nor wanted!)

HDFS Example
• Store 1TB flat text file on 10 node cluster
  – Can use Java API or command line
    ./hadoop dfs -put ./srcFile /destFile
  – File split into 64MB blocks (16,384 total)
  – Each block sent to three nodes (49,152 total, 3TB)
  – Has notion of racks to ensure replication across distinct clusters/geographic locations
  – Built-in checksumming (CRC)

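The block and replica counts on this slide follow from simple division. A minimal Java sketch of that arithmetic, assuming binary units (1TB = 2^40 bytes, 64MB = 2^26 bytes); the class and method names are made up for illustration and are not part of Hadoop:

```java
// Back-of-the-envelope HDFS sizing. Names are illustrative, not a Hadoop API.
class HdfsBlockMath {
    static final long MB = 1024L * 1024L;
    static final long TB = 1024L * 1024L * MB;

    // Number of blocks needed for a file, rounding up for a partial last block.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long blocks = blockCount(1 * TB, 64 * MB);
        long replicas = blocks * 3;            // replication factor of three
        long rawTerabytes = (1 * TB * 3) / TB; // raw storage used across the cluster
        System.out.println(blocks);            // 16384
        System.out.println(replicas);          // 49152
        System.out.println(rawTerabytes);      // 3
    }
}
```

The three printed values correspond to the 16,384 blocks, 49,152 block replicas and 3TB of raw storage quoted above.
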
MapReduce
• Distributed programming model to reliably process petabytes of data using its locality
  – Built-in bindings for Java and C
  – Can be used with any language via Hadoop Streaming
• Inspired by map and reduce functions in functional programming
Input -> Map() -> Copy/Sort -> Reduce() -> Output

MapReduce Example
• Perform “word count” on 1TB file in HDFS
  – Map task launched for each block of file
  – Within each task, Map function called for each line: Map(LineNumber, LineString)
    • For each word in LineString -> Output(Word, 1)
  – Map output is sorted, grouped and copied to reducer
  – Reduce(Word, List) called for each word
    • Output(Word, Length(List))
  – Final output contains total count for each word

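The word count flow above can be imitated in a single process to make the three phases visible. This is only a sketch in plain Java with no Hadoop classes involved; the class name and the use of a TreeMap as a stand-in for the framework's sort and group phase are assumptions made for illustration:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of map -> sort/group -> reduce. Not Hadoop code.
class WordCountSketch {
    // Map step: emit (word, 1) for every word in one line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce step: called once per word; the count is the length of the list of ones.
    static int reduce(String word, List<Integer> ones) {
        return ones.size();
    }

    static Map<String, Integer> run(List<String> lines) {
        // The TreeMap stands in for the sort, group and copy phase.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("to be or not");
        lines.add("to be");
        System.out.println(run(lines)); // {be=2, not=1, or=1, to=2}
    }
}
```

In a real job the map calls run in parallel on the nodes holding the blocks, and the grouping happens across the network rather than in one map.
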
Hadoop…
• … is designed to store and stream extremely large datasets in batch
• … is not intended for realtime querying
• … does not support random access
• … does not handle billions of small files well
  – Small meaning below the default block size of 64MB
  – Keeps “inodes” in memory on master
• … does not support structured data any better than unstructured or complex data
That is why we have HBase!

Why HBase?
• Question: Why HBase and not <put-your-favorite-nosql-solution-here>?
• What else is there?
  – Key/value stores
  – Document-oriented stores
  – Column-oriented stores
  – Graph-oriented stores
• Features to ask for
  – In memory or persistent?
  – Strict or eventual consistency?
  – Distributed or single machine (or afterthought)?
  – Designed for read and/or write speeds?
  – How does it scale? (if that is what you need)

Key/Value Stores
• Choices (a small selection)
  – Memcached
  – Tokyo Cabinet, MemcacheDB, Membase, Redis
  – Voldemort, Dynomite, Scalaris
  – Dynamo
  – Berkeley DB
• Pros
  – Used as caches
  – Simple APIs
  – Fast
• Cons
  – Keys must be known (or recomputed)
  – Scale only with manual intervention (consistent hashing etc.)
  – Cannot represent structured data

Document Stores
• More choices
  – MongoDB
  – CouchDB
• Pros
  – Structured data supported
  – Schema free
  – Supports changes to documents without reconfiguration
  – May support secondary indexes and/or search
• Cons
  – Everything is stored in the same place, does not work well with heterogeneous payloads
  – Scalability is either not proven or similar to RDBMS models
  – Not well integrated with MapReduce (no block loads or locality advantages)

Column-Oriented Stores
• Hybrid architectures
  – HBase, BigTable
  – Cassandra
  – Vertica, C-Store
• Pros
  – Allow access to only relevant data
• Cons
  – Limit functionality to fit model

Which One To Choose?
• Key/value stores
  – Caches
  – Simple data
  – Need for speed
• Document stores
  – Evolving schemas
  – Higher level document related features
• Column-oriented stores
  – Scalability
  – Mixture of payloads

Which One To Choose? (cont.)
• In Memory or On-Disk
  – Cache or Database
• Strict consistency
  – Easy to handle on Application level
  – Content-management systems, banking etc.
• Eventual consistency
  – Higher availability but may read stale data
  – Deal with conflict resolution and repairs in your code
  – Shopping carts, Gaming

What is HBase?
• Distributed
• Column-Oriented
• Multi-Dimensional
• High-Availability (CAP anyone?)
• High-Performance
• Storage System

Project Goals
Billions of Rows * Millions of Columns * Thousands of Versions
Petabytes across thousands of commodity servers

HBase is not…
• A SQL Database
  – No joins, no query engine, no types, no SQL
  – Transactions and secondary indexes only as add-ons but immature
• A drop-in replacement for your RDBMS
• You must be OK with RDBMS anti-schema
  – Denormalized data
  – Wide and sparsely populated tables
  – Just say “no” to your inner DBA
Keyword: Impedance Match

HBase Architecture
• Table is made up of any number of regions
• Region is specified by its startKey and endKey
  – Empty table: (Table, NULL, NULL)
  – Two-region table: (Table, NULL, “com.cloudera.www”) and (Table, “com.cloudera.www”, NULL)
• Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop

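Because regions are defined by sorted, non-overlapping key ranges, finding the region for a row amounts to a floor lookup on the start keys. The following is a toy sketch of that idea only; it uses Java strings instead of byte[] keys, the empty string in place of the NULL start key, and invented server names, and it is not how the HBase client is actually implemented:

```java
import java.util.TreeMap;

// Toy region directory mapping each region's start key to a server name.
// Assumes the first region (start key "") has always been registered.
class RegionLookupSketch {
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    // The owning region is the one with the greatest start key <= row key.
    String serverFor(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch();
        table.addRegion("", "server-a");                 // (Table, NULL, "com.cloudera.www")
        table.addRegion("com.cloudera.www", "server-b"); // (Table, "com.cloudera.www", NULL)
        System.out.println(table.serverFor("com.apache.hbase"));  // server-a
        System.out.println(table.serverFor("org.apache.hadoop")); // server-b
    }
}
```

A row key equal to a region's start key belongs to that region, since the start key is inclusive and the end key exclusive.
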
HBase Architecture (cont.)
• Two types of HBase nodes: Master and RegionServer
• Special tables -ROOT- and .META. store schema information and region locations
• Master server responsible for RegionServer monitoring as well as assignment and load balancing of regions
• Uses ZooKeeper as its distributed coordination service
  – Manages Master election and server availability

HBase Tables
• Tables are sorted by Row in lexicographical order
• Table schema only defines its column families
  – Each family consists of any number of columns
  – Each column consists of any number of versions
  – Columns only exist when inserted, NULLs are free
  – Columns within a family are sorted and stored together
  – Everything except table names is byte[]
(Table, Row, Family:Column, Timestamp) -> Value

HBase Table as Data Structures
SortedMap(
  RowKey, List(
    SortedMap(
      Column, List(
        Value, Timestamp
      )
    )
  )
)
SortedMap(RowKey, List(SortedMap(Column, List(Value, Timestamp))))

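One way to make the nested structure above concrete is a map of maps of maps in plain Java. This is a simplified sketch, not HBase code: it uses String keys where HBase uses byte[], collapses the family and qualifier into one "family:column" string, and keeps versions in a map sorted newest first, all of which are choices made here for illustration:

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of row -> column -> timestamp -> value, newest version first.
class TableModelSketch {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            columns = new TreeMap<>();
            rows.put(row, columns);
        }
        NavigableMap<Long, String> versions = columns.get(column);
        if (versions == null) {
            // Reverse order so the highest timestamp comes first.
            versions = new TreeMap<>(Comparator.<Long>reverseOrder());
            columns.put(column, versions);
        }
        versions.put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the cell was never written.
    String getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        TableModelSketch table = new TableModelSketch();
        table.put("row7", "f1:c1", 1290347469553L, "value7");
        table.put("row7", "f1:c1", 1290347581879L, "value10");
        System.out.println(table.getLatest("row7", "f1:c1")); // value10
    }
}
```

Cells that were never written simply have no entry, which is the sense in which NULLs are free.
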
Web Crawl Example
• Canonical use-case for BigTable
• Store web crawl data
  – Table webtable with families content and meta
  – Row is reversed URL with Columns
    • content:data stores the raw crawled data
    • meta:language stores http language header
    • meta:type stores http content-type header
  – While processing raw data for hyperlinks and images, add families links and images
    • links:<rurl> column for each hyperlink
    • images:<rurl> column for each image

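The reversed URL key keeps all pages of one domain next to each other in the sorted table. A small sketch of how such a key could be built, assuming only the hostname components are reversed (as in the “com.cloudera.www” key shown earlier); the class and method names are hypothetical:

```java
// Builds a reversed-hostname row key so pages of one domain sort together.
class ReversedUrlKey {
    static String reverseHost(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) {
                sb.append('.');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseHost("www.cloudera.com"));  // com.cloudera.www
        System.out.println(reverseHost("blog.cloudera.com")); // com.cloudera.blog
    }
}
```

Both example keys share the prefix com.cloudera, so a scan over that prefix would return every crawled host of the domain.
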
HBase Clients
• Native Java Client/API
  – get(Get get), put(Put put), delete(Delete delete)
  – getScanner(Scan scan)
• Non-Java Clients
  – REST server
  – Avro server
  – Thrift server
  – Jython, Scala, Groovy DSL
• TableInputFormat/TableOutputFormat for MapReduce
  – HBase as MapReduce source and/or target
• HBase Shell
  – JRuby shell adding get, put, scan and admin calls

HBase Extensions
• Hive, Pig, Cascading
  – Hadoop-targeted MapReduce tools with HBase integration
• Sqoop
  – Read and write to HBase for further processing in Hadoop
• HBase Explorer, Nutch, Heritrix
• SpringData? (volunteers?)
• Karmasphere?

History of HBase
• November 2006
  – Google releases paper on BigTable
• February 2007
  – Initial HBase prototype created as Hadoop contrib
• October 2007
  – First “usable” HBase (Hadoop 0.15.0)
• January 2008
  – Hadoop becomes TLP, HBase becomes subproject
• October 2008
  – HBase 0.18.1 released
• January 2009
  – HBase 0.19.0
• September 2009
  – HBase 0.20.0 released (Performance Release)
• May 2010
  – HBase becomes TLP
• June 2010
  – HBase 0.89.20100621, first developer release
• Imminent…
  – HBase 0.90 release (any day now)

Current Project Status
• HBase 0.90.x “Advanced Concepts”
  – Master Rewrite – More ZooKeeper
  – Multi-DC Replication
  – Intra Row Scanning
  – Further optimizations on algorithms and data structures
  – Discretionary Access Control
  – Coprocessors

HBase Users
• Adobe
• Facebook
• Mozilla (Socorro)
• StumbleUpon
• Trend Micro (Advanced Threat Research)
• Twitter
• Groups at Yahoo!
• Many startups with amazing services…

Questions?

Comparison with RDBMS
• Very simple example use-case
  – Please note: not necessarily an example of how to implement this with HBase
• System to store a shopping cart
  – Customers, Products, Orders

Simple SQL Schema
CREATE TABLE customers (
  customerid UUID PRIMARY KEY,
  name TEXT, email TEXT)
CREATE TABLE products (
  productid UUID PRIMARY KEY,
  name TEXT, price DOUBLE)
CREATE TABLE orders (
  orderid UUID PRIMARY KEY,
  customerid UUID INDEXED REFERENCES(customers.customerid),
  date TIMESTAMP, total DOUBLE)
CREATE TABLE orderproducts (
  orderid UUID INDEXED REFERENCES(orders.orderid),
  productid UUID REFERENCES(products.productid))

Simple HBase Schema
CREATE TABLE customers (content, orders)
CREATE TABLE products (content)
CREATE TABLE orders (content, products)

Efficient Queries with Both
• Get name, email, orders for customers
• Get name, price for product
• Get customer, stamp, total for order
• Get list of products in order

Where SQL Makes Life Easy
• Joining
  – In a single query, get all products in an order with their product information
• Secondary Indexing
  – Get customerid by email
• Referential Integrity
  – Deleting an order would delete links out of ‘orderproducts’
  – ID updates propagate
• Realtime Analysis
  – GROUP BY and ORDER BY allow for simple statistical analysis

Where HBase Makes Life Easy
• Dataset Scale
  – We have 1M customers and 100M products
  – Product information includes large text datasheet or PDF files
  – Want to track every time a customer looks at a product page
• Read/Write Scale
  – Tables distributed across nodes means reads/writes are fully distributed
  – Writes are extremely fast and require no index updates
• Replication
  – Comes for free
• Batch Analysis
  – Massive and convoluted SQL queries executed serially become efficient MapReduce jobs distributed and executed in parallel

Conclusion
• For small instances of simple/straightforward systems, relational databases offer a much more convenient way to model and access data
  – Can outsource most work to transaction and query engine
  – HBase will force you to pull complexity into Application layer
• Once you need to scale, the properties and flexibility of HBase can relieve you from the headaches associated with scaling an RDBMS

Questions?

HBase Architecture (cont.)
• Based on Log-Structured Merge-Trees (LSM-Trees)
• Inserts are done in write-ahead log first
• Data is stored in memory and flushed to disk on regular intervals or based on size
• Small flushes are merged in the background to keep number of files small
• Reads read memory stores first and then disk based files second
• Deletes are handled with “tombstone” markers
• Atomicity on row level no matter how many columns
  – keeps locking model easy

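The write and read paths listed above can be sketched in a few lines of plain Java. This is a deliberately simplified, single-threaded model and not HBase's implementation: the list standing in for the write-ahead log, the in-memory "store files", the entry-count flush threshold and all names are assumptions made for illustration:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

// Toy log-structured store: log first, then memory, flush to immutable sorted files.
class LsmStoreSketch {
    // Unique marker object, compared by identity so no real value can collide with it.
    private static final String TOMBSTONE = new String("tombstone");

    private final List<String> writeAheadLog = new ArrayList<>();
    private TreeMap<String, String> memStore = new TreeMap<>();
    private final Deque<TreeMap<String, String>> storeFiles = new ArrayDeque<>();
    private final int flushThreshold;

    LsmStoreSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        write(key, value);
    }

    void delete(String key) {
        write(key, TOMBSTONE); // a delete is just another write
    }

    private void write(String key, String value) {
        writeAheadLog.add(key);   // 1. record the edit in the log
        memStore.put(key, value); // 2. apply it in memory
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    void flush() {
        if (memStore.isEmpty()) {
            return;
        }
        storeFiles.addFirst(memStore); // newest file goes to the front
        memStore = new TreeMap<>();
    }

    // Reads check memory first, then files from newest to oldest.
    String get(String key) {
        String value = memStore.get(key);
        Iterator<TreeMap<String, String>> files = storeFiles.iterator();
        while (value == null && files.hasNext()) {
            value = files.next().get(key);
        }
        return value == TOMBSTONE ? null : value;
    }

    // Merge all files into one, newest value wins, tombstones are dropped.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        Iterator<TreeMap<String, String>> oldestFirst = storeFiles.descendingIterator();
        while (oldestFirst.hasNext()) {
            merged.putAll(oldestFirst.next());
        }
        merged.values().removeIf(v -> v == TOMBSTONE);
        storeFiles.clear();
        if (!merged.isEmpty()) {
            storeFiles.addFirst(merged);
        }
    }

    int fileCount() {
        return storeFiles.size();
    }
}
```

Note how a delete never touches older files; the tombstone merely shadows the old value until a full merge removes both.
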
HBase Architecture (cont.)

Write-Ahead-Log (WAL) Flow

Write-Ahead-Log (cont.)

HFile and KeyValue

Raw Data View
$ ./bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -f file:///tmp/hbase-larsgeorge/hbase/testtable/272a63b23bdb5fae759be5192cabc0ce/f1/4992515006010131591 -p
K: row1/f1:/1290345071149/Put/vlen=6 V: value1
K: row2/f1:/1290345078351/Put/vlen=6 V: value2
K: row3/f1:/1290345089750/Put/vlen=6 V: value3
K: row4/f1:/1290345095724/Put/vlen=6 V: value4
K: row5/f1:c1/1290347447541/Put/vlen=6 V: value5
K: row6/f1:c2/1290347461068/Put/vlen=6 V: value6
K: row7/f1:c1/1290347581879/Put/vlen=7 V: value10
K: row7/f1:c1/1290347469553/Put/vlen=6 V: value7
K: row7/f1:c10/1290348157074/DeleteColumn/vlen=0 V:
K: row7/f1:c10/1290347625771/Put/vlen=7 V: value11
K: row7/f1:c11/1290347971849/Put/vlen=7 V: value14
K: row7/f1:c12/1290347979559/Put/vlen=7 V: value15
K: row7/f1:c13/1290347986384/Put/vlen=7 V: value16
K: row7/f1:c2/1290347569785/Put/vlen=6 V: value8
K: row7/f1:c3/1290347575521/Put/vlen=6 V: value9
K: row7/f1:c8/1290347638008/Put/vlen=7 V: value13
K: row7/f1:c9/1290347632777/Put/vlen=7 V: value12

MemStores
• After data is written to the WAL the RegionServer saves KeyValues in memory store
• Flush to disk based on size, see hbase.hregion.memstore.flush.size
• Default size is 64MB
• Uses snapshot mechanism to write flush to disk while still serving from it and accepting new data at the same time
• Snapshots are released when flush has succeeded

Block Cache
• Acts as very large, in-memory distributed cache
• Assigned a large part of the JVM heap in the RegionServer process, see hfile.block.cache.size
• Optimizes reads on subsequent columns and rows
• Has priority to keep “in-memory” column families in cache
    if (inMemory) {
      this.priority = BlockPriority.MEMORY;
    } else {
      this.priority = BlockPriority.SINGLE;
    }
• Cache needs to be used properly to get best read performance
  – Turn off block cache on operations that cause large churn
  – Store related data “close” to each other
• Uses LRU cache with threaded (asynchronous) evictions based on priorities

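To illustrate only the least-recently-used part of this, here is a minimal LRU cache built on java.util.LinkedHashMap. It is a sketch of the eviction idea, not of HBase's block cache: it has no priorities, evicts synchronously on insert, and counts entries rather than bytes; the class name is made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain LRU cache: the entry that was accessed longest ago is evicted first.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    LruCacheSketch(int capacity) {
        super(16, 0.75f, true); // true = keep entries in access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("block1", "a");
        cache.put("block2", "b");
        cache.get("block1");      // block1 is now the most recently used
        cache.put("block3", "c"); // evicts block2
        System.out.println(cache.keySet()); // [block1, block3]
    }
}
```

The example also shows why large scans cause churn: every newly read block pushes an older, possibly hot, block out.
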
Compactions
• General Concepts
  – Two types: Minor and Major Compactions
  – Asynchronous and transparent to client
  – Manage file bloat from MemStore flushes
• Minor Compactions
  – Combine last “few” flushes
  – Triggered by number of storage files
• Major Compactions
  – Rewrite all storage files
  – Drop deleted data and those values exceeding TTL and/or number of versions
  – Triggered by time threshold
  – Cannot be scheduled automatically starting at a specific time (bummer!)
  – May (most definitely) tax overall HDFS IO performance
Tip: Disable major compactions and schedule to run manually (e.g. cron) at off-peak times

Region Splits
• Triggered by configured maximum file size of any store file
  – This is checked directly after the compaction call to ensure store files are actually approaching the threshold
• Runs as asynchronous thread on RegionServer
• Splits are fast and nearly instant
  – Reference files point to original region files and represent each half of the split
• Compactions take care of splitting original files into new region directories

Replication

Questions?

MapReduce with HBase
• Framework to use HBase as source and/or sink for MapReduce jobs
• Thin layer over native Java API
• Provides helper class to set up jobs more easily
    TableMapReduceUtil.initTableMapperJob(
      "test", scan, MyMapper.class,
      ImmutableBytesWritable.class,
      Result.class, job);
    TableMapReduceUtil.initTableReducerJob(
      "table", MyReducer.class, job);

MapReduce with HBase (cont.)
• Special use-case in regards to Hadoop
• Tables are sorted and have unique keys
  – Often we do not need a Reducer phase
  – Combiner not needed
• Need to make sure load is distributed properly by randomizing keys (or use bulk import)
• Partial or full table scans possible
• Scans are very efficient as they make use of block caches
  – But then make sure you do not create too much churn, or better switch caching off when doing full table scans
• Can use filters to limit rows being processed

TableInputFormat
• Transforms an HBase table into a source for MapReduce jobs
• Internally uses a TableRecordReader which wraps a Scan instance
  – Supports restarts to handle temporary issues
• Splits table by region boundaries and stores current region locality

TableOutputFormat
• Allows use of an HBase table as output target
• Put and Delete support from mapper or reducer class
• Uses TableOutputCommitter to write data
• Disables auto-commit on table to make use of client side write buffer
• Handles final flush in close()

HFileOutputFormat
• Used to bulk load data into HBase
• Bypasses normal API and generates low-level store files
• Prepares files for final bulk insert
• Needs special handling of sort order and partitioning
• Only supports one column family (for now)
• Can load bulk updates into existing tables

MapReduce Helper
• TableMapReduceUtil
• IdentityTableMapper
  – Passes on key and value, where value is a Result instance and key is set to value.getRow()
• IdentityTableReducer
  – Stores values into HBase, must be Put or Delete instances
• HRegionPartitioner
  – Not set by default, use it to control partitioning on Hadoop level

Custom MapReduce over Tables
• No requirement to use provided framework
• Can read from or write to one or many tables in mapper and reducer
• Can split not on regions but arbitrary boundaries
• Make sure to use write buffer in OutputFormat to get best performance (do not forget to call flushCommits() at the end!)

Questions?

Advanced Techniques
• Key/Table Design
• DDI
• Salting
• Hashing vs. Sequential Keys
• ColumnFamily vs. Column
• Using BloomFilter
• Data Locality
• checkAndPut() and checkAndDelete()
• Coprocessors

Key/Table Design
• Crucial to gain best performance
  – Why do I need to know? Well, you also need to know that an RDBMS only works well when columns are indexed and the query plan is OK
• Absence of secondary indexes forces use of row key or column name sorting
• Transfer multiple indexes into one
  – Generate large table -> Good since fits architecture and spreads across cluster

DDI
• Stands for Denormalization, Duplication and Intelligent Keys
• Needed to overcome shortcomings of architecture
• Denormalization -> Replacement for JOINs
• Duplication -> Design for reads
• Intelligent Keys -> Implement indexing and sorting, optimize reads

Pre-materialize Everything
• Achieve one read per customer request if possible
• Otherwise keep at lowest number
• Reads between 10ms (cache miss) and 1ms (cache hit)
• Use MapReduce to compute exact values in batch
• Store and merge updates live
• Use incrementColumnValue
Motto: “Design for Reads”

Salting
• Prefix row keys to gain spread
• Use well known or numbered prefixes
• Use modulo to spread across servers
• Enforce that common data stays close to each other for subsequent scanning or MapReduce processing
    0_rowkey1, 1_rowkey2, 2_rowkey3
    0_rowkey4, 1_rowkey5, 2_rowkey6
• Sorted by prefix first
    0_rowkey1
    0_rowkey4
    1_rowkey2
    1_rowkey5
    …

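A salt prefix can be derived from the key itself so that readers can recompute it later. The sketch below assumes the bucket is chosen as the key's hash code modulo the number of buckets; the slide does not say how the prefix is picked, so this is one possible choice, and the class and method names are hypothetical:

```java
import java.util.TreeSet;

// Prefixes each row key with a bucket number so writes spread over several regions.
class SaltedKeySketch {
    static String salt(String rowKey, int buckets) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(rowKey.hashCode(), buckets);
        return bucket + "_" + rowKey;
    }

    public static void main(String[] args) {
        // A TreeSet sorts like the table would: by prefix first, then by original key.
        TreeSet<String> sorted = new TreeSet<>();
        for (int i = 1; i <= 6; i++) {
            sorted.add(salt("rowkey" + i, 3));
        }
        for (String key : sorted) {
            System.out.println(key);
        }
    }
}
```

With this scheme a point lookup recomputes the prefix from the key, while a full range scan has to issue one scan per bucket.
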
Hashing vs. Sequential Keys
• Use hashes for best spread
  – Use for example MD5 to be able to recreate key
    • Key = MD5(customerID)
  – Counterproductive for range scans
• Use sequential keys for locality
  – Makes use of block caches
  – May tax one server overly, may be avoided by salting or splitting regions while keeping them small

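Key = MD5(customerID) can be computed with the JDK's MessageDigest. A minimal sketch, assuming the customer ID is a string encoded as UTF-8; the class and helper names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Derives a fixed-length, evenly distributed row key from a customer ID.
class HashedKeySketch {
    static byte[] md5Key(String customerId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(customerId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Hex form, handy for printing or for string-based row keys.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(md5Key("abc"))); // 900150983cd24fb0d6963f7d28e17f72
    }
}
```

Since the hash is deterministic, the key can be recreated from the customer ID at read time, but neighbouring IDs end up far apart, which is why range scans suffer.
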
ColumnFamily vs. Column
• Use only a few column families
  – Many families cause many files that need to stay open per region plus class overhead per family
• Best used when there is a logical separation between data and meta columns
• Sorting per family can be used to convey application logic or access pattern
• Define compression or in-memory attributes to optimize access and performance

Using Bloomfilters
• Defines a filter that allows determining whether a store file does not contain a row or column
• Error rate can control overhead but is usually very low, 1% or less
• Stored with each storage file on flush and compactions
• Good for large regions with many distinct row keys and many expected misses
• Trick: “Optimize” compaction to gain advantage while scanning files

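The key property is that a Bloom filter can say "definitely not here" but never "definitely here". The following sketch shows that behaviour with a BitSet and two cheap hash functions; it is not HBase's Bloom filter implementation, and the hash mixing, sizes and names are arbitrary choices for illustration:

```java
import java.util.BitSet;

// Minimal Bloom filter: no false negatives, occasional false positives.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9; // cheap second hash, illustrative only
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was never added. true: the key may have been added.
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("row1");
        System.out.println(filter.mightContain("row1")); // true
    }
}
```

A read path would consult one such filter per store file and skip every file whose filter answers false.
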
Data Locality
• Provided by DFSClient
• Transparent for HBase
• After restart, data may not be local
  – Work is done to improve on this
• Over time, and caused by compactions, data is stored where it is needed, i.e. local to RegionServer
• Could enforce major compaction before starting MapReduce jobs

checkAndPut() and checkAndDelete()
• Helps with atomic operations on single row
• Absence of value is treated as check for non-existence
    public boolean checkAndPut(final byte[] row,
      final byte[] family, final byte[] qualifier,
      final byte[] value, final Put put)
    public boolean checkAndDelete(final byte[] row,
      final byte[] family, final byte[] qualifier,
      final byte[] value, final Delete delete)

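The semantics of these calls are those of a compare-and-set: the mutation is applied only if the current cell value matches the expected one, and a null expected value means the cell must not exist. The sketch below models just that behaviour on an in-memory map; it does not use the HBase client, and the class with its String-based cells is an assumption made for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// In-memory model of check-and-mutate semantics on single cells.
class CheckAndPutSketch {
    private final Map<String, String> cells = new HashMap<>();

    // Writes newValue only if the current value equals expected.
    // expected == null means "the cell must not exist yet".
    synchronized boolean checkAndPut(String cell, String expected, String newValue) {
        if (!Objects.equals(cells.get(cell), expected)) {
            return false;
        }
        cells.put(cell, newValue);
        return true;
    }

    // Deletes the cell only if the current value equals expected.
    synchronized boolean checkAndDelete(String cell, String expected) {
        if (!Objects.equals(cells.get(cell), expected)) {
            return false;
        }
        cells.remove(cell);
        return true;
    }

    synchronized String get(String cell) {
        return cells.get(cell);
    }

    public static void main(String[] args) {
        CheckAndPutSketch store = new CheckAndPutSketch();
        System.out.println(store.checkAndPut("row1/f1:c1", null, "v1")); // true, cell did not exist
        System.out.println(store.checkAndPut("row1/f1:c1", null, "v2")); // false, cell exists now
        System.out.println(store.checkAndPut("row1/f1:c1", "v1", "v2")); // true, value matched
    }
}
```

Because check and mutation happen as one step, two clients racing to create the same cell cannot both succeed.
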
Locks
• Locks can be set explicitly for client operations
• Lock a row from modifications by other clients
  – Clients block on locked rows -> keep locking reasonably short!
• Use HTable’s lockRow to acquire and unlockRow to release
• Locks are guarded by leases on RegionServer and configured with hbase.regionserver.lease.period
  – By default set to 60 seconds
  – Leases are refreshed by any call that uses the lock, e.g. get(), put() or delete()

Coprocessors
• New addition to feature set
• Based on talk by Jeff Dean at LADIS 2009
  – Run arbitrary code on each region in RegionServer
  – High level call interface for clients
    • Calls are addressed to rows or ranges of rows while Coprocessors client library resolves locations
    • Calls to multiple rows are automatically split
  – Provides model for distributed services
    • Automatic scaling, load balancing, request routing

Coprocessors in HBase
• Use for efficient computational parallelism
• Secondary indexing (HBASE-2038)
• Column Aggregates (HBASE-1512)
  – SQL-like sum(), avg(), max(), min(), etc.
• Access control (HBASE-3025, HBASE-3045)
  – Provide basic access control
• Table Metacolumns
• New filtering
  – predicate pushdown
• Table/Region access statistics
• HLog extensions (HBASE-3257)

Coprocessors in HBase (cont.)
• Java classes implementing interfaces
• Load through configuration or table attribute
    'COPROCESSOR$1' => 'hdfs://localhost:8020/hbase/coprocessors/test.jar:Test:1000'
    'COPROCESSOR$2' => '/hbase/coprocessors/test2.jar:AnotherTest:1001'
• Can be chained like servlet filters
• Dynamic RPC allows functional extensibility

Coprocessor and RegionObserver
• The Coprocessor interface defines these hooks
  – preOpen, postOpen: Called before and after the region is reported as online to the master
  – preFlush, postFlush: Called before and after the memstore is flushed into a new store file
  – preCompact, postCompact: Called before and after compaction
  – preSplit, postSplit: Called after the region is split
  – preClose, postClose: Called before and after the region is reported as closed to the master

Coprocessor and RegionObserver (cont.)
• The RegionObserver interface defines these hooks:
  – preGet, postGet: Called before and after a client makes a Get request
  – preExists, postExists: Called before and after the client tests for existence using a Get
  – prePut, postPut: Called before and after the client stores a value
  – preDelete, postDelete: Called before and after the client deletes a value
  – preScannerOpen, postScannerOpen: Called before and after the client opens a new scanner
  – preScannerNext, postScannerNext: Called before and after the client asks for the next row on a scanner
  – preScannerClose, postScannerClose: Called before and after the client closes a scanner
  – preCheckAndPut, postCheckAndPut: Called before and after the client calls checkAndPut()
  – preCheckAndDelete, postCheckAndDelete: Called before and after the client calls checkAndDelete()

RegionObserver Call Sequence

Example
public class RBACCoprocessor extends BaseRegionObserver {
  @Override
  public List preGet(CoprocessorEnvironment e, Get get,
      List results) throws CoprocessorException {
    // check permissions...
    if (access_not_allowed) {
      throw new AccessDeniedException(
        "User is not allowed to access.");
    }
    return results;
  }
  // override prePut(), preDelete(), etc.
}

Endpoint and Dynamic RPC

HBase and Indexing
• Secondary indexing or search?
• HBasene
  – Port of Lucandra
• Nutch, Solr, Lucene
• ITHBase and IHBase
  – Moved out from contrib into GitHub
• HSearch

Secondary Index or Search?
• Can keep “lookup” tables
  – But could also be in the same table
  – Could even be in the same row
• Use ColumnFamily per index (but keep number low)
• Make use of column sorting
• Does it fit your access pattern?
• How to guarantee updates?
  – Use some sort of “transaction”
• Offer sorting in one direction

Example: HBasene
• Based on Lucandra
• Implements Lucene API over HBase
• Stores term vector as rows in a table
  – Each row is one term and the columns are the index with value being the position in the text
• Document fields are stored as columns using “field/term” combinations
• Perform boolean operations in code

ITHBase and IHBase
• Provided by contributors
• May not be supporting latest HBase release
• Indexed-Transactional HBase
  – Extends RegionServer code
  – Intrusive
  – Provides notion of Transactions over rows
  – Maintains lookup tables
• Indexed HBase
  – Implemented by Powerset/Microsoft
  – Support?
  – Intrusive
  – Keeps state in memory
  – Hooks into region operations to maintain state
  – Replace with Coprocessors (HBASE-2038)

Custom Search Index
• Facebook is using Cassandra to power inbox search
  – 150TB of data stored
  – Row is user inbox ID
  – Uses super columns to index terms
  – Each column is document that contains the term
• Make use of partial scans
  – Can be done on row and column level
    “Find email with albert*”
• Sorting of columns allows for performance optimizations during term retrieval

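A query like “albert*” maps onto a partial scan because all terms with that prefix sit next to each other in sort order. The sketch below shows the idea on a java.util.TreeMap rather than on HBase or Cassandra; the upper bound built by appending Character.MAX_VALUE is a simplification that would miss keys containing that character right after the prefix, and all names and sample data are made up:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Prefix lookup on sorted keys: a range scan from the prefix to just past it.
class PrefixScanSketch {
    static SortedMap<String, String> scanPrefix(TreeMap<String, String> sorted, String prefix) {
        // Start key is inclusive, stop key is exclusive.
        return sorted.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeMap<String, String> terms = new TreeMap<>();
        terms.put("albert", "doc1");
        terms.put("alberta", "doc2");
        terms.put("albertson", "doc3");
        terms.put("alice", "doc4");
        terms.put("bob", "doc5");
        System.out.println(scanPrefix(terms, "albert").keySet()); // [albert, alberta, albertson]
    }
}
```

The scan touches only the matching stretch of the sorted data, which is what makes term retrieval cheap.
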
Questions?