Discussion:
RAID vs. JBOD
David B. Ritch
2009-01-11 21:23:01 UTC
Permalink
How well does Hadoop handle multiple independent disks per node?

I have a cluster with 4 identical disks per node. I plan to use one
disk for OS and temporary storage, and dedicate the other three to
HDFS. Our IT folks have some disagreement as to whether the three disks
should be striped, or treated by HDFS as three independent disks. Could
someone with more HDFS experience comment on the relative advantages and
disadvantages to each approach?

Here are some of my thoughts. It's a bit easier to manage a 3-disk
striped partition, and we wouldn't have to worry about balancing files
between them. Single-file I/O should be considerably faster. On the
other hand, I would expect typical use to involve multiple simultaneous
file reads and writes. I would expect Hadoop to be able to manage
read/write to/from the disks independently. Managing 3 streams to 3
independent devices would likely result in less disk head movement, and
therefore better performance. I would expect Hadoop to be able to
balance load between the disks fairly well. Availability doesn't really
differentiate between the two approaches - if a single disk dies, the
striped array would go down, but all its data should be replicated on
another datanode, anyway. And besides, I understand that the datanode
will shut the node down even if only one of the three independent disks
crashes.

So - anyone want to agree or disagree with these thoughts? Anyone have
any other ideas, or - better - benchmarks and experience with layouts
like these two?

Thanks!

David
Jason Venner
2009-01-11 22:55:22 UTC
Permalink
If you set your dfs.data.dir to a comma-separated list of directories,
you will do fine.

<property>
<name>dfs.data.dir</name>
<value>${hadoop.tmp.dir}/dfs/data</value>
<description>Determines where on the local filesystem a DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices.
Directories that do not exist are ignored.
</description>
</property>

The namenode does a lot of small writes, so RAID 1 or RAID 10 is better
for it.

Also, mounting the dfs.data.dir filesystems with noatime and nodiratime
makes a significant performance difference.
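
For example (the mount points here are hypothetical; substitute your
own), a datanode with three dedicated disks might override the default
like this in hadoop-site.xml:

<property>
<name>dfs.data.dir</name>
<value>/data-1/dfs/data,/data-2/dfs/data,/data-3/dfs/data</value>
</property>

And the corresponding /etc/fstab entries with the mount options above
would look something like (device names again hypothetical):

/dev/sdb1  /data-1  ext3  defaults,noatime,nodiratime  0  2
/dev/sdc1  /data-2  ext3  defaults,noatime,nodiratime  0  2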
David B. Ritch
2009-01-12 12:35:34 UTC
Permalink
Thank you - yes, I'm fairly confident that it will work either way. I'm
trying to find out whether there is an established best practice, and
the performance impact of the decision between RAID 0 and JBOD.
I'll check out noatime and nodiratime for their effect on our
performance - thanks for that suggestion, as well.

David
Brian Vargas
2009-01-12 13:03:08 UTC
Permalink
David,

As I understand it, you will theoretically get better performance from a
JBOD configuration than a RAID configuration. In a RAID configuration,
you have to wait for the slowest disk in the array to complete before
the entire IO operation can complete, making the average IO time
equivalent to the slowest disk. In a JBOD configuration, operations on
the faster disks will complete independently of the slowest disk, making
the average IO time for the node necessarily faster than that of the
slowest disk (unless all disks are equally slow).

Whether it would be a noticeable gain is questionable, though. I doubt
it would make enough difference to provide a good reason to depart from
whichever you feel is easiest to manage.

And you don't need the redundancy of RAID, since HDFS does that using
replication between nodes, so there's no loss there.

Brian
Colin Evans
2009-01-12 17:17:00 UTC
Permalink
Currently, Hadoop does round-robin allocation of blocks and data
across multiple JBOD disks. We did some testing and found that there
weren't significant differences between RAID-0 and JBOD. We went with
JBOD because we figured that RAID-0 has a higher failure rate than
JBOD -- any disk failure in a 3-disk RAID-0 configuration causes the
whole node to go down, but if there is a single disk failure in a JBOD
configuration, Hadoop will go on serving from the other disks.
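
The round-robin placement described above can be pictured with a toy
sketch (this is illustrative Python, not Hadoop's actual code; the
directory names are made up):

```python
# Each new block is written to the next data directory in turn,
# cycling through the configured dfs.data.dir volumes.
from itertools import cycle

volumes = ["/data-1/dfs/data", "/data-2/dfs/data", "/data-3/dfs/data"]
chooser = cycle(volumes)

# Place six blocks: each volume receives every third block.
placements = [next(chooser) for _ in range(6)]
print(placements)
```

Over time this keeps the volumes roughly evenly filled, which is why
JBOD nodes rarely need manual balancing between their own disks.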
David Ritch
2009-01-12 21:00:18 UTC
Permalink
Thank you! I'm glad to hear that you have actually tested this.

I believe that a failure of any disk - even with JBOD - will cause the
datanode to bring the node down. Presumably, we could bring it right
back up, but
this does sort of diminish the availability argument for JBOD.

Sounds like it's basically a toss-up. I'm a bit concerned about the
potential for uneven distribution - both of amount of data, and of transfer
load - across the spindles. Unless I hear otherwise, I will probably go
with RAID-0.
John Kane
2009-01-12 21:14:42 UTC
Permalink
We have been using 2U boxes with 12x1TB disks. The first disk is used for
OS/Scratch/Laziness, the other 11 disks are formatted as individual (~900GB)
volumes and then mounted separately. We have /data-[a-k] mounted and
configured in our cluster and have not had any issues with unbalanced
loading. We do have varied sizes (from small to really large) files and
hadoop just seems to figure it out for us.

We have had single drive failures. We just bounce the datanode software and
all is happy. When we get around to replacing a failed drive (I'm just about
to go do one), we format it, mount it, and then bounce the datanode. That
replacement volume is then not very well balanced, but that is not typically
an issue for us; we add lots of data every day, so it does get filled up. We
have run the rebalancer to address huge disparities in node utilization (like
when we add a new node). That just made us feel better more than anything
else.

Cheers
Runping Qi
2009-01-14 21:55:33 UTC
Permalink
Hi,

We at Yahoo did some Hadoop benchmarking experiments on clusters with JBOD
and RAID0. We found that under heavy loads (such as gridmix), JBOD cluster
performed better.

Gridmix tests:

Load: gridmix2
Cluster size: 190 nodes
Test results:

RAID0: 75 minutes
JBOD: 67 minutes
Difference: 10%

Tests on HDFS write performance

We ran map-only jobs writing data to dfs concurrently on different clusters.
The overall dfs write throughputs on the JBOD cluster were 30% (with a
58-node cluster) and 50% (with an 18-node cluster) better than those on the
RAID0 cluster, respectively.

To understand why, we did some file level benchmarking on both clusters.
We found that the file write throughput on a JBOD machine is 30% higher than
that on a comparable machine with RAID0. This performance difference may be
explained by the fact that the throughputs of different disks can vary by
30% to 50%. With such variation, the overall throughput of a RAID0 system
may be bottlenecked by the slowest disk.
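
The arithmetic behind that bottleneck can be sketched as follows (the
per-disk numbers are invented for illustration; only the roughly 30%
spread mirrors what we measured):

```python
def raid0_throughput(disk_mb_s):
    # RAID-0 stripes every write across all members, so each stripe
    # completes at the pace of the slowest disk in the set.
    return len(disk_mb_s) * min(disk_mb_s)

def jbod_throughput(disk_mb_s):
    # JBOD streams run independently; each disk contributes its own rate.
    return sum(disk_mb_s)

disks = [100, 90, 70]  # MB/s, ~30% spread between fastest and slowest
print(raid0_throughput(disks))  # 210
print(jbod_throughput(disks))   # 260
```

With identical disks the two would tie; the gap comes entirely from the
variation between nominally identical drives.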


-- Runping
Steve Loughran
2009-01-15 10:41:13 UTC
Permalink
This is really interesting. Thank you for sharing these results!

Presumably the servers were all set up with "nominally" homogeneous
hardware? And yet the variations still existed. That would be something
to experiment with on new versus old clusters to see if it gets worse
over time.

Here we have a batch of desktop workstations, all bought at the same
time to the same spec, but one of them, "lucky", is more prone to race
conditions than any of the others. We don't know why, and assume it's to
do with the (multiple) Xeon CPU chips being at different ends of the
bell curve or something. All we know is: test on that box before
shipping, to find race conditions early.

-steve
Runping Qi
2009-01-15 15:28:05 UTC
Permalink
Yes, all the machines in the tests are new, with the same spec.
The 30% to 50% throughput variations were observed between disks on the
same machines.

Runping