Discussion:
How to deal with "too many fetch failures"?
yang song
2009-08-19 05:23:13 UTC
Permalink
Hello, all
I have run into the "too many fetch failures" problem when I submit a big
job (e.g. tasks > 10000). I know this error occurs when several reducers
are unable to fetch a given map output, but I'm sure the slaves can
contact each other.
I'm puzzled and don't know how to deal with it. Maybe the network
transfer is bad, but how can I solve it? Would increasing
mapred.reduce.parallel.copies and mapred.reduce.copy.backoff make a
difference?
Thank you!
Inifok
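
Both properties mentioned above are per-job settings. A minimal sketch of setting them from the Java side follows; the values are purely illustrative, and, as the replies below suggest, raising them may not address the underlying cause.

import org.apache.hadoop.mapred.JobConf;

public class ShuffleSettings {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // More map-output fetches in flight per reducer (the usual default is 5).
    conf.setInt("mapred.reduce.parallel.copies", 10);
    // Upper bound, in seconds, on how long one map-output fetch is retried
    // before it is declared failed (300 is the usual default).
    conf.setInt("mapred.reduce.copy.backoff", 300);
    System.out.println(conf.getInt("mapred.reduce.parallel.copies", 5));
  }
}
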
Ted Dunning
2009-08-19 07:44:23 UTC
Permalink
Which version of hadoop are you running?
Post by yang song
Hello, all
I have run into the "too many fetch failures" problem when I submit a big
job (e.g. tasks > 10000). I know this error occurs when several reducers
are unable to fetch a given map output, but I'm sure the slaves can
contact each other.
I'm puzzled and don't know how to deal with it. Maybe the network
transfer is bad, but how can I solve it? Would increasing
mapred.reduce.parallel.copies and mapred.reduce.copy.backoff make a
difference?
Thank you!
Inifok
--
Ted Dunning, CTO
DeepDyve
yang song
2009-08-19 12:19:53 UTC
Permalink
I'm sorry, the version is 0.19.1
Post by Ted Dunning
Which version of hadoop are you running?
Post by yang song
Hello, all
I have run into the "too many fetch failures" problem when I submit a big
job (e.g. tasks > 10000). I know this error occurs when several reducers
are unable to fetch a given map output, but I'm sure the slaves can
contact each other.
I'm puzzled and don't know how to deal with it. Maybe the network
transfer is bad, but how can I solve it? Would increasing
mapred.reduce.parallel.copies and mapred.reduce.copy.backoff make a
difference?
Thank you!
Inifok
--
Ted Dunning, CTO
DeepDyve
Ted Dunning
2009-08-19 18:17:26 UTC
Permalink
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
yang song
2009-08-20 05:39:51 UTC
Permalink
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
Ted Dunning
2009-08-20 06:25:55 UTC
Permalink
I think the problem I am remembering was poor recovery from this kind of
failure. The underlying fault is likely poor connectivity between your
machines. Test that all members of your cluster can reach all the others on
all ports used by Hadoop.

See here for hints: http://markmail.org/message/lgafou6d434n2dvx
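
A quick way to act on this is to try a plain TCP connection from each node to every other node on the ports Hadoop listens on. A minimal sketch, assuming the TaskTracker HTTP port (50060 was the usual default in this era); the host name and port below are placeholders to adapt to your cluster:

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
  public static void main(String[] args) {
    // Placeholders: pass the real host and port on the command line.
    String host = args.length > 0 ? args[0] : "slave1";
    int port = args.length > 1 ? Integer.parseInt(args[1]) : 50060;
    Socket s = new Socket();
    try {
      s.connect(new InetSocketAddress(host, port), 5000); // 5 second timeout
      System.out.println("OK: reached " + host + ":" + port);
      s.close();
    } catch (Exception e) {
      System.out.println("FAILED: " + host + ":" + port + " (" + e.getMessage() + ")");
    }
  }
}

Run it from every node against every other node, and repeat for the other daemon ports your configuration uses (NameNode, DataNode, JobTracker).
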
Post by yang song
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
--
Ted Dunning, CTO
DeepDyve
Jason Venner
2009-08-20 06:59:06 UTC
Permalink
The number one cause of this is something that causes the connection used to
fetch a map output to fail. I have seen:
1) a firewall
2) misconfigured IP addresses (i.e. the tasktracker attempting the fetch
received an incorrect IP address when it looked up the name of the
tasktracker holding the map segment)
3) rarely, the HTTP server on the serving tasktracker is overloaded due to
insufficient threads or listen backlog; this can happen if the number of
fetches per reduce is large and the number of reduces or the number of maps
is very large (a rough estimate of this load is sketched below)

There are probably other cases. This recently happened to me when I had 6000
maps and 20 reducers on a 10-node cluster, which I believe was case 3 above.
Since I didn't actually need the reduce output (I got my summary data via
counters in the map phase), I never re-tuned the cluster.
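
To get a rough feel for case 3, compare the worst-case number of simultaneous fetches against the serving tasktracker's HTTP thread pool. The sketch below only does that arithmetic; the defaults it assumes (5 parallel copies per reducer, 40 threads from tasktracker.http.threads) should be checked against your own configuration.

public class FetchLoadEstimate {
  public static void main(String[] args) {
    int reducers = 20;          // numbers from the scenario above
    int tasktrackers = 10;
    int parallelCopies = 5;     // assumed mapred.reduce.parallel.copies default
    int httpThreads = 40;       // assumed tasktracker.http.threads default

    int fetchesInFlight = reducers * parallelCopies;
    double perNodeAverage = (double) fetchesInFlight / tasktrackers;
    System.out.println("cluster-wide fetches in flight: " + fetchesInFlight);
    System.out.println("average per tasktracker: " + perNodeAverage
        + " (HTTP threads available: " + httpThreads + ")");
    // The average can look harmless; trouble starts when fetches are not
    // spread evenly, e.g. many reducers piling onto the few nodes that have
    // finished map output, as Koji describes below.
  }
}
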
Post by Ted Dunning
I think the problem I am remembering was poor recovery from this kind of
failure. The underlying fault is likely poor connectivity between your
machines. Test that all members of your cluster can reach all the others on
all ports used by Hadoop.
See here for hints: http://markmail.org/message/lgafou6d434n2dvx
Post by yang song
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
--
Ted Dunning, CTO
DeepDyve
--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
Koji Noguchi
2009-08-20 17:14:17 UTC
Permalink
Probably unrelated to your problem, but one extreme case I've seen:
a user's job with large gzip inputs (non-splittable), 20 mappers and 800
reducers, where each map output was around 20G.
Too many reducers were hitting a single node as soon as a mapper finished.

I think we tried something like

mapred.reduce.parallel.copies=1
(to reduce the number of reducer copier threads)
mapred.reduce.slowstart.completed.maps=1.0
(so that reducers would have all 20 map outputs to pull from, instead of 800
reducers hitting one mapper node as soon as it finishes.)


Koji
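
If you want to try the same mitigation on a single job, here is a minimal sketch of setting those two properties from the Java side (assuming your version recognizes mapred.reduce.slowstart.completed.maps; whether it helps will depend on the job):

import org.apache.hadoop.mapred.JobConf;

public class ThrottledShuffle {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // One copier thread per reducer, as described above.
    conf.setInt("mapred.reduce.parallel.copies", 1);
    // Don't start any reducer until every map has finished.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
    // ... configure mapper, reducer and paths as usual, then submit the job.
  }
}
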
Post by Jason Venner
The number one cause of this is something that causes the connection used to
fetch a map output to fail. I have seen:
1) a firewall
2) misconfigured IP addresses (i.e. the tasktracker attempting the fetch
received an incorrect IP address when it looked up the name of the
tasktracker holding the map segment)
3) rarely, the HTTP server on the serving tasktracker is overloaded due to
insufficient threads or listen backlog; this can happen if the number of
fetches per reduce is large and the number of reduces or the number of maps
is very large
There are probably other cases. This recently happened to me when I had 6000
maps and 20 reducers on a 10-node cluster, which I believe was case 3 above.
Since I didn't actually need the reduce output (I got my summary data via
counters in the map phase), I never re-tuned the cluster.
Post by Ted Dunning
I think the problem I am remembering was poor recovery from this kind of
failure. The underlying fault is likely poor connectivity between your
machines. Test that all members of your cluster can reach all the others on
all ports used by Hadoop.
See here for hints: http://markmail.org/message/lgafou6d434n2dvx
Post by yang song
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
--
Ted Dunning, CTO
DeepDyve
Arun C Murthy
2009-08-19 16:31:21 UTC
Permalink
I'd dig around a bit more to check whether it's caused by a specific set of
nodes... i.e. are maps on specific tasktrackers failing in this manner?

Arun
Post by yang song
Hello, all
I have run into the "too many fetch failures" problem when I submit a big
job (e.g. tasks > 10000). I know this error occurs when several reducers
are unable to fetch a given map output, but I'm sure the slaves can
contact each other.
I'm puzzled and don't know how to deal with it. Maybe the network
transfer is bad, but how can I solve it? Would increasing
mapred.reduce.parallel.copies and mapred.reduce.copy.backoff make a
difference?
Thank you!
Inifok
谭东
2009-08-20 06:44:32 UTC
Permalink
The fewer reducers there are, the more data each reducer has to handle, the
more network traffic each reducer generates, and the more likely any one
reducer is to fail.
So increase the number of reducers and try again.
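
The number of reducers is a per-job setting. A minimal sketch, with a purely illustrative count (whether more reducers actually helps here is debatable, given the other replies pointing at connectivity):

import org.apache.hadoop.mapred.JobConf;

public class MoreReducers {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setNumReduceTasks(200); // illustrative count; this sets mapred.reduce.tasks
    // ... configure the rest of the job and submit as usual.
  }
}
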
Post by Ted Dunning
I think the problem I am remembering was poor recovery from this kind of
failure. The underlying fault is likely poor connectivity between your
machines. Test that all members of your cluster can reach all the others on
all ports used by Hadoop.
See here for hints: http://markmail.org/message/lgafou6d434n2dvx
Post by yang song
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
--
Ted Dunning, CTO
DeepDyve