Discussion:
How to deal with "too many fetch failures"?
yang song
2009-08-19 05:23:13 UTC
Permalink
Hello, all
I have run into the "too many fetch failures" problem when I submit a big
job (e.g. tasks > 10000). I know this error occurs when several reducers
are unable to fetch a given map output, but I'm sure the slaves can
contact each other.
I'm puzzled and don't know how to deal with it. Maybe the network
transfer is bad, but how can I solve it? Would increasing
mapred.reduce.parallel.copies and mapred.reduce.copy.backoff make a
difference?
Thank you!
Inifok
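
Both properties mentioned above are per-job settings. A minimal sketch of setting them from the Java side follows; the values are purely illustrative, and, as the replies below suggest, raising them may not address the underlying cause.

import org.apache.hadoop.mapred.JobConf;

public class ShuffleSettings {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // More map-output fetches in flight per reducer (the usual default is 5).
    conf.setInt("mapred.reduce.parallel.copies", 10);
    // Upper bound, in seconds, on how long one map-output fetch is retried
    // before it is declared failed (300 is the usual default).
    conf.setInt("mapred.reduce.copy.backoff", 300);
    System.out.println(conf.getInt("mapred.reduce.parallel.copies", 5));
  }
}
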
Ted Dunning
2009-08-19 07:44:23 UTC
Permalink
Which version of hadoop are you running?
Post by yang song
Hello, all
I have run into the "too many fetch failures" problem when I submit a big
job (e.g. tasks > 10000). I know this error occurs when several reducers
are unable to fetch a given map output, but I'm sure the slaves can
contact each other.
I'm puzzled and don't know how to deal with it. Maybe the network
transfer is bad, but how can I solve it? Would increasing
mapred.reduce.parallel.copies and mapred.reduce.copy.backoff make a
difference?
Thank you!
Inifok
--
Ted Dunning, CTO
DeepDyve
yang song
2009-08-19 12:19:53 UTC
Permalink
I'm sorry, the version is 0.19.1
Post by Ted Dunning
Which version of hadoop are you running?
Post by yang song
Hello, all
I have run into the "too many fetch failures" problem when I submit a big
job (e.g. tasks > 10000). I know this error occurs when several reducers
are unable to fetch a given map output, but I'm sure the slaves can
contact each other.
I'm puzzled and don't know how to deal with it. Maybe the network
transfer is bad, but how can I solve it? Would increasing
mapred.reduce.parallel.copies and mapred.reduce.copy.backoff make a
difference?
Thank you!
Inifok
--
Ted Dunning, CTO
DeepDyve
Ted Dunning
2009-08-19 18:17:26 UTC
Permalink
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
yang song
2009-08-20 05:39:51 UTC
Permalink
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
Ted Dunning
2009-08-20 06:25:55 UTC
Permalink
I think the problem I am remembering was poor recovery from this kind of
failure. The underlying fault is likely poor connectivity between your
machines. Test that all members of your cluster can reach all the others on
all ports used by Hadoop.

See here for hints: http://markmail.org/message/lgafou6d434n2dvx
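
A quick way to act on this is to try a plain TCP connection from each node to every other node on the ports Hadoop listens on. A minimal sketch, assuming the TaskTracker HTTP port (50060 was the usual default in this era); the host name and port below are placeholders to adapt to your cluster:

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
  public static void main(String[] args) {
    // Placeholders: pass the real host and port on the command line.
    String host = args.length > 0 ? args[0] : "slave1";
    int port = args.length > 1 ? Integer.parseInt(args[1]) : 50060;
    Socket s = new Socket();
    try {
      s.connect(new InetSocketAddress(host, port), 5000); // 5 second timeout
      System.out.println("OK: reached " + host + ":" + port);
      s.close();
    } catch (Exception e) {
      System.out.println("FAILED: " + host + ":" + port + " (" + e.getMessage() + ")");
    }
  }
}

Run it from every node against every other node, and repeat for the other daemon ports your configuration uses (NameNode, DataNode, JobTracker).
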
Post by yang song
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
--
Ted Dunning, CTO
DeepDyve
Jason Venner
2009-08-20 06:59:06 UTC
Permalink
The number one cause of this is something that causes the connection used to
fetch a map output to fail. I have seen:
1) a firewall
2) misconfigured IP addresses (i.e. the tasktracker attempting the fetch
received an incorrect IP address when it looked up the name of the
tasktracker holding the map segment)
3) rarely, the HTTP server on the serving tasktracker is overloaded due to
insufficient threads or listen backlog; this can happen if the number of
fetches per reduce is large and the number of reduces or the number of maps
is very large (a rough estimate of this load is sketched below)

There are probably other cases. This recently happened to me when I had 6000
maps and 20 reducers on a 10-node cluster, which I believe was case 3 above.
Since I didn't actually need the reduce output (I got my summary data via
counters in the map phase), I never re-tuned the cluster.
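
To get a rough feel for case 3, compare the worst-case number of simultaneous fetches against the serving tasktracker's HTTP thread pool. The sketch below only does that arithmetic; the defaults it assumes (5 parallel copies per reducer, 40 threads from tasktracker.http.threads) should be checked against your own configuration.

public class FetchLoadEstimate {
  public static void main(String[] args) {
    int reducers = 20;          // numbers from the scenario above
    int tasktrackers = 10;
    int parallelCopies = 5;     // assumed mapred.reduce.parallel.copies default
    int httpThreads = 40;       // assumed tasktracker.http.threads default

    int fetchesInFlight = reducers * parallelCopies;
    double perNodeAverage = (double) fetchesInFlight / tasktrackers;
    System.out.println("cluster-wide fetches in flight: " + fetchesInFlight);
    System.out.println("average per tasktracker: " + perNodeAverage
        + " (HTTP threads available: " + httpThreads + ")");
    // The average can look harmless; trouble starts when fetches are not
    // spread evenly, e.g. many reducers piling onto the few nodes that have
    // finished map output, as Koji describes below.
  }
}
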
Post by Ted Dunning
I think the problem I am remembering was poor recovery from this kind of
failure. The underlying fault is likely poor connectivity between your
machines. Test that all members of your cluster can reach all the others on
all ports used by Hadoop.
See here for hints: http://markmail.org/message/lgafou6d434n2dvx
Post by yang song
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
--
Ted Dunning, CTO
DeepDyve
--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
Koji Noguchi
2009-08-20 17:14:17 UTC
Permalink
Probably unrelated to your problem, but one extreme case I've seen:
a user's job with large gzip inputs (non-splittable), 20 mappers and 800
reducers, where each map output was around 20G.
Too many reducers were hitting a single node as soon as a mapper finished.

I think we tried something like

mapred.reduce.parallel.copies=1
(to reduce the number of reducer copier threads)
mapred.reduce.slowstart.completed.maps=1.0
(so that reducers would have all 20 map outputs to pull from, instead of 800
reducers hitting one mapper node as soon as it finishes.)


Koji
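
If you want to try the same mitigation on a single job, here is a minimal sketch of setting those two properties from the Java side (assuming your version recognizes mapred.reduce.slowstart.completed.maps; whether it helps will depend on the job):

import org.apache.hadoop.mapred.JobConf;

public class ThrottledShuffle {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // One copier thread per reducer, as described above.
    conf.setInt("mapred.reduce.parallel.copies", 1);
    // Don't start any reducer until every map has finished.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
    // ... configure mapper, reducer and paths as usual, then submit the job.
  }
}
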
Post by Jason Venner
The number one cause of this is something that causes the connection used to
fetch a map output to fail. I have seen:
1) a firewall
2) misconfigured IP addresses (i.e. the tasktracker attempting the fetch
received an incorrect IP address when it looked up the name of the
tasktracker holding the map segment)
3) rarely, the HTTP server on the serving tasktracker is overloaded due to
insufficient threads or listen backlog; this can happen if the number of
fetches per reduce is large and the number of reduces or the number of maps
is very large
There are probably other cases. This recently happened to me when I had 6000
maps and 20 reducers on a 10-node cluster, which I believe was case 3 above.
Since I didn't actually need the reduce output (I got my summary data via
counters in the map phase), I never re-tuned the cluster.
Post by Ted Dunning
I think the problem I am remembering was poor recovery from this kind of
failure. The underlying fault is likely poor connectivity between your
machines. Test that all members of your cluster can reach all the others on
all ports used by Hadoop.
See here for hints: http://markmail.org/message/lgafou6d434n2dvx
Post by yang song
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
--
Ted Dunning, CTO
DeepDyve
Arun C Murthy
2009-08-19 16:31:21 UTC
Permalink
I'd dig around a bit more to check whether it's caused by a specific set of
nodes... i.e. are maps on specific tasktrackers failing in this manner?

Arun
Post by yang song
Hello, all
I have run into the "too many fetch failures" problem when I submit a big
job (e.g. tasks > 10000). I know this error occurs when several reducers
are unable to fetch a given map output, but I'm sure the slaves can
contact each other.
I'm puzzled and don't know how to deal with it. Maybe the network
transfer is bad, but how can I solve it? Would increasing
mapred.reduce.parallel.copies and mapred.reduce.copy.backoff make a
difference?
Thank you!
Inifok
谭东
2009-08-20 06:44:32 UTC
Permalink
The fewer reducers there are, the more data each reducer has to handle, the
more network traffic each reducer generates, and the more likely any one
reducer is to fail.
So increase the number of reducers and try again.
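
The number of reducers is a per-job setting. A minimal sketch, with a purely illustrative count (whether more reducers actually helps here is debatable, given the other replies pointing at connectivity):

import org.apache.hadoop.mapred.JobConf;

public class MoreReducers {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setNumReduceTasks(200); // illustrative count; this sets mapred.reduce.tasks
    // ... configure the rest of the job and submit as usual.
  }
}
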
Post by Ted Dunning
I think the problem I am remembering was poor recovery from this kind of
failure. The underlying fault is likely poor connectivity between your
machines. Test that all members of your cluster can reach all the others on
all ports used by Hadoop.
See here for hints: http://markmail.org/message/lgafou6d434n2dvx
Post by yang song
Thank you, Ted. Updating the current cluster would be a huge amount of work,
and we don't want to do that. Could you tell me in detail how 0.19.1 causes
these failures? Thanks again.
Post by Ted Dunning
I think I remember something about 19.1 in which certain failures would
cause this. Consider using an updated 19 or moving to 20 as well.
Post by yang song
I'm sorry, the version is 0.19.1
--
Ted Dunning, CTO
DeepDyve