Long-running application failed to init containers due to authentication errors

Paul Lam

2018-11-30 02:16:38 UTC

Hi,

I'm running Flink applications on YARN 2.6.0-cdh5.6.0 and have run into a problem. After running for a while (possibly longer than 7 days), an application might
need to scale up or recover from a node failure, but it is unable to allocate new containers. All the incoming containers fail to localize resources
and create log aggregation dirs for lack of credentials, so the Flink application never gets the requested containers. It seems that the credentials in the
container launch context somehow disappear.

This looks very similar to FLINK-6376[1] and YARN-2704[2], but both of them should have been fixed. The Flink AM gets the HDFS delegation token from
the client, puts it into the container launch context, and does not refresh it afterwards. But IMHO, if the token had expired, the exception should be "token expired"
or "token not found in cache", whereas what I actually get is "client cannot authenticate via [token, kerberos]".

This happens seemingly at random, and I have been struggling with it for a couple of days. Any help would be greatly appreciated. Thanks a lot!

[1] https://issues.apache.org/jira/browse/FLINK-6376
[2] https://issues.apache.org/jira/browse/YARN-2704

Best,
Paul Lam