In our Hadoop setup, we ended up having more than 1 million files in a single folder. The folder had so many files, that any hdfs dfs command like -ls, -copyToLocal on the files was giving following error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.Arrays.copyOf(Arrays.java:2367) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130) at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415) at java.lang.StringBuffer.append(StringBuffer.java:237) at java.net.URI.appendAuthority(URI.java:1852) at java.net.URI.appendSchemeSpecificPart(URI.java:1890) at java.net.URI.toString(URI.java:1922) at java.net.URI.<init>(URI.java:749) at org.apache.hadoop.fs.Path.initialize(Path.java:203) at org.apache.hadoop.fs.Path.<init>(Path.java:116) at org.apache.hadoop.fs.Path.<init>(Path.java:94) at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:230) at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.makeQualified(HdfsFileStatus.java:263) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:732) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105) at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755) at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751) at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268) at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347) at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:291) at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:308) at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278) at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:243) at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260) at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244) at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:220) at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190) at org.apache.hadoop.fs.shell.Command.run(Command.java:154) at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
After doing some research, we added following environment variable to update Hadoop runtime options.
export HADOOP_OPTS="-XX:-UseGCOverheadLimit"
Adding this option fixed the GC error, but started throwing the following error, citing the lack of Java Heap space.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1351) at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1413) at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1524) at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1533) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:557) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy15.getListing(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:724) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105) at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755) at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751) at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268) at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347) at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:291) at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:308) at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278) at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:243) at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260) at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244) at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:220) at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190) at org.apache.hadoop.fs.shell.Command.run(Command.java:154) at org.apache.hadoop.fs.FsShell.run(FsShell.java:287) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
We modified the above export, and tried following instead. Note that instead of HADOOP_OPTS, we needed to set HADOOP_CLIENT_OPTS fix this error. This was needed because all the hadoop commands run as a client. HADOOP_OPTS needs to be setup for modifying actual Hadoop run time, and HADOOP_CLIENT_OPTS is needed to be setup for modifying run time for Hadoop command line client.
export HADOOP_CLIENT_OPTS="-XX:-UseGCOverheadLimit -Xmx4096m"