
Tips and pitfalls when using Java’s ZipOutputStream

We were doing a bit of refactoring recently to try to speed up, and reduce the memory footprint of, a routine that was zipping a folder with 800,000+ files in it. It was taking forever and eating up a lot of memory.

The 3 main takeaways were:

  • Always wrap your input and output streams in buffered streams
  • Don’t hold large numbers of File objects in memory for extended periods of time if you can avoid it
  • Write 1024 bytes or more at a time to a ZipOutputStream for optimal performance; this held whether or not we were using a BufferedOutputStream

We started with something like this (a sketch reconstructed from the problems described below; names are illustrative):
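
// Rough reconstruction, not the exact original: one listFiles() call,
// unbuffered streams, and one byte written per call.
ZipOutputStream zipStream = new ZipOutputStream(new FileOutputStream(zipPath));
try
{
  for (File file : folderToZip.listFiles()) // loads every File object at once
  {
    FileInputStream inputStream = new FileInputStream(file); // unbuffered
    try
    {
      zipStream.putNextEntry(new ZipEntry(file.getName()));
      int b;
      while ((b = inputStream.read()) != -1)
        zipStream.write(b); // one byte per call -- slow
    }
    finally
    {
      zipStream.closeEntry();
      inputStream.close();
    }
  }
}
finally
{
  zipStream.close();
}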

BufferedStreams

You should always wrap the InputStream and OutputStream objects with BufferedInputStream and BufferedOutputStream unless you have a compelling reason not to. ’nuff said about that.

  ZipOutputStream zipStream = new ZipOutputStream(new BufferedOutputStream(new FileOutputStream(zipPath)));
    ...
      BufferedInputStream inputStream = new BufferedInputStream(new FileInputStream(fileAbsolutePath));

Memory

We noticed that when zipping the folder, we were holding 800,000 File objects in memory (about 400-450MB in total) as a result of a single call to File.listFiles(). While iteration was fast once all of the File objects were in memory, it was unnecessary to hold all of them during the entire zipping process.

The most memory-efficient algorithm used the FileVisitor interface from Java 7’s java.nio.file package and the Files.walkFileTree(Path start, FileVisitor<? super Path> visitor) method. This reduced the memory usage by about 375-425MB.
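
For reference, a minimal sketch of that approach, assuming a hypothetical addFileToZip helper that writes a single zip entry:

// Java 7+: uses java.nio.file.* and java.nio.file.attribute.BasicFileAttributes.
Files.walkFileTree(folderToZip.toPath(), new SimpleFileVisitor<Path>()
{
  @Override
  public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException
  {
    // Zip each file as it is visited instead of holding all of the
    // File objects in memory up front.
    addFileToZip(zipStream, file); // hypothetical helper that writes one entry
    return FileVisitResult.CONTINUE;
  }
});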

Unfortunately, we still had to support some Java 6 deployments, so we opted to load just the file names and use the combination of the folder’s absolute path and each file name to create a FileInputStream:

String[] files = folderToZip.list();
String folderAbsolutePath = folderToZip.getAbsolutePath();
try
{
  for (String file : files)
  {
    String fileAbsolutePath = StringUtil.concat(folderAbsolutePath, File.separatorChar, file);
    BufferedInputStream inputStream = new BufferedInputStream(new FileInputStream(fileAbsolutePath));

Speed

The optimization that surprised me was that writing a single byte at a time to the ZipOutputStream was really slow, even though we were buffering both the input and output streams. Even beyond the added method-call overhead, there seemed to be something inherently slow about calling zipStream.write(int) for each byte read from the input stream. So we settled on keeping another temporary 1K buffer in memory and writing to the ZipOutputStream in 1024-byte increments. This consistently took 90% less time when zipping 1,000 files, 10,000 files, and 800,000 files.

BufferedInputStream inputStream = new BufferedInputStream(new FileInputStream(fileAbsolutePath));
ZipEntry zipEntry = new ZipEntry(file);
try
{
  zipStream.putNextEntry(zipEntry);
  byte[] dataToWrite = new byte[1024];
  int length;
  while ((length = inputStream.read(dataToWrite)) > 0)
    zipStream.write(dataToWrite, 0, length);
...

Cleaning up

Finally, we made sure to wrap our calls to close the various streams in finally {…} blocks:

...
  try {
    for (String file : files) {
      ... create file stream ...
      try {
        ... add zip entry, create/read file stream, and write to zip stream ...
      }
      finally {
        zipStream.closeEntry();
        inputStream.close();
      }
    }
  }
  finally {
    zipStream.finish();
    zipStream.close();
  }
...

End game

We ended up with roughly this, stitching the snippets above together (StringUtil.concat is our internal helper):
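
ZipOutputStream zipStream = new ZipOutputStream(new BufferedOutputStream(new FileOutputStream(zipPath)));
try
{
  // Hold only the file names in memory, not File objects
  String[] files = folderToZip.list();
  String folderAbsolutePath = folderToZip.getAbsolutePath();
  for (String file : files)
  {
    String fileAbsolutePath = StringUtil.concat(folderAbsolutePath, File.separatorChar, file);
    BufferedInputStream inputStream = new BufferedInputStream(new FileInputStream(fileAbsolutePath));
    try
    {
      zipStream.putNextEntry(new ZipEntry(file));
      // Write in 1024-byte chunks rather than a byte at a time
      byte[] dataToWrite = new byte[1024];
      int length;
      while ((length = inputStream.read(dataToWrite)) > 0)
        zipStream.write(dataToWrite, 0, length);
    }
    finally
    {
      zipStream.closeEntry();
      inputStream.close();
    }
  }
}
finally
{
  zipStream.finish();
  zipStream.close();
}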

The moral of the story: be careful when zipping 800,000 files. What works for a few may not work for many 🙂

bradley
