Tips and pitfalls when using Java’s ZipOutputStream

We were doing a bit of refactoring recently to speed up and reduce the memory footprint of a routine that zipped a folder with 800,000+ files in it. It was taking forever and eating up a lot of memory.

The 3 main takeaways were:

  • Always wrap your input and output streams in buffered streams
  • Don’t hold large numbers of File objects in memory for extended periods of time if you can avoid it
  • Write 1024 bytes or more at a time to a ZipOutputStream for optimal performance. This held whether or not we were using a BufferedOutputStream.

We started with this:

BufferedStreams

You should always wrap the InputStream and OutputStream objects with BufferedInputStream and BufferedOutputStream unless you have a compelling reason not to. ’Nuff said about that.

  ZipOutputStream zipStream = new ZipOutputStream(new BufferedOutputStream(new FileOutputStream(zipPath)));
    ...
      BufferedInputStream inputStream = new BufferedInputStream(new FileInputStream(fileAbsolutePath));

Memory

We noticed that when zipping the folder, we were holding 800,000 File objects in memory, for a total of about 400-450MB, as a result of a single File.listFiles() call. While it was fast once all of the Files were in memory, it was unnecessary to hold all of them for the entire zipping process.

The most memory-efficient algorithm used the FileVisitor interface from Java 7’s java.nio.file package together with the Files.walkFileTree(Path start, FileVisitor<? super Path> visitor) method. This reduced memory usage by about 375-425MB.
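For reference, a minimal sketch of what the walkFileTree approach could look like on Java 7 or later. The class and method names (WalkAndZip, zipFolder) are made up for illustration and this is not the code we shipped. It assumes the zip file is written somewhere outside the folder being zipped.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
class WalkAndZip {

    // Zips every regular file under folderToZip into zipPath and returns the
    // number of entries written. Files are visited one at a time, so no full
    // listing of File objects is held in memory.
    static int zipFolder(final Path folderToZip, Path zipPath) throws IOException {
        final int[] count = {0};
        OutputStream out = new BufferedOutputStream(Files.newOutputStream(zipPath));
        final ZipOutputStream zipStream = new ZipOutputStream(out);
        try {
            Files.walkFileTree(folderToZip, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    // Zip entry names use '/' as the separator on every platform.
                    String name = folderToZip.relativize(file).toString()
                            .replace(File.separatorChar, '/');
                    zipStream.putNextEntry(new ZipEntry(name));
                    try {
                        // Files.copy moves the bytes in chunks, not one at a time.
                        Files.copy(file, zipStream);
                    } finally {
                        zipStream.closeEntry();
                    }
                    count[0]++;
                    return FileVisitResult.CONTINUE;
                }
            });
        } finally {
            zipStream.close();
        }
        return count[0];
    }
}
```

Calling WalkAndZip.zipFolder(Paths.get("/some/folder"), Paths.get("/tmp/out.zip")) would walk the folder recursively; both paths here are placeholders.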

Unfortunately, we still had to support some Java 6 deployments, so we opted to just load the file names and use the combination of the Folder’s absolute path and the file names to create a FileInputStream:

String[] files = folderToZip.list();
String folderAbsolutePath = folderToZip.getAbsolutePath();
try
{
  for (String file : files)
  {
    String fileAbsolutePath = StringUtil.concat(folderAbsolutePath, File.separatorChar, file);
    BufferedInputStream inputStream = new BufferedInputStream(new FileInputStream(fileAbsolutePath));

Speed

The optimization that surprised me was that writing a single byte at a time to the ZipOutputStream was really slow even though we were buffering both the input and output streams. Even beyond the added method-call overhead, there seemed to be something inherently slow about calling zipStream.write(int) for each byte read from the input stream. So we settled on keeping another temporary 1K buffer in memory and writing to the ZipOutputStream in 1024-byte increments. This consistently took 90% less time when zipping 1,000 files, 10,000 files, and 800,000 files.

BufferedInputStream inputStream = new BufferedInputStream(new FileInputStream(fileAbsolutePath));
ZipEntry zipEntry = new ZipEntry(file);
try
{
  zipStream.putNextEntry(zipEntry);
  byte[] dataToWrite = new byte[1024];
  int length;
  while ((length = inputStream.read(dataToWrite)) > 0)
    zipStream.write(dataToWrite, 0, length);
...
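Here is a self-contained sketch of the same chunked-write loop that can be run without touching the disk. The names ChunkedZipWrite, copyInChunks and zipSingleEntry are hypothetical. A likely reason the per-byte version is slow is that the BufferedOutputStream sits underneath the ZipOutputStream, so it only buffers the compressed output. Each write(int) still goes through the compression and checksum path on its own, while a 1K array hands over a whole chunk per call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
class ChunkedZipWrite {

    // Copies everything from in to zipStream through a reusable buffer and
    // returns the number of uncompressed bytes written.
    static long copyInChunks(InputStream in, ZipOutputStream zipStream, int bufferSize)
            throws IOException {
        byte[] dataToWrite = new byte[bufferSize];
        long total = 0;
        int length;
        // read() returns -1 at end of stream
        while ((length = in.read(dataToWrite)) != -1) {
            zipStream.write(dataToWrite, 0, length);
            total += length;
        }
        return total;
    }

    // Builds an in-memory zip archive holding a single entry.
    static byte[] zipSingleEntry(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        ZipOutputStream zipStream = new ZipOutputStream(archive);
        try {
            zipStream.putNextEntry(new ZipEntry(entryName));
            copyInChunks(new ByteArrayInputStream(content), zipStream, 1024);
            zipStream.closeEntry();
        } finally {
            zipStream.close();
        }
        return archive.toByteArray();
    }
}
```

The 1024 here mirrors the buffer size we settled on; larger buffers are worth measuring against your own data.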

Cleaning up

Finally, we made sure to wrap our calls to close the various streams in finally {…} blocks:

...
  try {
    for (String file : files) {
      ... create file stream ...
      try {
        ... add zip entry, create/read file stream, and write to zip stream ...
      }
      finally {
        zipStream.closeEntry();
        inputStream.close();
      }
    }
  }
  finally {
    zipStream.finish();
    zipStream.close();
  }
...
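If you are not stuck on Java 6, the same cleanup can be expressed with Java 7’s try-with-resources, which closes each stream even when an exception is thrown. The following is a sketch under that assumption, not what we deployed; ZipWithResources and zipFiles are hypothetical names, and the entries are named after the file names as in the snippets above.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
class ZipWithResources {

    // Zips the named files (direct children of folderToZip) into zipPath.
    static void zipFiles(File folderToZip, String[] files, File zipPath) throws IOException {
        // Closing the ZipOutputStream also finishes the archive and closes
        // the wrapped buffered and file streams.
        try (ZipOutputStream zipStream = new ZipOutputStream(
                new BufferedOutputStream(new FileOutputStream(zipPath)))) {
            byte[] dataToWrite = new byte[1024];
            for (String file : files) {
                // The input stream is closed at the end of each iteration.
                try (InputStream inputStream = new BufferedInputStream(
                        new FileInputStream(new File(folderToZip, file)))) {
                    zipStream.putNextEntry(new ZipEntry(file));
                    int length;
                    while ((length = inputStream.read(dataToWrite)) != -1) {
                        zipStream.write(dataToWrite, 0, length);
                    }
                    zipStream.closeEntry();
                }
            }
        }
    }
}
```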

End game

We ended up with this:

The moral of the story: be careful when zipping 800,000 files. What works for a few may not work for many 🙂

On development tools and upgrading to Rails 3.2.11 to apply security fixes

I was struck today by just how good software development tools have gotten. After reading about this recent security vulnerability in Rails (fixed in 3.2.11), I was able to upgrade Rails and deploy the changes in a few simple steps:
https://gist.github.com/4490354
This relied on:

Things have certainly gotten easier in the last few years.

How to approach changes in software

Arguably one of the most difficult things we do as software developers at my company, Hannon Hill, is to change existing functionality in our products. You’ve got an awesome customer base to whom you’ve pledged your undying loyalty. Every once in a while, you’re going to pull the rug out from under them by changing some piece of functionality that they rely on. It’s remarkable how much of our time we spend debating, making and ultimately supporting these kinds of changes in our products. But you should spend a lot of time wrestling with these decisions.

Here are a few guidelines and reminders when changing existing functionality in software (though this could probably apply to any type of business that sells a product or service):

  • Will the changes make life better for at least one of your internal stakeholders — support, engineering, sales, marketing, services? Will it reduce the number of support requests? Will it make the product easier to maintain? Easier to sell?
  • Will they help 80% of your current customer base? Of course, it should also make things better for future customers (that’s presumably why you’re changing it), but don’t forget the people that got you to where you are.
  • Will they negatively impact less than 5% of your current customers? SaaS makes this much easier to assess and conversely, installed software makes it very difficult. Have a rough plan of action of how you intend to address any problems that arise as a result of the change.
  • Communicate. Make sure you let people know what you changed: in your product release notes, in personal emails to customers, on Twitter, wherever you can! Be prepared to justify your decision. We usually have good reasons for changing things. Our customers are smart and it’s up to us to convince them that a change is for the better.
  • Not everyone will be happy. This is a fact of life and a fact of software too. Despite your best intentions and efforts to accommodate your customers, there will always be vocal detractors whenever anything changes. This is one of the toughest aspects of product management especially when you know many of your customers by name. Remember that very few changes to software can truly benefit all of your users. Be willing to talk openly about your decision, try not to lose too much sleep, and stay positive!

Software developers: how does your organization approach changes to its software? Software users: what could vendors do better when making changes to software that you use?

Product Mascots

I’ve been working on a cool, new product called Spectate which is an inbound marketing tool made by the awesome folks at Hannon Hill. A few people on our team realized something really important: every product needs a mascot. Meet Spectate’s mascot, the “Spectato”:

The Spectato

I love alliteration, which is probably why I’m such a fan. Thanks Syl Turner (@SylTurner) for the drawing!

What’s your product’s mascot?

Spring-enabled Quartz jobs: the right way!

So we’ve been using Spring and Quartz together at my company for a couple of years and I finally had one of those face-palm moments where I realized… we’d been doing it wrong. We needed to make Spring-managed beans accessible to our Quartz jobs that were being instantiated at runtime by the Quartz scheduler. Our solution was to create a “holder bean” that provided static access to our Spring-managed services. It always seemed wrong and I was sure Spring had to have a more elegant way of doing it, but I couldn’t find it specifically mentioned in their documentation about Quartz integration. Thanks Alex Marshall for this post, which prompted me to read the JavaDocs of the SchedulerFactoryBean and figure out how to pass Spring beans to Jobs not with the JobDetail (which in our case is serialized to a database) but with the Scheduler context, using SchedulerFactoryBean’s “schedulerContextAsMap” property. So with a quick addition to the SchedulerFactoryBean configuration in our Spring applicationContext file:

<bean id="sched"
  class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
  <property name="schedulerContextAsMap">
    <map>
      <entry key="<springBeanName>" value-ref="<springBean>" />
    </map>
  </property>
</bean>

we can now access this Spring bean from a Quartz job instantiated by the Scheduler at runtime:

/* (non-Javadoc)
 * @see org.quartz.Job#execute(org.quartz.JobExecutionContext)
 */
public void execute(JobExecutionContext context) throws JobExecutionException 
{
  try {
    // The Scheduler context holds the entries from "schedulerContextAsMap"
    SchedulerContext cxt = context.getScheduler().getContext();
    // SpringBean stands in for your bean's type; the key is the map entry key
    SpringBean bean = (SpringBean) cxt.get("beanName");
    ...do stuff with Spring bean...
  } catch (SchedulerException e) {
    throw new JobExecutionException(e);
  }
}

You can do this more holistically by adding the entire Spring ApplicationContext object to the Scheduler context using the SchedulerFactoryBean’s “applicationContextSchedulerContextKey” property (the setter is setApplicationContextSchedulerContextKey). That’s a mouthful!