Parsing Variant Call Format (VCF) Files

Hello,

With the advent of the NGS collection, we have had a few requests to handle files in the Variant Call Format (VCF) (http://www.1000genomes.org/node/101). The NGS developers are well aware of this request, and it is on their radar. But until a beautiful, fully tested reader is available in a released version of Pipeline Pilot, I took a stab at it, and the attached protocol is the result. The protocol has the following features:

  • Data in the Info column is broken out into separate properties; the property names come from the Info column itself.
  • Data in the sample column(s) is broken out into separate properties; the property names come from the Format column and the sample name.
  • Manages the variable number of comment lines
    • I tried two methods for this: one uses only PilotScript, the other uses a reader/writer combination. In my testing, the reader/writer combination runs faster.
  • Takes into account that a sample property does not always have a value for every field listed in the Format property.
  • Provides different names for the DP value in the Info property and the Format property, as they represent different values.
  • Handles multiple Sample properties.
  • Provides a report if any alternative values are indicated in the comment lines.
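
For anyone who wants to see the splitting logic outside Pipeline Pilot, here is a minimal Python sketch of the steps listed above. It is not the attached protocol, only an illustration under a few assumptions: the file is tab-delimited with a single "#CHROM" header row, and the function name parse_vcf and the "INFO_" and "<sample>_" property prefixes are hypothetical names I picked so that the Info DP and the per-sample DP end up in different properties.

```python
def parse_vcf(lines):
    """Yield one dict per data row from an iterable of VCF text lines."""
    columns = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blanks and the variable number of comment lines
        if line.startswith("#"):
            # Header row: fixed column names followed by the sample names
            columns = line[1:].split("\t")
            continue
        if columns is None:
            raise ValueError("data row found before the #CHROM header row")

        fields = line.split("\t")
        # CHROM, POS, ID, REF, ALT, QUAL, FILTER are copied as-is
        record = dict(zip(columns[:7], fields[:7]))

        # Info column: "key=value" pairs or bare flags, separated by ";"
        for item in fields[7].split(";"):
            if item in ("", "."):
                continue
            key, sep, value = item.partition("=")
            record["INFO_" + key] = value if sep else True

        # Sample columns: keys come from the Format column; a sample may
        # carry fewer values than Format lists, so pad the rest with None
        if len(fields) > 9:
            format_keys = fields[8].split(":")
            for sample_name, raw in zip(columns[9:], fields[9:]):
                values = raw.split(":")
                for i, key in enumerate(format_keys):
                    value = values[i] if i < len(values) else None
                    record[sample_name + "_" + key] = value
        yield record


# Small made-up example with two samples; the second sample has only GT
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\tGT:GQ:DP\t0|0:48:1\t1|0",
]
records = list(parse_vcf(example))
```

Prefixing every sample property with the sample name is also what lets the same loop handle any number of sample columns without name clashes.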

Hope this is helpful.

Jeannine