NGless Miscellania [5/5]

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the last in a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous [this post]

NGless has a few not-so-visible details that can come in handy.

Local installation

ngless relies on a few third-party utilities (bwa and samtools, besides any other modules you install) and possibly on reference information. However, it requires neither (1) a superuser install nor (2) fiddling with PATH variables or the like. It is happy to install its data into your home directory and run from there.

You can also install it globally, of course, but in many academic settings, you need to ask permission to install a package globally, while you can do whatever you want in your home directory. NGless is designed with this in mind.

On the fly QC (quality control)

All FastQ files are automatically passed through a QC analysis when you load them and again after any preprocessing step. You do not need to specify QC as a separate step; it just happens. In fact, if possible, ngless will run it on the fly for efficiency reasons.

Best practices should be easy, and QC is a best practice.
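
To illustrate what "on the fly" means here, the following is a minimal Python sketch of the idea, not ngless's actual implementation. It accumulates a few basic statistics while the reads are being streamed to the next step, so no separate pass over the file is needed. The names QCStats and read_fastq_with_qc are hypothetical, and the sketch assumes simple four-line FastQ records with Phred+33 quality encoding.

```python
import io
from dataclasses import dataclass


@dataclass
class QCStats:
    """Running statistics, updated as reads stream by (hypothetical helper)."""
    n_reads: int = 0
    n_bases: int = 0
    quality_sum: int = 0

    @property
    def mean_quality(self):
        return self.quality_sum / self.n_bases if self.n_bases else 0.0


def read_fastq_with_qc(handle, stats, phred_offset=33):
    """Yield (name, sequence, qualities) records while updating `stats`.

    Assumes four-line FastQ records and Phred+33 encoding (an assumption
    of this sketch, not something stated about ngless).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        # QC bookkeeping happens in the same pass that delivers the read.
        stats.n_reads += 1
        stats.n_bases += len(seq)
        stats.quality_sum += sum(ord(c) - phred_offset for c in qual)
        yield header[1:], seq, qual


example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n##\n")
stats = QCStats()
reads = list(read_fastq_with_qc(example, stats))
```

Whatever consumes the generator gets the reads as usual, and the statistics are complete once the last read has been consumed.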

Subsample mode

Subsample mode simply throws away 99% of the data.

Why would anyone ever want to do this?

This lets you quickly check that your pipeline runs and that the output files look as expected. For example:

ngless --subsample script.ngl

will run script.ngl in subsample mode, which will probably run much faster than the full pipeline, allowing you to quickly spot any issues with your code. A 10-hour pipeline will finish in a few minutes when running in subsample mode.
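
How ngless selects the reads it keeps is not described in this post. As a rough illustration of the idea only, one could keep each read independently with probability 0.01. The function name subsample and its parameters are hypothetical.

```python
import random


def subsample(reads, keep_fraction=0.01, seed=0):
    """Lazily yield roughly `keep_fraction` of the items in `reads`.

    Illustrative only: this is one simple way to discard about 99% of
    the data, not necessarily the strategy ngless uses.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs keep the same reads
    for read in reads:
        if rng.random() < keep_fraction:
            yield read


# Stand-in for a stream of 100,000 reads.
kept = list(subsample(range(100_000)))
```

Because the selection is done lazily, the rest of the pipeline only ever sees the small fraction that survives, which is where the speed-up comes from.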

Subsample mode also changes all your write() calls so that the output files include the subsample extension. That is, a call such as

write(output, ofile='results.txt')

will automatically get rewritten to

write(output, ofile='results.txt.subsample')

This ensures that you do not confuse subsampled results with the real thing. NGless is all about making sure your results are correct, so it tries to avoid confusing you as much as possible (this is similar to how it always writes output files with the atomic protocol so that you never get a partial results file).
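
Both safeguards are easy to picture in a few lines of Python. This is a sketch under assumptions, not ngless's code: the helper names output_name and atomic_write are hypothetical, and "atomic" is taken here to mean the common pattern of writing to a temporary file and then renaming it into place.

```python
import os
import tempfile


def output_name(ofile, subsample=False):
    """Append '.subsample' to the output name when in subsample mode."""
    return ofile + ".subsample" if subsample else ofile


def atomic_write(path, text):
    """Write to a temporary file in the same directory, then rename it.

    os.replace is atomic when source and target are on the same
    filesystem, so a reader sees either no file (or the old one) or the
    complete new one, never a partial results file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # do not leave a stray temporary file behind
        raise
```

With these helpers, atomic_write(output_name('results.txt', subsample=True), data) would write the complete contents to results.txt.subsample in one step.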

Parallel processing & speed

The main goal of ngless is to save bioinformaticians time while improving the results. However, as a side benefit of having a well-defined language, the interpreter can take automatic advantage of multiple processors.

Consider the following script:

ngless '0.0'

input = fastq('input.fq.gz')
preprocess(input) using |r|:
    r = substrim(r, min_quality=45)
    if len(r) < 45:
        discard
mapped = map(input, reference='hg19')
counted = count(mapped, features=['gene'])
write(counted, ofile='genes.txt')

Almost all the steps in the pipeline can take advantage of multiple processors:

  1. QC is performed on the fly as the file ‘input.fq.gz’ is being read.
  2. preprocess takes advantage of multiple processors by processing reads in parallel.
  3. map calls bwa, which makes use of threads.
  4. count again processes the output of mapping in parallel.
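
The preprocess step parallelizes well because each read is handled independently of all the others. The Python sketch below shows the idea: a substrim-like function that keeps the longest stretch of bases at or above a quality threshold, applied to every read by a pool of workers. It is an illustration, not ngless's implementation. The function names, the thresholds and the toy reads are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def substrim(seq, quals, min_quality):
    """Return the longest stretch where every base has quality >= min_quality."""
    best_start, best_len = 0, 0
    start = 0
    for i, q in enumerate(quals):
        if q < min_quality:
            start = i + 1  # a low-quality base ends the current stretch
        elif i + 1 - start > best_len:
            best_start, best_len = start, i + 1 - start
    end = best_start + best_len
    return seq[best_start:end], quals[best_start:end]


def preprocess_one(read, min_quality=20, min_length=3):
    """Trim one read; return None to signal that it should be discarded."""
    seq, quals = substrim(*read, min_quality)
    return (seq, quals) if len(seq) >= min_length else None


def preprocess_parallel(reads, n_workers=4):
    """Apply preprocess_one to every read with a worker pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(preprocess_one, reads)
        return [r for r in results if r is not None]


# Toy reads as (sequence, per-base qualities) pairs.
reads = [
    ("ACGTACGT", [2, 30, 30, 30, 30, 5, 30, 30]),
    ("ACGT", [2, 2, 30, 2]),
]
kept = preprocess_parallel(reads)
```

Threads keep the sketch short. In CPython, pure-Python work like this would need separate processes to gain real speed from multiple cores; the point here is only that no read depends on any other, so the work can be split freely.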

To use more than one core in ngless, just use the option -j with the number of threads you want. For example:

ngless -j8 pipeline.ngl

will run with 8 cores, speeding up the processing considerably.

Source: Meta Rabbit
