What is BMap?

BMap is a program that efficiently maps whole-genome (WG-) and targeted (TG-) bisulfite sequencing (BS) reads to a reference genome. It is especially useful for reads obtained using post-bisulfite adaptor tagging (PBAT) (Miura F. et al., 2012).

A series of programs that support primary analysis of both WGBS and TGBS data using BMap is available from here.


References

  • Miura F, Enomoto Y, Dairiki R, Ito T. Amplification-free whole-genome bisulfite sequencing by post-bisulfite adaptor tagging. Nucleic Acids Res. (2012), 40, e136
  • Miura F, Ito T. BMap: A mapper adapted to whole-genome bisulfite sequencing by post-bisulfite adaptor tagging. In preparation

How to get BMap

You can use BMap either by compiling from its source code or by downloading pre-compiled executables.


Source code

Because BMap is written in C++ with the Qt library, you can compile and use it on any platform supported by Qt. The source code is available from here.

Once the Qt library is correctly installed on your system, compiling the source code is straightforward. Extract the source code to an appropriate directory, change the working directory to that directory, and type the following commands.

>qmake BMap.pro
>make (or nmake for Visual Studio on Windows)

When the commands have finished, the executable file is in the directory named "bin", located at the same level as the directory into which you extracted the source code.


Executable binary files

We have tested BMap on Windows, Macintosh and CentOS. For your convenience, executable files for these three platforms can be downloaded from the following links. We provide only 64-bit versions because 32-bit versions are limited to species with small genomes and cannot be used for larger genomes such as human and mouse. If you need a 32-bit version, please compile it from the source code.

Platform compiled (Qt version)        Link
Windows, 64-bit (Qt 5.1)              BMap-v1.0-win64.zip
Macintosh, 64-bit (Qt 5.1)            BMap-v1.0-mac64.tar.gz
Linux CentOS 5, 64-bit (Qt 4.7)       BMap-v1.0-linux-centos64.tar.gz


Environment for running BMap

We usually run BMap on computers with the following specs.

CPU              Intel Core i7 (4-core, 8-thread), 3.06 GHz
RAM              24 Gbyte
HDD              SATA 500 Gbyte
Linux version    CentOS 4.10

With default parameters, BMap requires about 20 Gbytes of memory to map reads to the human genome. Assigning more memory makes BMap run faster.

 

How to use BMap

First, prepare an index for your reference genome. Then you can run BMap to map your reads.


Quick instructions

1. Create or download an appropriate index for your reference genome.

>BMap -create -species human -revision hg19 -reference ./path/to/chr*.fa

2. Map your reads.

For single-end reads, use the options -fastq, -qseq and/or -fasta to specify query files.

>BMap -species human -revision hg19 -fastq path_to_read.fastq

For paired-end reads, use the option -pfastq to specify query files.

>BMap -species human -revision hg19 -pfastq path_to_read1.fastq path_to_read2.fastq

Usage (Detailed version)

1. Create an index file for your reference

First, prepare an index file for your reference genome as follows.

>BMap -create -species human -revision hg19 -reference chr1.fa chr2.fa chr3.fa

The option "-create" instructs BMap to construct an index structure.

Specify your reference genome with the "-reference" option. The reference file must be in FASTA format, and multiple files can be specified, as in "-reference chr1.fa chr2.fa". You can also use a wildcard to specify multiple files, as in "-reference chr*.fa".

The options "-species" and "-revision" specify the species name and the revision of the genome sequence, respectively. They also determine the directory names and the file name under which the index structure is stored. By default, BMap stores index files in the "Revisions" directory, which is located in the same directory as the executable file.

In this step, BMap creates suffix arrays for two converted copies of the reference genome sequence: one generated by converting every C to T (C2T) and the other by converting every G to A (G2A). The suffix arrays are built with an algorithm termed "induced sorting" (Nong G. et al., 2011), which constructs the index in time linear in the length of the input sequence. After creating the suffix arrays, BMap constructs the Burrows-Wheeler Transform (BWT) of the C2T and G2A sequences, which is required to build the FM-index (Ferragina P. & Manzini G., 2001).

References

  • Nong G, Zhang S, Chan WH. Two Efficient Algorithms for Linear Time Suffix Array Construction. IEEE Transactions on Computers (2011), 60, 1471-1484
  • Ferragina P, Manzini G. An experimental study of an opportunistic index. In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA) (2001), 269-278
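
The index construction described above can be illustrated with a minimal Python sketch. It is not BMap's implementation: the suffix array is built here by naively sorting suffixes instead of by induced sorting, so it is only suitable for toy inputs. The function names, the "$" sentinel and the example sequence are assumptions chosen for illustration.

```python
from typing import List


def c2t(seq: str) -> str:
    """Return the C2T copy of a sequence: every C is replaced by T."""
    return seq.upper().replace("C", "T")


def g2a(seq: str) -> str:
    """Return the G2A copy of a sequence: every G is replaced by A."""
    return seq.upper().replace("G", "A")


def suffix_array(text: str) -> List[int]:
    """Naive suffix array: start positions of all suffixes in sorted order.

    Assumption: this simple sort stands in for induced sorting, which
    produces the same array in linear time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text: str) -> str:
    """Burrows-Wheeler Transform derived from the suffix array.

    The text must end with a sentinel ("$" here, an assumption) that is
    smaller than every other character. For suffix position 0 the
    preceding character wraps around to the sentinel (text[-1]).
    """
    return "".join(text[i - 1] for i in suffix_array(text))


if __name__ == "__main__":
    reference = "ACGTCG"  # toy reference, hypothetical
    for name, convert in (("C2T", c2t), ("G2A", g2a)):
        converted = convert(reference) + "$"
        print(name, converted, suffix_array(converted), bwt(converted))
```

Because both converted copies use a three-letter alphabet, a bisulfite-converted read can be searched for in them without regard to the methylation state of its cytosines.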

2. Map your reads

After the index file has been created, you can map reads with BMap. BMap accepts FASTA, FASTQ and Illumina qseq files as query files for single-end reads. For paired-end reads, only pairs of FASTQ files are accepted. The simplest commands for single-end and paired-end reads are as follows.

For single-end reads

>BMap -species human -revision hg19 -fastq path_to_read.fastq

For paired-end reads

>BMap -species human -revision hg19 -pfastq path_to_read1.fastq path_to_read2.fastq

Options

1. Specifying reference genome or index file

Option         Value type   Description
-species       string       Specify species name. Use in combination with option "-revision".
-revision      string       Specify revision name. Use in combination with option "-species".
-index         string       Specify path to index file. Mutually exclusive with the combination of options "-species" and "-revision".
-root_dir      string       Specify the path to the root of the index directory tree. (default="Revisions")
-binseq        string       Specify path to binseq file.

2. Specifying number of threads used for mapping

Option         Value type   Description
-thread        integer      Specify number of CPU cores or threads used in mapping mode. (default=1)

3. Controlling sensitivity and speed of mapping

Option         Value type   Description
-seedoffset    integer      Specify offset position in the read at which the first seed is taken. (default=1)
-seedstep      integer      Specify step width between seeds. (default=1)
-seediterate   integer      Specify number of repeats for taking seeds. (default=100)
-minseedlen    integer      Specify minimum length of seed. (default=16)
-maxseedlen    integer      Specify maximum length of seed. (default=26)
-minseedcount  integer      Specify minimum seed count for a candidate genomic region to be considered for verification of alignment with the read.
-maxseedcount  integer      Specify maximum seed count to be considered for verification of alignment with the read.

4. Controlling stringency of alignment

Option         Value type   Description
-minscore      integer      Specify minimum score for an alignment to be reported. A match scores 5 points and a mismatch scores -30 points. A C in the reference aligned to a T in the read is counted as a match. (default=150)
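
As a worked illustration of the scoring rule above, the following Python sketch scores an ungapped alignment between a reference segment and a read of equal length. It is a sketch under stated assumptions, not BMap's code: the function name is hypothetical, gaps are not modelled, and an alignment is treated as reportable when its score is greater than or equal to the threshold.

```python
MATCH_SCORE = 5          # points per match, as documented for -minscore
MISMATCH_SCORE = -30     # points per mismatch, as documented for -minscore
DEFAULT_MIN_SCORE = 150  # documented default of -minscore


def bisulfite_score(reference: str, read: str) -> int:
    """Score an ungapped alignment of two equal-length sequences.

    A C in the reference aligned to a T in the read counts as a match,
    because bisulfite treatment converts unmethylated C to T.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have the same length")
    score = 0
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        is_match = ref_base == read_base or (ref_base == "C" and read_base == "T")
        score += MATCH_SCORE if is_match else MISMATCH_SCORE
    return score


def is_reported(score: int, min_score: int = DEFAULT_MIN_SCORE) -> bool:
    """Assumption: an alignment is reported when score >= min_score."""
    return score >= min_score


if __name__ == "__main__":
    reference = "ACGT" * 10                 # 40 bases, hypothetical
    read = reference.replace("C", "T")      # fully converted read
    score = bisulfite_score(reference, read)
    print(score, is_reported(score))
```

Under these assumptions, a perfectly matching read needs at least 30 aligned bases to reach the default threshold (30 x 5 = 150), and each mismatch must be offset by seven additional matches (one mismatch costs the 5 points it would have earned plus 30 penalty points).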

5. Selection of type of index structure and memory usage

Option         Value type   Description
-bwt           integer      Select the index structure according to the value:
  • 0: Use suffix array.
  • 1: Use FM-index with -SCF 4 -SCL 16
  • 2: Use FM-index with -SCF 8 -SCL 16
  • 3: Use FM-index with -SCF 16 -SCL 16
  • 4: Use FM-index with -SCF 32 -SCL 32
  • 5: Use FM-index with -SCF 64 -SCL 64
  • 6: Use FM-index with -SCF 128 -SCL 128

6. Manipulating I/O

Option         Value type   Description
-reference     string(s)    Specify input file name(s) for creating the index.
-fasta         string       Specify input file in FASTA format. Only available for single-end sequencing reads.
-fastq         string(s)    Specify input file(s) in FASTQ format. Only available for single-end sequencing reads.
-qseq          string(s)    Specify input file(s) in qseq format. Only available for single-end sequencing reads.
-pfastq        two strings  Specify the pair of input FASTQ files for paired-end sequencing reads.
-bisulalign    string       Specify output path for bisulalign file. (default=bmapresult.bisulalign)
-sum           string       Specify output path for sum file. (default=bmapresult.sum)

File formats

1. Summary file (extension=.sum)

The summary file is a tab-delimited text file that contains the mapping result for each read. Depending on the query sequence(s), it contains the following information.

For single-end reads
Column number Descriptions
1 Clone name
2 Number of genomic positions mapped with the highest score
3 Read length
4 Name for chromosome mapped
5 Start coordinate of alignment in reference
6 End coordinate of alignment in reference
7 Start position of alignment in read
8 End position of alignment in read
9 Alignment type
  • 1: CtoT converted read is mapped on CtoT converted genome
  • 2: GtoA converted read is mapped on CtoT converted genome
  • 3: CtoT converted read is mapped on GtoA converted genome
  • 4: GtoA converted read is mapped on GtoA converted genome
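
The single-end column layout above can be read with a short Python sketch such as the following. The record type, field names and example line are hypothetical, and the sketch assumes that every line carries all nine tab-separated columns with integer values in the numeric columns; how BMap writes unmapped reads is not covered here.

```python
from typing import NamedTuple

# Descriptions of the alignment type codes listed in the table above.
ALIGNMENT_TYPES = {
    1: "CtoT converted read mapped on CtoT converted genome",
    2: "GtoA converted read mapped on CtoT converted genome",
    3: "CtoT converted read mapped on GtoA converted genome",
    4: "GtoA converted read mapped on GtoA converted genome",
}


class SumRecord(NamedTuple):
    """One line of a single-end .sum file (field names are assumptions)."""
    clone_name: str
    best_position_count: int
    read_length: int
    chromosome: str
    ref_start: int
    ref_end: int
    read_start: int
    read_end: int
    alignment_type: int


def parse_sum_line(line: str) -> SumRecord:
    """Split one tab-delimited line into the nine documented columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    return SumRecord(
        fields[0],
        int(fields[1]),
        int(fields[2]),
        fields[3],
        *(int(value) for value in fields[4:9]),
    )


if __name__ == "__main__":
    # Hypothetical line, not taken from real BMap output.
    example = "read_1\t1\t100\tchr1\t1000\t1099\t1\t100\t1\n"
    record = parse_sum_line(example)
    print(record.chromosome, ALIGNMENT_TYPES[record.alignment_type])
```

Reads whose second column is greater than 1 map to several positions with the same highest score, so a downstream filter on that field can keep uniquely mapped reads only.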

For paired-end reads
Column number Descriptions
1 Clone name
2 Number of genomic positions mapped with the highest score (read 1)
3 Read length (read 1)
4 Name for chromosome mapped (read 1)
5 Start coordinate of alignment in reference (read 1)
6 End coordinate of alignment in reference (read 1)
7 Start position of alignment in read (read 1)
8 End position of alignment in read (read 1)
9 Alignment type (read 1)
  • 1: CtoT converted read is mapped on CtoT converted genome
  • 2: GtoA converted read is mapped on CtoT converted genome
  • 3: CtoT converted read is mapped on GtoA converted genome
  • 4: GtoA converted read is mapped on GtoA converted genome
10 Number of genomic positions mapped with highest score (read 2)
11 Read length (read 2)
12 Name for chromosome mapped (read 2)
13 Start coordinate of alignment in reference (read 2)
14 End coordinate of alignment in reference (read 2)
15 Start position of alignment in read (read 2)
16 End position of alignment in read (read 2)
17 Alignment type (read 2)
  • 1: CtoT converted read is mapped on CtoT converted genome
  • 2: GtoA converted read is mapped on CtoT converted genome
  • 3: CtoT converted read is mapped on GtoA converted genome
  • 4: GtoA converted read is mapped on GtoA converted genome

2. Bisulalign file (extension=.bisulalign)

The alignment file is a text file that contains the best alignment for each query. Each alignment is composed of 11 lines.

For single-end reads
Line number Descriptions
1 Query name (Start with ">")
2 Chromosome name
3 Start coordinate of query
4 End coordinate of query
5 Start coordinate of reference
6 End coordinate of reference
7 Alignment type
  • 1: CtoT converted read is mapped on CtoT converted genome
  • 2: CtoT converted read is mapped on GtoA converted genome
  • 3: GtoA converted read is mapped on CtoT converted genome
  • 4: GtoA converted read is mapped on GtoA converted genome
8 Query sequence
9 Annotation
10 Reference sequence
11 Blank line
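
A file in this layout can be split into records by grouping its lines in blocks of 11, as in the following Python sketch. The field names, the reader function and the example content are hypothetical, all values are kept as strings, and the sketch assumes that every record follows the 11-line layout listed above.

```python
from typing import Dict, Iterable, Iterator, List

LINES_PER_RECORD = 11  # ten content lines plus one blank line

# Hypothetical names for the ten content lines, in documented order.
FIELD_NAMES = [
    "query_name", "chromosome", "query_start", "query_end",
    "ref_start", "ref_end", "alignment_type",
    "query_sequence", "annotation", "reference_sequence",
]


def read_bisulalign(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dictionary per 11-line alignment record."""
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == LINES_PER_RECORD:
            if not block[0].startswith(">"):
                raise ValueError("record does not start with '>'")
            # zip stops after the ten names, dropping the blank line.
            yield dict(zip(FIELD_NAMES, block))
            block = []
    if any(item.strip() for item in block):
        raise ValueError("incomplete record at end of input")


if __name__ == "__main__":
    # Hypothetical record, not taken from real BMap output.
    example = [
        ">read_1", "chr1", "1", "8", "1000", "1007", "1",
        "ATGTTGAT", "||||||||", "ACGTCGAT", "",
    ]
    for record in read_bisulalign(example):
        print(record["query_name"], record["chromosome"], record["ref_start"])
```

To read a real file, pass an open file object to read_bisulalign, for example the default output bmapresult.bisulalign.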