Library QC

At step Removing PCR duplicates you used the flag –output-stats, generating a stats file in addition to the pairsam output (e.g. –output-stats stats.txt). The stats file is an extensive output of pairs statistics as calculated by pairtools, including total reads, total mapped, total dups, total pairs for each pair of chromosomes etc’. Although you can use directly the pairtools stats file as is to get informed on the quality of the Micro-C library, we find it easier to focus on a few key metrics. We include in this repository the script get_qc.py that summarize the paired-tools stats file and present them in percentage values in addition to absolute values.

The images below explains how the values on the QC report are calculated:

Command:

python3 ./Micro-C/get_qc.py -p <stats.txt>

Example:

python3 ./Micro-C/get_qc.py -p stats.txt

After the script completes, it will print:

Total Read Pairs                              2,000,000  100%
Unmapped Read Pairs                           92,059     4.6%
Mapped Read Pairs                             1,637,655  81.88%
PCR Dup Read Pairs                            5,426      0.27%
No-Dup Read Pairs                             1,632,229  81.61%
No-Dup Cis Read Pairs                         1,288,943  78.97%
No-Dup Trans Read Pairs                       343,286    21.03%
No-Dup Valid Read Pairs (cis >= 1kb + trans)  1,482,597  90.83%
No-Dup Cis Read Pairs < 1kb                   149,632    9.17%
No-Dup Cis Read Pairs >= 1kb                  1,139,311  69.8%
No-Dup Cis Read Pairs >= 10kb                 870,490    53.33%

Library Complexity

If you preformed a shallow sequencing experiment (e.g. 2M reads) and running a QC analysis to decide which library to use for deep sequencing (DS), it is recommended to evaluate the complexity of the library before moving to DS.

The lc_extrap utility of the preseq package aims to predict the complexity of sequencing libraries.

preseq options:

Parameter	Value	Function
bam		Specifies that the input file type is bam. Please note that for a bam file to be a recognized input file htslib sould be installed as well and preseq should be built with htslib support (for more details see preseq documentation or our installDep.sh script as example)
pe		Specifies that paired end data is used
extrap	2.10E+09	Maximum extrapolation
step	1.00E+08	Extrapolation step size
seg_len	1000000000	maximum segment length when merging paired end bam
output		output file

Please note that the input bam file should be a version prior to dups removal.

preseq lc_extrap command example for extrapolating library complexity:

Command:

preseq lc_extrap -bam -pe -extrap 2.1e9 -step 1e8 -seg_len 1000000000 -output <output file> <input bam file>

Example:

preseq lc_extrap -bam -pe -extrap 2.1e9 -step 1e8 -seg_len 1000000000 -output out.preseq mapped.PT.bam

In this example the output file out.preseq will detail the extrapolated complexity curve of your library, with the number of reads in the first column and the expected distinct read value in the second column. For a typical experiment (human sample) check the expected complexity at 300M reads (to show the content of the file, type cat out.preseq). Expected unique pairs at 300M sequencing is at least ~ 120 million.