Variant filtering
Our QC pipeline first reads in the .vcf file, splits multiallelic sites, and removes sites with more than 6 alleles. The .vcf file contains 29,911,479 variants; after splitting multiallelics and restricting to sites with fewer than 7 alleles, 37,344,246 variants remain.
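The allele-count restriction and the multiallelic split can be illustrated with a minimal pure-Python sketch. It is not the pipeline's actual code: the `Site` record and `filter_and_split` function are hypothetical names, the reference allele is assumed to count towards the allele total, and no left-normalisation of the split alleles is attempted. It shows why the variant count grows after splitting: each retained site contributes one record per alternate allele.

```python
from dataclasses import dataclass

MAX_ALLELES = 6  # sites with more than 6 alleles are dropped


@dataclass(frozen=True)
class Site:
    """Hypothetical minimal stand-in for a VCF record."""
    chrom: str
    pos: int
    ref: str
    alts: tuple  # alternate alleles at this site


def filter_and_split(sites, max_alleles=MAX_ALLELES):
    """Drop sites with too many alleles, then emit one biallelic record per ALT."""
    out = []
    for site in sites:
        n_alleles = 1 + len(site.alts)  # assumption: REF counts as an allele
        if n_alleles > max_alleles:
            continue
        for alt in site.alts:
            out.append(Site(site.chrom, site.pos, site.ref, (alt,)))
    return out


sites = [
    Site("1", 100, "A", ("C",)),                                 # biallelic, kept as is
    Site("1", 200, "G", ("A", "T")),                             # split into two records
    Site("1", 300, "T", ("A", "C", "G", "TA", "TC", "TG")),      # 7 alleles, dropped
]
split_sites = filter_and_split(sites)
```

In this toy input, two retained sites become three biallelic records, which mirrors how 29,911,479 input variants can become 37,344,246 after splitting.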
| Filter | Variants | % |
|---|---|---|
| Variants with < 7 alleles | 37,344,246 | 100.0 |
| Failing VQSR | 100,742 | 0.3 |
| In LCRs | 1,215,218 | 3.3 |
| Outside padded target interval | 27,119,165 | 72.6 |
| Invariant sites after initial variant and genotype filters | 3,117,961 | 8.3 |
| Invariant sites after sample filters | 1,051,421 | 2.8 |
| Overall variant call rate < 0.97 | 737,072 | 2.0 |
| Overall variant case call rate < 0.97 | 716,709 | 1.9 |
| Overall variant control call rate < 0.97 | 743,659 | 2.0 |
| Difference between case and control variant call rate < 0.02 | 232,341 | 0.6 |
| Variants failing HWE filter | 1,083,479 | 2.9 |
| Variants remaining after all filters | 5,104,759 | 13.7 |
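The call-rate filters in the table above can be sketched per variant as follows. This is an illustration under stated assumptions rather than the pipeline's implementation: the function names are hypothetical, the overall call rate is computed from cases and controls only, and the direction of the case/control difference filter is an assumption (variants are flagged when the absolute difference exceeds 0.02), since the table label does not make the direction explicit.

```python
def call_rate(n_called, n_total):
    """Fraction of samples with a non-missing genotype at a variant."""
    return n_called / n_total if n_total else 0.0


def failing_call_rate_filters(case_called, n_cases, control_called, n_controls,
                              min_rate=0.97, max_diff=0.02):
    """Return the names of the call-rate filters that a single variant fails."""
    case_rate = call_rate(case_called, n_cases)
    control_rate = call_rate(control_called, n_controls)
    # Assumption: "overall" pools cases and controls only.
    overall = call_rate(case_called + control_called, n_cases + n_controls)

    reasons = []
    if overall < min_rate:
        reasons.append("overall")
    if case_rate < min_rate:
        reasons.append("case")
    if control_rate < min_rate:
        reasons.append("control")
    # Assumption: a large case/control gap is what gets a variant removed.
    if abs(case_rate - control_rate) > max_diff:
        reasons.append("case_control_diff")
    return reasons


well_called = failing_call_rate_filters(990, 1000, 995, 1000)
poorly_called = failing_call_rate_filters(990, 1000, 940, 1000)
```

A variant can fail several of these filters at once, so the per-filter counts in the table need not add up to the total number of variants removed.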
Sample filtering
| Filter | Samples | Bipolar cases | Controls | % |
|---|---|---|---|---|
| Initial samples in vcf | 39,618 | 16,486 | 17,212 | 100.0 |
| Unable to obtain both phenotype and sequence information | 2 | NA | NA | 0.0 |
| Unknown phenotype | 32 | NA | NA | 0.1 |
| Low coverage or high contamination | 133 | 72 | 54 | 0.3 |
| Sample call rate < 0.93 | 185 | 124 | 53 | 0.5 |
| % FREEMIX contamination > 0.02 | 268 | 146 | 104 | 0.7 |
| % chimeric reads > 0.015 | 152 | 49 | 100 | 0.4 |
| Mean DP < 30 | 20 | 5 | 12 | 0.1 |
| Mean GQ < 55 | 56 | 28 | 25 | 0.1 |
| Samples with sex swap | 238 | 147 | 52 | 0.6 |
| Related samples for removal | 1,716 | 792 | 688 | 4.3 |
| PCA based filters | 2,880 | 1,120 | 1,422 | 7.3 |
| Within batch Ti/Tv ratio outside 3 standard deviations | 100 | 50 | 42 | 0.3 |
| Within batch Het/HomVar ratio outside 3 standard deviations | 150 | 66 | 58 | 0.4 |
| Within batch Insertion/Deletion ratio outside 3 standard deviations | 93 | 31 | 48 | 0.2 |
| Within location n singletons outside 3 standard deviations | 443 | 151 | 236 | 1.1 |
| Samples after final sample filters | 33,527 | 13,933 | 14,422 | 84.6 |
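The five sample metric thresholds listed above (call rate, FREEMIX contamination, chimeric reads, mean DP, mean GQ) can be applied as in the following sketch. The metric keys and function names are hypothetical and the input is a plain dictionary rather than whatever format the pipeline uses. The sketch also shows why per-filter counts can exceed the number of samples removed: the five rows sum to 681, while the summary table reports 557 samples below sample metric thresholds, which is consistent with some samples failing more than one metric.

```python
import operator

# (hypothetical metric key, comparison that marks a FAILING sample, threshold)
THRESHOLDS = [
    ("call_rate", operator.lt, 0.93),
    ("freemix", operator.gt, 0.02),
    ("pct_chimeras", operator.gt, 0.015),
    ("mean_dp", operator.lt, 30),
    ("mean_gq", operator.lt, 55),
]


def failed_metrics(metrics):
    """Names of the thresholds that one sample's metrics violate."""
    return [name for name, fails, cutoff in THRESHOLDS if fails(metrics[name], cutoff)]


def apply_thresholds(samples):
    """Count failures per filter and collect the set of samples to remove."""
    per_filter = {name: 0 for name, _, _ in THRESHOLDS}
    removed = set()
    for sample_id, metrics in samples.items():
        failed = failed_metrics(metrics)
        for name in failed:
            per_filter[name] += 1
        if failed:
            removed.add(sample_id)
    return per_filter, removed


samples = {
    "s1": {"call_rate": 0.99, "freemix": 0.001, "pct_chimeras": 0.005, "mean_dp": 60, "mean_gq": 80},
    "s2": {"call_rate": 0.90, "freemix": 0.030, "pct_chimeras": 0.005, "mean_dp": 45, "mean_gq": 70},
    "s3": {"call_rate": 0.98, "freemix": 0.001, "pct_chimeras": 0.005, "mean_dp": 25, "mean_gq": 50},
}
per_filter, removed = apply_thresholds(samples)
```

Here two samples are removed but four filter failures are counted, because each failing sample violates two thresholds.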
Summary of sample filtering
| Filter | Samples | Bipolar cases | Controls | % |
|---|---|---|---|---|
| Initial samples in vcf | 39,618 | 16,486 | 17,212 | 100.0 |
| Unable to obtain both phenotype and sequence information | 2 | NA | NA | 0.0 |
| Unknown phenotype | 32 | NA | NA | 0.1 |
| Low coverage or high contamination | 133 | 72 | 54 | 0.3 |
| Below sample metric thresholds | 557 | 276 | 252 | 1.4 |
| Samples with sex swap | 238 | 147 | 52 | 0.6 |
| Related samples for removal | 1,716 | 792 | 688 | 4.3 |
| PCA based filters | 2,880 | 1,120 | 1,422 | 7.3 |
| Outliers in batch-specific sample metrics | 771 | 293 | 374 | 1.9 |
| Samples after final sample filters | 33,527 | 13,933 | 14,422 | 84.6 |
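The "outside 3 standard deviations" filters on batch-specific metrics (Ti/Tv, Het/HomVar, Insertion/Deletion, n singletons) can be sketched as a per-group outlier test. This is a minimal illustration, not the pipeline's code: `group_outliers` is a hypothetical name, the grouping key stands for either batch or location, and the use of the sample standard deviation around the group mean is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_outliers(values, groups, n_sd=3.0):
    """Return IDs of samples whose metric lies more than n_sd SDs from their group mean.

    values: sample ID -> metric value (for example the Ti/Tv ratio)
    groups: sample ID -> group label (batch or location)
    """
    by_group = defaultdict(list)
    for sample_id, value in values.items():
        by_group[groups[sample_id]].append(value)

    # Mean and sample SD per group; groups with one sample cannot be assessed.
    stats = {g: (mean(v), stdev(v)) for g, v in by_group.items() if len(v) > 1}

    outliers = set()
    for sample_id, value in values.items():
        group = groups[sample_id]
        if group not in stats:
            continue
        mu, sd = stats[group]
        if sd > 0 and abs(value - mu) > n_sd * sd:
            outliers.add(sample_id)
    return outliers


# Toy data: batch "a" has one extreme sample, batch "b" has none.
values = {f"a{i}": 2.0 for i in range(19)}
values["a_out"] = 3.0
values.update({f"b{i}": 2.0 + 0.01 * i for i in range(20)})
groups = {sample_id: sample_id[0] for sample_id in values}
flagged = group_outliers(values, groups)
```

Running the test once per metric and taking the union of flagged samples would give a combined count smaller than the sum of the per-metric counts whenever a sample is an outlier on more than one metric, which is consistent with the summary row (771) being below the sum of the four detailed rows (786).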