View Full Version : Online service for FASTQ and BAM conversion



vbnetkhio
08-26-2021, 08:20 PM
https://usegalaxy.org/

Free online service for FASTQ and BAM conversion, and some other tools too.

Lucas
08-26-2021, 09:01 PM
https://usegalaxy.org/

Free online service for FASTQ and BAM conversion, and some other tools too.

A shitload of various tools, but where is the FASTQ-to-BAM converter, which is the most important one?

vbnetkhio
08-26-2021, 09:05 PM
A shitload of various tools, but where is the FASTQ-to-BAM converter, which is the most important one?

BWA-MEM is there, and some others too (regular BWA etc.), but I think there's no need to look further:

https://www.biostars.org/p/117225/

The mapping rates were:
bowtie2: 30%
bwa aln: 25%
bwa mem: 85%

Of course, each of these mapping rates is for default settings that can be changed (see the comments below), but that's where we always start. From those it looks like bwa mem goes a step further and will find alignments where other methods have already given up.

BTW: Arza used aln for those Goth samples, and they turned out fine in G25, so mem can only be better.
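
If anyone wants to run the same mapping locally instead of on Galaxy, the equivalent commands would be roughly as follows (just a sketch; "hg19.fa" and the read file names are placeholders, and the reference has to be indexed once first):

# one-time reference preparation
bwa index hg19.fa

# map single-end reads and sort the result into a BAM
bwa mem -t 4 hg19.fa reads.fastq | samtools sort -o sample.bam -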

vbnetkhio
08-26-2021, 09:12 PM
A shitload of various tools, but where is the FASTQ-to-BAM converter, which is the most important one?

There are multiple BAM-to-VCF tools too.

I think "bcftools mpileup" and then "bcftools call" is the best.

Peterski
08-27-2021, 12:17 PM
Nice website, thanks for the link!

Lucas
09-09-2021, 08:38 PM
There are multiple BAM-to-VCF tools too.

I think "bcftools mpileup" and then "bcftools call" is the best.

Did you check at least some FASTQ in it? How long does it take?

vbnetkhio
09-10-2021, 10:34 AM
Did you check at least some FASTQ in it? How long does it take?

Yes, that's how I converted the Maslomecz file. It was very fast, maybe half an hour.

Lucas
09-10-2021, 10:45 AM
Yes, that's how I converted the Maslomecz file. It was very fast, maybe half an hour.

Can you post the pipeline steps?

vbnetkhio
09-10-2021, 11:42 AM
1) Upload the FASTQ file; it will be added to your "history".

2) Search for the "Map with BWA-MEM" tool:
- select "Human (Homo Sapiens) (b37): hg19" as the reference genome
- select single-end reads
- select your FASTQ from the history
- execute

3) Search for "bcftools mpileup":
- select your BAM output from the first step
- select "Human (Homo Sapiens): hg19" as the reference genome
- for a faster conversion and a smaller VCF output, I also upload a text file with FTDNA and 23andMe positions only (one way to generate such a file is sketched after these steps). The format is "chromosome position", tab-separated, one per line. E.g.:
1 84582272
1 25797421
I think I had some problems with the 0 and MT chromosomes, so just remove those if you get errors.
- then go to Restrict to > Regions > "Operate on Regions specified in a history dataset" and select the text file you uploaded
- execute

4) Go to "bcftools call":
- select your BCF file from the previous step
- select uncompressed VCF output
- execute
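
If you don't already have such a positions file, one rough way to make it is from any 23andMe-style raw data export (columns rsid, chromosome, position, genotype, with "#" comment lines); "genome.txt" is a placeholder name here:

# keep only chromosome and position, tab-separated, skipping comments
# and the 0/MT chromosomes that caused errors
awk '!/^#/ && $2 != "0" && $2 != "MT" {print $2 "\t" $3}' genome.txt > positions.txt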

Lucas
09-10-2021, 11:45 AM
1) Upload the FASTQ file; it will be added to your "history".

2) Search for the "Map with BWA-MEM" tool:
- select "Human (Homo Sapiens) (b37): hg19" as the reference genome
- select single-end reads
- select your FASTQ from the history
- execute

3) Search for "bcftools mpileup":
- select your BAM output from the first step
- select "Human (Homo Sapiens): hg19" as the reference genome
- for a faster conversion and a smaller VCF output, I also upload a text file with FTDNA and 23andMe positions only. The format is "chromosome position", tab-separated, one per line. E.g.:
1 84582272
1 25797421
I think I had some problems with the 0 and MT chromosomes, so just remove those if you get errors.
- then go to Restrict to > Regions > "Operate on Regions specified in a history dataset" and select the text file you uploaded
- execute

4) Go to "bcftools call":
- select your BCF file from the previous step
- select uncompressed VCF output
- execute

Thanks. Would selecting hg38 as the reference be better or not?

vbnetkhio
09-10-2021, 11:49 AM
Thanks. Would selecting hg38 as the reference be better or not?

I think not. All the software (GEDmatch, G25) still uses hg19, and if you use that list of 23andMe/FTDNA positions I mentioned, it's also in hg19.

Lucas
09-10-2021, 11:49 AM
OK, I started converting a Chad sample that I had hesitated over before because of its enormous size. :)

Lucas
09-10-2021, 11:56 AM
1) Upload the FASTQ file; it will be added to your "history".

2) Search for the "Map with BWA-MEM" tool:
- select "Human (Homo Sapiens) (b37): hg19" as the reference genome
- select single-end reads

What about paired FASTQ files? Do you select paired-end reads this time, or do you convert them separately?

vbnetkhio
09-10-2021, 12:41 PM
What about paired FASTQ files? Do you select paired-end reads this time, or do you convert them separately?

I don't know, I haven't encountered them yet. If there are two FASTQ files for the same sample, is that paired? Or could it just be one sample split into two files?

I just converted those separately and then merged them in the end.
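
If they really are paired (they're usually named something like sample_1.fastq and sample_2.fastq), I'd guess the proper way is to select paired-end in BWA-MEM and give it both files; the command-line equivalent would be roughly (file names are placeholders):

# map both mates together so the aligner can use the pairing information
bwa mem hg19.fa sample_1.fastq sample_2.fastq | samtools sort -o sample.bam -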

Lucas
09-10-2021, 12:53 PM
I don't know, I haven't encountered them yet. If there are two FASTQ files for the same sample, is that paired? Or could it just be one sample split into two files?

I just converted those separately and then merged them in the end.

OK, I tried paired and got an error that there is no index... Now I'm converting one of the pair and it seems to be processing. What do you use here for merging BAMs?

vbnetkhio
09-10-2021, 01:02 PM
OK, I tried paired and got an error that there is no index... Now I'm converting one of the pair and it seems to be processing. What do you use here for merging BAMs?

I just convert them separately and merge them at the end with DNA Kit Studio.
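
If you'd rather merge on the command line, samtools can do it too; a quick sketch with placeholder names:

# merge two sorted BAMs into one and index the result
samtools merge merged.bam run1.bam run2.bam
samtools index merged.bam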

Lucas
09-10-2021, 02:21 PM
I just convert them separately and merge them at the end with DNA Kit Studio.

Last question: if I close the browser, will it still be processed on my Galaxy account?

vbnetkhio
09-10-2021, 02:28 PM
Last question: if I close the browser, will it still be processed on my Galaxy account?

Yes, only an upload in progress gets cancelled if you close the browser.

Peterski
09-10-2021, 06:34 PM
1) Upload the FASTQ file; it will be added to your "history".

2) Search for the "Map with BWA-MEM" tool:
- select "Human (Homo Sapiens) (b37): hg19" as the reference genome
- select single-end reads
- select your FASTQ from the history
- execute

3) Search for "bcftools mpileup":
- select your BAM output from the first step
- select "Human (Homo Sapiens): hg19" as the reference genome
- for a faster conversion and a smaller VCF output, I also upload a text file with FTDNA and 23andMe positions only. The format is "chromosome position", tab-separated, one per line. E.g.:
1 84582272
1 25797421
I think I had some problems with the 0 and MT chromosomes, so just remove those if you get errors.
- then go to Restrict to > Regions > "Operate on Regions specified in a history dataset" and select the text file you uploaded
- execute

4) Go to "bcftools call":
- select your BCF file from the previous step
- select uncompressed VCF output
- execute

Are there currently any interesting samples in FASTQ worth converting?

vbnetkhio
09-10-2021, 07:10 PM
Are there currently any interesting samples in FASTQ worth converting?

The Polish Gothic/Medieval samples, and 2 or 3 Hungarian conqueror studies. (They are all low quality, but there are a lot of them, so they can be merged by period/archaeological site.)

I tried the Arpad dynasty FASTQs, but they seem to include only Y and mtDNA. And those 200 Macedonians have very few SNPs.

vbnetkhio
09-10-2021, 07:41 PM
Are there currently any interesting samples in FASTQ worth converting?

These 3 Hungarian studies are available on ENA and not yet converted:
Maternal lineages from 10-11th century commoner cemeteries of the Carpathian Basin

Mitogenomic data indicate admixture components of Central-Inner Asian and Srubnaya origin in the conquering Hungarians

Early medieval genetic data from Ural region evaluated in the light of archaeological evidence of ancient Hungarians

Polish data
https://ftp.cngb.org/pub/gigadb/pub/10.5524/100001_101000/100310/Raw_data/

Lucas
09-11-2021, 05:45 PM
A 30 GB BAM was produced in about a day. I guess using fastqbam on my computer it would take a week. :D

Lucas
09-12-2021, 07:32 AM
A 30 GB BAM was produced in about a day. I guess using fastqbam on my computer it would take a week. :D

OMG, I reached the 250 GB limit on my account.

vbnetkhio
09-12-2021, 07:51 AM
OMG, I reached the 250 GB limit on my account.

After which step?
Did you extract the commercial-company SNPs in the second step? The file should be much smaller then.

Also, you should delete the datasets from previous steps that you don't need anymore, but it's a bit tricky: you need to use "delete hidden datasets" and "purge deleted datasets" in the history settings after deleting them, and if you have more than one history there is also something like "purge deleted histories".

Lemminkäinen
09-12-2021, 09:05 AM
I use BWA to align reads.

Lemminkäinen
09-12-2021, 09:12 AM
Are there currently any interesting samples in FASTQ worth converting?

Some studies release only FASTQ files in ENA or other archives. You can search by study project name in ENA.

Lucas
09-12-2021, 09:15 AM
After which step?
Did you extract the commercial-company SNPs in the second step? The file should be much smaller then.

Also, you should delete the datasets from previous steps that you don't need anymore, but it's a bit tricky: you need to use "delete hidden datasets" and "purge deleted datasets" in the history settings after deleting them, and if you have more than one history there is also something like "purge deleted histories".

What you said before was enough, and I found out by myself that I must use "purge...".

It was the fault of a very big FASTQ, which produced a 30 GB BAM. Then the VCF was a few times bigger. And I did it twice simultaneously, so I reached the maximum capacity. OK, now I'm downloading those BAMs and will convert them in WGS as usual, but probably only the older version would work for them.

smd555
09-19-2021, 05:58 PM
First upload your file, then in bcftools mpileup select this file under "Input BAM/CRAM", select hg19 under "Select reference genome", and then you can execute the algorithm. When it finishes, it will output a BCF file, which will be added to your history.

After that, run "bcftools call" on the BCF output from the first step: just select the file and run it, and you'll get another BCF file. In this step you can already choose to output a VCF file, but it will probably be too big and take too long to download and convert.

Because of that, I also run "bcftools filter" on the second BCF file. Select the second BCF file, then upload this file: https://easyupload.io/769eny , then under Restrict To > Regions select "Operate on Regions specified in a history dataset", select the dataset you uploaded (new.tsv), and for your output type select "uncompressed VCF".

Then you can download this VCF file and convert it to 23andMe format with DNA Kit Studio.

For this file I started with the FASTQ, not the BAM. In that case there is one extra step at the beginning: convert the FASTQ to BAM with "Map with BWA-MEM"; you also need to select the hg19 reference genome and "Single" under "Single or Paired-end reads", then select your FASTQ file and run it. Then you'll have a BAM and you can do the mpileup and the rest.

1. With "bcftools call" I produce the second, more compressed BCF file from the first BCF. But when I run "bcftools filter" and try to produce a VCF, this error occurs:
[E::bcf_sr_regions_init] Could not parse the file /galaxy-repl/main/files/061/373/dataset_61373762.dat, using the columns 1,2[,-1]
Failed to read the regions: /galaxy-repl/main/files/061/373/dataset_61373762.dat

This also occurs if I try to run it on the first BCF (61, the original BCF).

2. Also, "bcftools call" does not make an uncompressed VCF from the second (more compressed) BCF, only from the first BCF. The error:
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
Wrong number of PL fields? nals=1 npl=-3

vbnetkhio
09-19-2021, 10:39 PM
1. With "bcftools call" I produce the second, more compressed BCF file from the first BCF. But when I run "bcftools filter" and try to produce a VCF, this error occurs:

This also occurs if I try to run it on the first BCF (61, the original BCF).

2. Also, "bcftools call" does not make an uncompressed VCF from the second (more compressed) BCF, only from the first BCF. The error:

There is probably something wrong with your .dat file; could you post the first few lines of it?

BTW, you can skip the filtering and just output an uncompressed VCF from bcftools call, but it will probably be very big.

Lucas
09-19-2021, 10:51 PM
There is probably something wrong with your .dat file; could you post the first few lines of it?

BTW, you can skip the filtering and just output an uncompressed VCF from bcftools call, but it will probably be very big.

It is also possible to download the VCF and convert it using PLINK. But maybe this kind of VCF needs additional processing first?

vbnetkhio
09-19-2021, 11:10 PM
It is also possible to download the VCF and convert it using PLINK. But maybe this kind of VCF needs additional processing first?

I use DNA Kit Studio because it can convert a VCF directly to 23andMe format. But it should work in PLINK too.
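
If you go the PLINK route, I believe PLINK 1.9 has a direct 23andMe output mode; roughly (untested here, and the file names are placeholders):

# convert a single-sample VCF to a 23andMe-style text file
# (--recode 23 expects exactly one sample in the VCF)
plink --vcf sample.vcf --recode 23 --out sample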

Lucas
09-20-2021, 12:05 PM
I use DNA Kit Studio because it can convert a VCF directly to 23andMe format. But it should work in PLINK too.

Yes, I forgot about it.

smd555
09-27-2021, 05:34 PM
There is probably something wrong with your .dat file; could you post the first few lines of it?

Do you mean my "filter.txt" file, which I made from "Template_23andme_v3"? This is how its beginning looks in Excel:

https://i.ibb.co/JBM29Dv/2021-09-27-203156.png

Lucas
09-27-2021, 06:40 PM
Generally, good practice is not to use Excel for opening genome files in Windows, especially when you want to save them; use a text editor that can open large files quickly, like EditPad. Unlike standard Notepad, it also does not change the formatting/encoding.

smd555
09-28-2021, 03:28 PM
Generally, good practice is not to use Excel for opening genome files in Windows, especially when you want to save them; use a text editor that can open large files quickly, like EditPad. Unlike standard Notepad, it also does not change the formatting/encoding.

Did you try that bcftools filter? I wonder what is wrong: the tool, my filter file, my settings, or the input BCF files?

smd555
09-28-2021, 08:35 PM
The other problem is that some BCFs cannot be compressed into a smaller BCF using bcftools call. There is no error, but the resulting BCF is empty:
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid

vbnetkhio
10-18-2021, 08:36 AM
Do you mean my "filter.txt" file, which I made from "Template_23andme_v3"? This is how its beginning looks in Excel:

https://i.ibb.co/JBM29Dv/2021-09-27-203156.png

I found the problem: the chromosomes should be named like chr1, chr2 ... chrY.
MT should probably be excluded.

Here's the file:

https://easyupload.io/vwenmx

You should use it already in the "bcftools mpileup" step (Restrict To > Regions > history dataset).
It will finish much faster and the output file will be smaller. After that it's not necessary to filter anymore; just run bcftools call and you have your file.
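
If you want to fix your own positions file the same way, a quick sketch (assuming the original tab-separated "chromosome position" format in filter.txt):

# add the "chr" prefix and drop the MT rows
awk '$1 != "MT" {print "chr" $1 "\t" $2}' filter.txt > filter_chr.txt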

smd555
10-21-2021, 02:39 PM
I found the problem: the chromosomes should be named like chr1, chr2 ... chrY.
MT should probably be excluded.

Here's the file:

https://easyupload.io/vwenmx

You should use it already in the "bcftools mpileup" step (Restrict To > Regions > history dataset).
It will finish much faster and the output file will be smaller. After that it's not necessary to filter anymore; just run bcftools call and you have your file.

Thank you very much! :thumb001: This filter works. Unfortunately, I have many problems with files from the European Nucleotide Archive. It will take some time to check all the nuances of filtering.

smd555
10-22-2021, 03:15 PM
I have a question: on Vahaduo there are Sredniy Stog samples I4110_Ukraine_EN and I5882_Ukraine_EN. Did they come from here? https://www.ebi.ac.uk/ena/browser/view/PRJEB22652?show=reads
Their files from ENA generate empty VCFs.

vbnetkhio
10-22-2021, 09:01 PM
I have a question: on Vahaduo there are Sredniy Stog samples I4110_Ukraine_EN and I5882_Ukraine_EN. Did they come from here? https://www.ebi.ac.uk/ena/browser/view/PRJEB22652?show=reads
Their files from ENA generate empty VCFs.

BAM or VCF files? With filtering or without?

Try without filtering, or with a different version of the genome build (for example hg_g1k_v37 instead of the normal hg19), and try both kinds of filter (chr1, chr2 ... or 1, 2 ...).
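
Whether the regions need the "chr" prefix depends on how the contigs are named in the BAM itself; on the command line you could check the header with something like:

# show the sequence names the BAM was mapped against (SN:1 vs. SN:chr1)
samtools view -H sample.bam | grep '^@SQ' | head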

smd555
10-23-2021, 03:41 PM
bam or vcf files? with filtering or without?

try without filtering, or with a different version of the genotype build (for example hg_g1k_v37 instead of the normal hg19) and try with both kinds of filter (chr1, chr2 or 1,2..)

With the filter it produces a VCF (330,000 lines, 44.6 MB) that yields 0 SNPs in DNA Kit Studio. The same happens with all the reference human genomes when filtering.
Without filtering it produces a VCF (130,000,000 lines, 15 GB), but I cannot download it (after downloading 3-5 GB of the file, a message appears that the disk is full, although the disk actually has plenty of free space).
Also, I cannot upload big BAM or FASTQ files (>5 GB) into usegalaxy.