Genomics: algorithms and analysis

Assembly

Assemble the reads with an OLC (overlap-layout-consensus) algorithm, using the program newbler.

First, count the number of sequences in each library to get an overview. Run the first assembly with only the unpaired and short paired-end libraries; for the second assembly, add the 3k mate-pairs.
Why? This will show you how small a portion of the data the mate-pair libraries make up, and yet how significantly they improve the assembly.
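
The counting step could be sketched as follows. This is only one possible approach, under two assumptions: the FASTQ files are uncompressed with exactly four lines per record, and the .sff files follow the standard SFF layout, where the read count is stored in the common header. The function names are illustrative, not part of any course material. Keep in mind that in a 454 mate-pair library a single SFF read normally holds both mates joined by a linker, so the SFF counts are not directly comparable to counts of individual Illumina reads.

```python
import struct
import sys
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming 4 lines per record."""
    with open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def count_sff_reads(path):
    """Return number_of_reads from the SFF common header.

    The first 24 bytes are big-endian: magic (4 bytes), version (4 bytes),
    index_offset (uint64), index_length (uint32), number_of_reads (uint32).
    """
    with open(path, "rb") as handle:
        header = handle.read(24)
    if len(header) < 24:
        raise ValueError(f"{path}: file too short for an SFF header")
    magic, _version, _idx_offset, _idx_length, n_reads = struct.unpack(
        ">4s4sQII", header
    )
    if magic != b".sff":
        raise ValueError(f"{path}: missing .sff magic number")
    return n_reads


def count_reads(path):
    """Dispatch on the file extension: .sff is binary, everything else FASTQ."""
    path = Path(path)
    if path.suffix == ".sff":
        return count_sff_reads(path)
    return count_fastq_reads(path)


if __name__ == "__main__":
    # Print one line per file given on the command line.
    for name in sys.argv[1:]:
        print(f"{count_reads(name):>12}  {name}")
```

Saved under a name of your choice, the script takes the library files from the Data section as arguments and prints one count per file.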

Perform a third assembly (without the 3k mate-pair .sff files) using the de Bruijn graph algorithm (program SOAPdenovo).
Why? This will help you understand the differences in results between the two types of algorithms.

Control

Write a program to calculate N50, L50 (optionally other parameters), the number of contigs, and their total length. Compare the results of all three assemblies; the best one will be used further. You can also use the program QUAST to compare different assembly versions.
Why? You will understand the metrics used to evaluate the quality of assemblies.
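
As a starting point, the metrics could be computed along these lines. This is a minimal sketch, assuming the contigs are in a plain (uncompressed) FASTA file. N50 is taken here as the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembly length, and L50 as the number of contigs needed to get there. The function names are illustrative.

```python
def read_fasta_lengths(path):
    """Return the sequence length of every record in a FASTA file."""
    lengths = []
    current = None  # None until the first header line is seen
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for the given fraction, e.g. 0.5 for N50/L50."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, rank
    return 0, 0  # empty input


def assembly_stats(lengths):
    """Summarise an assembly: contig count, total length, N50 and L50."""
    n50, l50 = nx_lx(lengths, 0.5)
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "N50": n50,
        "L50": l50,
    }


if __name__ == "__main__":
    import sys

    # One summary line per FASTA file given on the command line.
    for name in sys.argv[1:]:
        print(name, assembly_stats(read_fasta_lengths(name)))
```

For contig lengths 80, 70, 50, 40, 30 and 20 the total is 290 and half of it is 145; the running sum reaches 150 at the second contig, so N50 = 70 and L50 = 2. Calling nx_lx with fraction=0.9 gives N90/L90 in the same way.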

Select a very long scaffold from the best assembly and map/align the contigs (scaffolds) from the other two assemblies to it using BLAST or FASTA. (BLAST is installed on the VM; check the blastn -outfmt switch.) Present the result as a clear image/slide/table.
Why? Visualize the differences between the approaches and understand the significance of the datasets.
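
One way to turn the alignments into numbers for a table is sketched below. It assumes blastn was run with -outfmt 6 (tab-separated, 12 default columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), with the contigs of another assembly as query and the selected scaffold as subject. The identity and length thresholds are arbitrary illustrative values, not recommendations, and the function names are made up for this example.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "query subject identity length s_start s_end")


def parse_outfmt6(lines):
    """Parse default 12-column BLAST tabular output into Hit tuples."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Minus-strand hits are reported with sstart > send; normalise them.
        s_start, s_end = sorted((int(fields[8]), int(fields[9])))
        yield Hit(fields[0], fields[1], float(fields[2]), int(fields[3]),
                  s_start, s_end)


def covered_bases(hits, min_identity=95.0, min_length=500):
    """Count scaffold bases covered by at least one retained hit.

    Overlapping subject intervals are merged so that no base is counted
    twice. Coordinates are 1-based and inclusive, as in BLAST output.
    """
    intervals = sorted(
        (h.s_start, h.s_end)
        for h in hits
        if h.identity >= min_identity and h.length >= min_length
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


if __name__ == "__main__":
    import sys

    # Usage: python script.py hits.tsv scaffold_length
    with open(sys.argv[1]) as handle:
        hits = list(parse_outfmt6(handle))
    scaffold_length = int(sys.argv[2])
    bases = covered_bases(hits)
    print(f"{len(hits)} hits, {bases} bases covered "
          f"({100 * bases / scaffold_length:.1f}% of the scaffold)")
```

Running this once per assembly gives comparable coverage figures for the same scaffold; the sorted subject intervals can also serve as coordinates for a simple plot.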

Data

The data contains three libraries: one paired-end from Illumina and two mate-pair from 454. The Illumina dataset is preprocessed: during quality control some reads were removed, and where only one read of a pair survived, it appears as single-end. To reduce computational complexity, we merged read pairs that were so close to each other that they overlapped (extendedFrags), turning them into single-end reads as well. The remaining reads stayed paired (notCombined).

files: (GAA2024/E8_data)

	trim_database.fna

single end (the mate is missing), MiSeq:
	Bcc7419-MiSeq-A895A-PE_1_U.fastq 
	Bcc7419-MiSeq-A895A-PE_2_U.fastq  
single end (joined paired-end), MiSeq:
	Bcc7419-MiSeq-A895A-PE_12_JOIN_P.extendedFrags.fastq
paired-end (paired-end without overlap), 600bp, MiSeq:
	Bcc7419-MiSeq-A895A-PE_12_JOIN_P.notCombined_1.fastq
	Bcc7419-MiSeq-A895A-PE_12_JOIN_P.notCombined_2.fastq

mate pairs, 3kbp, 454:
	Bcc7419-454-HB0RHHA02-PE_3k-UNIQ.sff
	Bcc7419-454-HAV0LKU05-PE_3k-UNIQ.sff
      

Hints

hints for OLC:

	/mnt/shared/454/bin/runAssembly --version
	/mnt/shared/454/bin/newAssembly $NAME
	/mnt/shared/454/bin/addRun -lib SE_Illu -np $NAME /mnt/shared/GAA2024/E8_data/Bcc7419-MiSeq-A895A-PE_12_JOIN_P.extendedFrags.fastq
	...
	/mnt/shared/454/bin/addRun -lib PE_3k -p $NAME /mnt/shared/GAA2024/E8_data/Bcc7419-454-HAV0LKU05-PE_3k-UNIQ.sff
	...
	/mnt/shared/454/bin/runProject $NAME

	less $NAME/assembly/454AssemblyProject.xml
      

hints for de Bruijn: check soap.config in the folder

	SOAPdenovo-63mer all -s soap.config -o $NAME
      

Time-stamp: <2025-11-26 13:46:50 (hpaces)>