Genomics: analysis and algorithms

DNA vs. PROTEIN

Here is the sequence of very conserved yeast protein. You would like to find it's plant homolog.

>Scsec19 
GGGATTGTAGATGTAGTTTCAACACGTCGGCTGATTTATCCCGATTTTGTTAGTAGAAAAGGTTCTACTT 
CATTCTTGCTTGAGACGTCGTCCCATCAAATTTCTAACATAGTCTTTTTTCAAGGAAGGATATTTTTCAA 
AGCAGGACTGCAATTAGTCTTTTCCTTTTCTTTACTCCCCTTCCATCATAACTGTTAGTGAATAACCACT 
TATATAGCATAACACAATGGATCAAGAAACAATAGACACTGACTACGACGTGATTGTCTTAGGTACCGGT 
ATTACCGAATGTATCTTATCTGGTTTACTCTCTGTAGATGGAAAAAAGGTATTACATATTGACAAGCAAG 
ACCATTATGGTGGCGAAGCTGCTTCTGTGACCTTATCTCAATTGTATGAAAAATTTAAACAAAATCCGAT 
CAGTAAAGAGGAACGGGAGTCCAAGTTTGGTAAAGATAGAGATTGGAATGTCGACTTAATTCCTAAATTC 
CTGATGGCCAATGGTGAGCTGACAAATATTTTAATACATACCGATGTGACCAGATATGTCGATTTCAAGC 
AAGTTTCTGGCTCCTACGTTTTTAAGCAAGGCAAAATTTACAAAGTGCCAGCTAATGAAATAGAAGCCAT 
TTCATCGCCATTGATGGGTATTTTTGAAAAACGTAGAATGAAGAAATTTTTAGAATGGATTAGCTCTTAC 
AAAGAAGATGACTTGTCCACTCATCAAGGATTAGACTTAGACAAGAATACCATGGATGAAGTGTATTATA 
AATTTGGGTTAGGCAATTCTACCAAAGAATTCATCGGTCATGCAATGGCTTTATGGACCAATGATGACTA 
CTTACAACAACCTGCTAGGCCATCGTTTGAGAGGATTTTGTTATATTGCCAAAGTGTTGCCCGTTACGGT 
AAATCACCTTATTTGTATCCTATGTATGGGTTAGGCGAACTTCCACAAGGATTTGCTCGTTTGTCGGCTA 
TTTACGGTGGTACTTACATGCTAGACACTCCAATTGATGAAGTATTGTATAAAAAAGACACAGGAAAATT 
TGAAGGGGTCAAGACTAAGCTGGGAACTTTCAAGGCCCCATTGGTTATTGCTGATCCAACTTATTTTCCC 
GAAAAATGTAAATCTACTGGTCAAAGAGTTATTAGAGCCATCTGTATTCTTAACCATCCAGTTCCGAACA 
CCAGTAACGCGGATTCTTTACAAATTATTATCCCACAAAGCCAACTGGGAAGGAAAAGCGATATATACGT 
TGCGATTGTTTCAGATGCGCATAACGTTTGCTCCAAGGGTCACTATTTAGCAATTATTTCTACAATCATT 
GAAACTGATAAACCACATATAGAATTAGAGCCTGCTTTCAAACTTCTGGGACCAATCGAAGAAAAATTCA 
TGGGAATTGCCGAATTATTTGAACCAAGAGAAGACGGCTCTAAGGATAACATTTACTTATCCAGATCATA 
CGACGCATCCTCTCATTTCGAATCCATGACTGACGATGTTAAAGATATTTACTTCAGAGTAACAGGCCAC 
CCATTAGTTCTAAAACAAAGACAAGAACAAGAAAAGCAGTAAATTCATACCTTTACGACTAAAGCAGCAA 
TTGGAGGGTAAACTTATTTTTTCC

Similarity matrices

Compare two sequences using BLATSP or FASTA program. Use three different similarity matrices and different fap penalties. Summarize somehow what was the effect on results.

Sequences:

 >gi_13397640 unknown protein, Brassica napus
 MSSAPSPGTGSPPSPPSNSTTTTPPPASAPPPTTPSSPPPPSTIPTSPPPSSRSTPSAPPPSPPTPSTPG
 SPPPLPQPSPPAPTTPGSPPAPVTPPTRNPPPSVPGPPSNPSREGGSPRPPSSPSPPSPSSDGLSTGVVV
 GIAIGGVALLVIVTLICLLCKKKRRRDEEDAYYVPPPPPPGPKAGGPYGGQQQQWRQQNATPPSDHVVTS
 LPPPPKAPSPPRQPPPPPPPPFMSSSGGSDYSDRPVLPPPSPGLVLGFSKSTFTYEELARATNGFSEANL
 LGQGGFGYVHKGVLPSGKEVAVKQLKVGSGQGEREFQAEVEIISRVHHRHLVSLVGYCIAGAKRLLVYEF
 VPNNNLELHLHGEGRPTMEWSTRLKIALGSAKGLSYLHEDCNPKIIHRDIKASNILIDFKFEAKVADFGL
 AKIASDTNTHVSTRVMGTFGYLAPEYAASGKLTEKSDVFSFGVVLLELITGRRPVDANNVYVDDSLVDWA
 RPLLNRASEQGDFEGLADAKMNNGYDREEMARMVACAAACVRHSARRRPRMSQIVRALEGNVSLSDLNEG
 MRPGQSNVYSSYGGSTDYDSSQYNEDMKKFRKMALGTQEYNATGEYSNPTSDYGLYPSGSSSEGQTTREM
 EMGKIKRTGQGYSGPSL
 >gi_1345852_sp_P41242|MATK_MOUSE Megakaryocyte-associated tyrosine-protein kinase
 MARRSSRVSWLAFEGWESRDLPRVSPRLFGAWHPAPAAARMPTRWAPGTQCMTKCENSRPKPGELAFRKG
 DMVTILEACEDKSWYRAKHHGSGQEGLLAAAALRHGEALSTDPKLSLMPWFHGKISGQEAIQQLQPPEDG
 LFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVCFCNLMDMVEHYTKDKGAICTKLVKPRR
 KQGAKSAEEELAKAGWLLDLQHLTLGAQIGEGEFGAVLQGEYLGQKVAVKNIKCDVTAQAFLDETAVMTK
 LQHRNLVRLLGVILHHGLYIVMEHVSKGNLVNFLRTRGRALVSTSQLLQFALHVAEGMEYLESKKLVHRD
 LAARNILVSEDLVAKVSDFGLAKAERKGLDSSRLPVKWTAPEALKNGRFSSKSDVWSFGVLLWEVFSYGR
 APYPKMSLKEVSEAVEKGYRMEPPDGCPGSVHTLMGSCWEAEPARRPPFRKIVEKLGRELRSVGVSAPAG
 GQEAEGSAPTRSQDP

Biological thinking

You got DNA sequence from sequencing facility, DNA encode bacterial gene. Sequencing servis marked your sequence as probably errorneous, you have to expect erros and frameshifts.

>rc-m2-30
TGCCCTGCGCCGCGCTATTCGACGCCATCATGGACTGCCTGAAGGAGCATGGCGAGGTGCGCACCATTCG
CGTGGCTGCGGCGGACGTGAACGGGGTGGCAACGGGTAAGCGCATACCCGCACGTTTCGCAAGCAAGGTT
TTTTCCGAGGGAACACGGTAACCGTTCTCGGTGATGAACCTCGACATCTGGGGCGAGGACATCGAGGAAA
GTCCGCTGGTTTTTGAAACCGGCCTCTGCGATGGCCTGTTGCGCGCGACCGAGAGGCCCTTCATGCCGAT
GCCCTGGCTCGACCCACCGACGGCGCTACTGCCGATCTGGATGTATCACATGGATGGCCGCCCCTATTCG
GCCCTCCACGGCAGGCGCTGGCGGCGGTCAAGGACCGCTACACCGTAAAGGGCCTGACGGGCGTGGTGGC
GACGGAACTTGAAGCTGCTGTGATCGACGACAGCGGCACGATTCTGCGCGTGCCGCCCTCGCCCCGTTCC
GGCAAGCGCCGCACCGGGCCCGAAATCCTGTCGCTGCGTTCGCTTGACGCCTTTGACGGCTTCTTCACCG
CGCTTTACGCGGCCTGCGAGGTGATGGACATTCCGGCAGATATGGCGATTTCCGAAACCGCCTCGGGGCA
GTTCGAGATCAACCTGATGCAGTAGGCCGATCCGCGGAAGTCCGCCGATGACACCTGGCTGTTCAAGATG
CTGGTCAAGGGTCTGGCGCGGCAGCACGGCTATGCCGCCTCGCCCATGGCGAAACCCAATGATCTGTGGT
CGGGCAACGGGATGCGCGGGCATTTCTCGACCCTCGATCAGAACGGCGAAAACATCTTCAACCTGGGCAC
CGAAAAGGGCTCGGATGCGTTGCTGTCCGCGGTGGCGGGCTATCTGGCGGCGCTGCCGGGACCGACGCTG
ATCTTTGCGGTGGTTCAGAACAGCTACACCCAGCAGGTGCCCAATGCCCGTGTGTCTACGCGAATTGTCT
GGGCCTATGAGAACCGCGCGGGGTTTTTGCGGATCCCGTCTTCGGGGCACGCGGCGCGGCGGATCGAGCA
GTCGGGTGGCGTGGGGCGACGTGAACCCCTATCTGATAATCGCCGCTAGCCTTGGTGCGGCGCTGGTCGG
GCTCGTAGACAAAATGGTCCCCGACGAGCCGATCGTCGACAACGCTGATGCGAAATATCTGCCGCACCCG
CCCGCAACGTGGAAACTCGAGATAACCCTGTTCGACAGCTGCCCGCTGATCAAGCGCATCTTTGTAGAAG
AGCTGATCGAGAACTCCCTGATGACCAAGCGTTAGGAGATCCACTACATGGCGGCGCTGTCCGAAGAGTA
GCAGACCGAGCTTTACCTCGCCATCGTCGCCCTGCGTGATCGCGTACCGAC
Time-stamp: <2024-11-06 13:47:18 (hpaces)>