Bioinformaticsのお勉強: EMBOSSのseqretを使いこなして、GENBANKからhomologの配列をまとめてごっそりとってくる！！

EMBOSSを用いて塩基配列をNCBIのRefSeqデータベースから取得してみる。

$ wossname
Finds programs by keywords in their short description
Text to search for, or blank to list all programs: read #EMBOSSに搭載されているreadという単語に関連のあるプログラムを検索。

SEARCH FOR 'READ'
aligncopy     Reads and writes alignments
aligncopypair Reads and writes pairs from alignments
featcopy      Reads and writes a feature table
featreport    Reads and writes a feature table
getorf        Finds and extracts open reading frames (ORFs)
nthseqset     Reads and writes (returns) one set of sequences from many
plotorf       Plot potential open reading frames in a nucleotide sequence
seqret        Reads and writes (returns) sequences
seqretsetall Reads and writes (returns) many sets of sequences
seqretsplit   Reads sequences and writes them to individual files
skipseq       Reads and writes (returns) sequences, skipping first few

#seqretは、配列を読み込んで結果を書き出すって書いてある！！これが使えそう！！

$ wossname
Finds programs by keywords in their short description
Text to search for, or blank to list all programs: database #次にデータベースに関わるプログラムを検索
SEARCH FOR 'DATABASE'
aaindexextract Extract amino acid property data from AAINDEX
cutgextract    Extract codon usage tables from from CUTG database
dbiblast       Index a BLAST database
dbifasta       Index a fasta file database
dbiflat        Index a flat file database
dbigcg         Index a GCG formatted database
dbxfasta       Index a fasta file database using b+tree indices
dbxflat        Index a flat file database using b+tree indices
dbxgcg         Index a GCG formatted database using b+tree indices
entret         Retrieves sequence entries from flatfile databases and files
jaspextract    Extract data from JASPAR
patmatmotifs   Scan a protein sequence with motifs from the PROSITE database
printsextract Extract data from PRINTS database for use by pscan
prosextract    Processes the PROSITE motif database for use by patmatmotifs
rebaseextract Process the REBASE database for use by restriction enzyme applications
redata         Retrieve information from REBASE restriction enzyme database
showdb         Displays information on configured databases　　　　　
stssearch      Search a DNA database for matches with a set of STS primers
tfextract      Process TRANSFAC transcription factor database for use by tfscan
whichdb        Search all sequence databases for an entry and retrieve it

#showdbでデータベースの情報が表示できるらしい

$ showdb
Displays information on configured databases
# Name         Type ID Qry All Comment
# ============ ==== == === === =======
SpTrEMBL       P    OK -   -   Contains the translations of all coding sequences present in the EMBL sequence database not yet integrated in SWISS-PROT
pir            P    OK -   -   Protein Identification Resource.
refseqp        P    OK -   -   Database of protein information from REFSEQ
swall          P    OK -   -   A combined database of Swiss-Prot, SPTREMBL and TREMBLNEW. Does not contain REMTREMBL.
swissprot      P    OK -   -   Database of protein sequences by SIB and EBI.
tpir           P    OK OK OK PIR using NBRF access for 4 files
tsw            P    OK OK OK Swissprot native format with EMBL CD-ROM index
tswnew         P    OK OK OK Swissnew as 3 files in native format with EMBL CD-ROM index
twp            P    OK OK OK EMBL new in native format with EMBL CD-ROM index
DDBJNEW        N    OK -   -   DDBJNEW IDs
DDBJRELEASE    N    OK -   -   DDBJRELEASE IDs
GENBANK        N    OK -   -   NCBI IDs
embl           N    OK -   -   The EMBL nucleotide sequence database constitutes Europes primary nucleotide sequence resource. The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ).
refseq         N    OK -   -   Database providing non-redundant curated data representing knowledge of known genes
tembl          N    OK OK OK EMBL in native format with EMBL CD-ROM index
tgb            N    OK -   -   Genbank IDs
tgenbank       N    OK OK OK GenBank in native format with EMBL CD-ROM index

#これは面白い！！！今現在の環境で、SpTrEMBL - tgenbankまでのデータベースが使うことが可能みたいだ！！

#ここで、GENBANKとrefseqの違いについて井上　潤先生の記事を引用します
(http://www.geocities.jp/ancientfishtree/RefSeq.html)

ーーーーーーここからーーーーーーーーーー

RefSeq は，Reference Sequenceの略で、配列解析に "reference"（リファレンス）となるべき配列データベースのことです．NCBI のスタッフが，最も代表としてふさわしい (参照の基準となる) 遺伝子配列をGenBank などのデータベースから目で見て選んで，RefSeq データベースを作成しています

GenBank 研究者自身が投稿            同じ遺伝視座から複数のレコードがある     あらゆる生物 (250,000 種)

RefSeq NCBIが既存のデータから作成主な生物から一つのレコードに限られているモデル生物 (4000 種)

ーーーーーここまでーーーーーーーーーーー

GenBankはとにかく内容が盛りだくさんで、逆にノイズが多い。refseqはNCBIのスタッフが厳選した、”reference”になるもののようだ。

$ seqret　　　　　　　　　　　　　　　　　　　　　　#seqretと入力するのが合言葉
Reads and writes (returns) sequences
Input (gapped) sequence(s): refseq:NM_000546　　 #refseqのNM_000546を取ってきてっていう思いを込める。
output sequence(s) [nm_000546.fasta]: 　　　　　　#Enterを押す！！

$ ls
nm_000546.fasta

$ cat nm_000546.fasta　　　　　　#nm_000546.fastaの中身を見てみよう！！

>NM_000546 NM_000546.4 Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA.
gattggggttttcccctcccatgtgctcaagactggcgctaaaagttttgagcttctcaa
aagtctagagccaccgtccagggagcaggtagctgctgggctccggggacactttgcgtt
cgggctgggagcgtgctttccacgacggtgacacgcttccctggattggcagccagactg
ccttccgggtcactgccatggaggagccgcagtcagatcctagcgtcgagccccctctga
gtcaggaaacattttcagacctatggaaactacttcctgaaaacaacgttctgtccccct
tgccgtcccaagcaatggatgatttgatgctgtccccggacgatattgaacaatggttca
ctgaagacccaggtccagatgaagctcccagaatgccagaggctgctccccccgtggccc
ctgcaccagcagctcctacaccggcggcccctgcaccagccccctcctggcccctgtcat
cttctgtcccttcccagaaaacctaccagggcagctacggtttccgtctgggcttcttgc
attctgggacagccaagtctgtgacttgcacgtactcccctgccctcaacaagatgtttt
gccaactggccaagacctgccctgtgcagctgtgggttgattccacacccccgcccggca
cccgcgtccgcgccatggccatctacaagcagtcacagcacatgacggaggttgtgaggc
gctgcccccaccatgagcgctgctcagatagcgatggtctggcccctcctcagcatctta
tccgagtggaaggaaatttgcgtgtggagtatttggatgacagaaacacttttcgacata
gtgtggtggtgccctatgagccgcctgaggttggctctgactgtaccaccatccactaca
actacatgtgtaacagttcctgcatgggcggcatgaaccggaggcccatcctcaccatca
tcacactggaagactccagtggtaatctactgggacggaacagctttgaggtgcgtgttt
gtgcctgtcctgggagagaccggcgcacagaggaagagaatctccgcaagaaaggggagc
ctcaccacgagctgcccccagggagcactaagcgagcactgcccaacaacaccagctcct
ctccccagccaaagaagaaaccactggatggagaatatttcacccttcagatccgtgggc
gtgagcgcttcgagatgttccgagagctgaatgaggccttggaactcaaggatgcccagg
ctgggaaggagccaggggggagcagggctcactccagccacctgaagtccaaaaagggtc
agtctacctcccgccataaaaaactcatgttcaagacagaagggcctgactcagactgac
attctccacttcttgttccccactgacagcctcccacccccatctctccctcccctgcca
ttttgggttttgggtctttgaacccttgcttgcaataggtgtgcgtcagaagcacccagg
acttccatttgctttgtcccggggctccactgaacaagttggcctgcactggtgttttgt
tgtggggaggaggatggggagtaggacataccagcttagattttaaggtttttactgtga
gggatgtttgggagatgtaagaaatgttcttgcagttaagggttagtttacaatcagcca
cattctaggtaggggcccacttcaccgtactaaccagggaagctgtccctcactgttgaa
ttttctctaacttcaaggcccatatctgtgaaatgctggcatttgcacctacctcacaga
gtgcattgtgagggttaatgaaataatgtacatctggccttgaaaccaccttttattaca
tggggtctagaacttgacccccttgagggtgcttgttccctctccctgttggtcggtggg
ttggtagtttctacagttgggcagctggttaggtagagggagttgtcaagtctctgctgg
cccagccaaaccctgtctgacaacctcttggtgaaccttagtacctaaaaggaaatctca
ccccatcccacaccctggaggatttcatctcttgtatatgatgatctggatccaccaaga
cttgttttatgctcagggtcaatttcttttttcttttttttttttttttttctttttctt
tgagactgggtctcgctttgttgcccaggctggagtggagtggcgtgatcttggcttact
gcagcctttgcctccccggctcgagcagtcctgcctcagcctccggagtagctgggacca
caggttcatgccaccatggccagccaacttttgcatgttttgtagagatggggtctcaca
gtgttgcccaggctggtctcaaactcctgggctcaggcgatccacctgtctcagcctccc
agagtgctgggattacaattgtgagccaccacgtccagctggaagggtcaacatctttta
cattctgcaagcacatctgcattttcaccccacccttcccctccttctccctttttatat
cccatttttatatcgatctcttattttacaataaaactttgctgccacctgtgtgtctga
ggggtg

#あとは、マルティプルアライメントを視野に入れて、複数の配列を一挙に読み込んで、一つのテキストファイルに書き出す方法を行ってみる。

#今回はアポトーシスに重要なカスパーゼの７についてHomologeneで引っかかったホモログをrefseqからアミノ配列で並べてみる。

$ vim CASP7.txt　　　3 vim CAASP7.txtというファイルをvエディタで作成しますよ。

a

GENBANK:NP_203124
GENBANK:XP_508043
GENBANK:XP_544026
GENBANK:XP_604643
GENBANK:NP_031637
GENBANK:NP_071596
GENBANK:XP_421764
GENBANK:NP_001018443
GENBANK:NP_524551
GENBANK:NP_476974
GENBANK:XP_316795

Esc

:wq       #上書き保存して終了

#refseqはNMはうまくできたが、NPができなかった。

$ ls
CASP7.txt nm_000546.fasta

$ seqret @CASP7.txt -outseq=CASP7.fasta

$

>NP_203124 NP_203124.1 caspase-7 isoform delta [Homo sapiens].
mdcvgwppgrkwhlekntscggssgicasyvtqmaddqgcieeqgvedsanedsvdakpd
rssfvpslfskkkknvtmrsikttrdrvptyqynmnfeklgkciiinnknfdkvtgmgvr
ngtdkdaealfkcfrslgfdvivyndcscakmqdllkkaseedhtnaacfacillshgee
nviygkdgvtpikdltahfrgdrcktllekpklffiqacrgtelddgiqadsgpindtda
nprykipveadflfaystvpgyyswrspgrgswfvqalcsileehgkdleimqiltrvnd
rvarhfesqsddphfhekkqipcvvsmltkelyfsq
>XP_508043 XP_508043.2 PREDICTED: similar to Lice2 beta cysteine protease isoform 2 [Pan troglodytes].
mdcvgwppgrkwhlekntscggssgicasyvtqmaddqgcieeqgvedsanedsvdakpd
rssfvpslfskkkknvtmrsikttrdrvptyqynmnfeklgkciiinnknfdkvtgmgvr
ngtdkdaealfkcfrslgfdvivyndcscakmqdllkkaseedhtnaacfacillshgee
nviygkdgvtpikdltahfrgdrcktllekpklffiqacrgtelddgiqadsgpindtda
nprykipveadflfaystvpgyyswrspgrgswfvqalcsileehgkdleimqiltrvnd
rvarhfesqsddprfhekkqipcvvsmltkelyfsq
(省略）

q

$ clustalw CASP7.fasta

$ ls
CASP7.aln CASP7.dnd CASP7.fasta CASP7.txt nm_000546.fasta nm_033338.fasta

$ less CASP7.aln

CLUSTAL 2.0.10 multiple sequence alignment

XP_421764         ----------------------MALSTRQNSNGQAC---------------------IMS
NP_001018443      ----------------------MCRVDKEALTSENE---------------------VLE
NP_031637         ----------------------------------------------------------MT
NP_071596         ----------------------------------------------------------MT
NP_203124         ----------MDCVGWPPGR--KWHLEKNTSCGGSSGICAS-------------YVTQMA
XP_508043         ----------MDCVGWPPGR--KWHLEKNTSCGGSSGICAS-------------YVTQMA
XP_544026         MLTSWAPPLKLDSFSFPPPEVLTCALGHNVFCQVDAFLFSDSENIPGYQCSGLFFSFQMA
XP_604643         ----------------------------------------------------------MA
NP_524551         ------MDATNNGESADQ-VGIRVGNPEQPNDHTDA---------------------LGS
NP_476974         --------MTDECVTRNYGVGIRSPNGSENRGSFIM---------------------ADN
XP_316795         -------------MSMPDVVDIRNEINKEHVRESLD---------------------RNG

(省略）

$ R

library(ape)
library(seqinr)

png("110206_casp7.png")
casp7 <- read.alignment(file="CASP7.aln", format="clustal")
plot(nj(dist.alignment(casp7)), type="unrooted")
dev.off()

Bioinformaticsのお勉強

EMBOSSのseqretを使いこなして、GENBANKからhomologの配列をまとめてごっそりとってくる！！

自己紹介

過去のブログ♪♪♪