EMBOSSを用いて塩基配列をNCBIのRefSeqデータベースから取得してみる。
$ wossname
Finds programs by keywords in their short description
Text to search for, or blank to list all programs: read #EMBOSSに搭載されているreadという単語に関連のあるプログラムを検索。
SEARCH FOR 'READ'
aligncopy Reads and writes alignments
aligncopypair Reads and writes pairs from alignments
featcopy Reads and writes a feature table
featreport Reads and writes a feature table
getorf Finds and extracts open reading frames (ORFs)
nthseqset Reads and writes (returns) one set of sequences from many
plotorf Plot potential open reading frames in a nucleotide sequence
seqret Reads and writes (returns) sequences
seqretsetall Reads and writes (returns) many sets of sequences
seqretsplit Reads sequences and writes them to individual files
skipseq Reads and writes (returns) sequences, skipping first few
#seqretは、配列を読み込んで結果を書き出すって書いてある!!これが使えそう!!
$ wossname
Finds programs by keywords in their short description
Text to search for, or blank to list all programs: database #次にデータベースに関わるプログラムを検索
SEARCH FOR 'DATABASE'
aaindexextract Extract amino acid property data from AAINDEX
cutgextract Extract codon usage tables from from CUTG database
dbiblast Index a BLAST database
dbifasta Index a fasta file database
dbiflat Index a flat file database
dbigcg Index a GCG formatted database
dbxfasta Index a fasta file database using b+tree indices
dbxflat Index a flat file database using b+tree indices
dbxgcg Index a GCG formatted database using b+tree indices
entret Retrieves sequence entries from flatfile databases and files
jaspextract Extract data from JASPAR
patmatmotifs Scan a protein sequence with motifs from the PROSITE database
printsextract Extract data from PRINTS database for use by pscan
prosextract Processes the PROSITE motif database for use by patmatmotifs
rebaseextract Process the REBASE database for use by restriction enzyme applications
redata Retrieve information from REBASE restriction enzyme database
showdb Displays information on configured databases
stssearch Search a DNA database for matches with a set of STS primers
tfextract Process TRANSFAC transcription factor database for use by tfscan
whichdb Search all sequence databases for an entry and retrieve it
#showdbでデータベースの情報が表示できるらしい
$ showdb
Displays information on configured databases
# Name Type ID Qry All Comment
# ============ ==== == === === =======
SpTrEMBL P OK - - Contains the translations of all coding sequences present in the EMBL sequence database not yet integrated in SWISS-PROT
pir P OK - - Protein Identification Resource.
refseqp P OK - - Database of protein information from REFSEQ
swall P OK - - A combined database of Swiss-Prot, SPTREMBL and TREMBLNEW. Does not contain REMTREMBL.
swissprot P OK - - Database of protein sequences by SIB and EBI.
tpir P OK OK OK PIR using NBRF access for 4 files
tsw P OK OK OK Swissprot native format with EMBL CD-ROM index
tswnew P OK OK OK Swissnew as 3 files in native format with EMBL CD-ROM index
twp P OK OK OK EMBL new in native format with EMBL CD-ROM index
DDBJNEW N OK - - DDBJNEW IDs
DDBJRELEASE N OK - - DDBJRELEASE IDs
GENBANK N OK - - NCBI IDs
embl N OK - - The EMBL nucleotide sequence database constitutes Europes primary nucleotide sequence resource. The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ).
refseq N OK - - Database providing non-redundant curated data representing knowledge of known genes
tembl N OK OK OK EMBL in native format with EMBL CD-ROM index
tgb N OK - - Genbank IDs
tgenbank N OK OK OK GenBank in native format with EMBL CD-ROM index
#これは面白い!!!今現在の環境で、SpTrEMBL - tgenbankまでのデータベースが使うことが可能みたいだ!!
#ここで、GENBANKとrefseqの違いについて井上 潤先生の記事を引用します
(http://www.geocities.jp/ancientfishtree/RefSeq.html)
ーーーーーーここからーーーーーーーーーー
RefSeq は,Reference Sequenceの略で、配列解析に "reference"(リファレンス)となるべき配列データベースのことです.NCBI のスタッフが,最も代表としてふさわしい (参照の基準となる) 遺伝子配列をGenBank などのデータベースから目で見て選んで,RefSeq データベースを作成しています
GenBank 研究者自身が投稿 同じ遺伝視座から複数のレコードがある あらゆる生物 (250,000 種)
RefSeq NCBIが既存のデータから作成 主な生物から一つのレコードに限られている モデル生物 (4000 種)
ーーーーーここまでーーーーーーーーーーー
GenBankはとにかく内容が盛りだくさんで、逆にノイズが多い。refseqはNCBIのスタッフが厳選した、”reference”になるもののようだ。
$ seqret #seqretと入力するのが合言葉
Reads and writes (returns) sequences
Input (gapped) sequence(s): refseq:NM_000546 #refseqのNM_000546を取ってきてっていう思いを込める。
output sequence(s) [nm_000546.fasta]: #Enterを押す!!
$ ls
nm_000546.fasta
$ cat nm_000546.fasta #nm_000546.fastaの中身を見てみよう!!
>NM_000546 NM_000546.4 Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA.
gattggggttttcccctcccatgtgctcaagactggcgctaaaagttttgagcttctcaa
aagtctagagccaccgtccagggagcaggtagctgctgggctccggggacactttgcgtt
cgggctgggagcgtgctttccacgacggtgacacgcttccctggattggcagccagactg
ccttccgggtcactgccatggaggagccgcagtcagatcctagcgtcgagccccctctga
gtcaggaaacattttcagacctatggaaactacttcctgaaaacaacgttctgtccccct
tgccgtcccaagcaatggatgatttgatgctgtccccggacgatattgaacaatggttca
ctgaagacccaggtccagatgaagctcccagaatgccagaggctgctccccccgtggccc
ctgcaccagcagctcctacaccggcggcccctgcaccagccccctcctggcccctgtcat
cttctgtcccttcccagaaaacctaccagggcagctacggtttccgtctgggcttcttgc
attctgggacagccaagtctgtgacttgcacgtactcccctgccctcaacaagatgtttt
gccaactggccaagacctgccctgtgcagctgtgggttgattccacacccccgcccggca
cccgcgtccgcgccatggccatctacaagcagtcacagcacatgacggaggttgtgaggc
gctgcccccaccatgagcgctgctcagatagcgatggtctggcccctcctcagcatctta
tccgagtggaaggaaatttgcgtgtggagtatttggatgacagaaacacttttcgacata
gtgtggtggtgccctatgagccgcctgaggttggctctgactgtaccaccatccactaca
actacatgtgtaacagttcctgcatgggcggcatgaaccggaggcccatcctcaccatca
tcacactggaagactccagtggtaatctactgggacggaacagctttgaggtgcgtgttt
gtgcctgtcctgggagagaccggcgcacagaggaagagaatctccgcaagaaaggggagc
ctcaccacgagctgcccccagggagcactaagcgagcactgcccaacaacaccagctcct
ctccccagccaaagaagaaaccactggatggagaatatttcacccttcagatccgtgggc
gtgagcgcttcgagatgttccgagagctgaatgaggccttggaactcaaggatgcccagg
ctgggaaggagccaggggggagcagggctcactccagccacctgaagtccaaaaagggtc
agtctacctcccgccataaaaaactcatgttcaagacagaagggcctgactcagactgac
attctccacttcttgttccccactgacagcctcccacccccatctctccctcccctgcca
ttttgggttttgggtctttgaacccttgcttgcaataggtgtgcgtcagaagcacccagg
acttccatttgctttgtcccggggctccactgaacaagttggcctgcactggtgttttgt
tgtggggaggaggatggggagtaggacataccagcttagattttaaggtttttactgtga
gggatgtttgggagatgtaagaaatgttcttgcagttaagggttagtttacaatcagcca
cattctaggtaggggcccacttcaccgtactaaccagggaagctgtccctcactgttgaa
ttttctctaacttcaaggcccatatctgtgaaatgctggcatttgcacctacctcacaga
gtgcattgtgagggttaatgaaataatgtacatctggccttgaaaccaccttttattaca
tggggtctagaacttgacccccttgagggtgcttgttccctctccctgttggtcggtggg
ttggtagtttctacagttgggcagctggttaggtagagggagttgtcaagtctctgctgg
cccagccaaaccctgtctgacaacctcttggtgaaccttagtacctaaaaggaaatctca
ccccatcccacaccctggaggatttcatctcttgtatatgatgatctggatccaccaaga
cttgttttatgctcagggtcaatttcttttttcttttttttttttttttttctttttctt
tgagactgggtctcgctttgttgcccaggctggagtggagtggcgtgatcttggcttact
gcagcctttgcctccccggctcgagcagtcctgcctcagcctccggagtagctgggacca
caggttcatgccaccatggccagccaacttttgcatgttttgtagagatggggtctcaca
gtgttgcccaggctggtctcaaactcctgggctcaggcgatccacctgtctcagcctccc
agagtgctgggattacaattgtgagccaccacgtccagctggaagggtcaacatctttta
cattctgcaagcacatctgcattttcaccccacccttcccctccttctccctttttatat
cccatttttatatcgatctcttattttacaataaaactttgctgccacctgtgtgtctga
ggggtg
#あとは、マルティプルアライメントを視野に入れて、複数の配列を一挙に読み込んで、一つのテキストファイルに書き出す方法を行ってみる。
#今回はアポトーシスに重要なカスパーゼの7についてHomologeneで引っかかったホモログをrefseqからアミノ配列で並べてみる。
$ vim CASP7.txt 3 vim CAASP7.txtというファイルをvエディタで作成しますよ。
a
GENBANK:NP_203124
GENBANK:XP_508043
GENBANK:XP_544026
GENBANK:XP_604643
GENBANK:NP_031637
GENBANK:NP_071596
GENBANK:XP_421764
GENBANK:NP_001018443
GENBANK:NP_524551
GENBANK:NP_476974
GENBANK:XP_316795
Esc
:wq #上書き保存して終了
#refseqはNMはうまくできたが、NPができなかった。
$ ls
CASP7.txt nm_000546.fasta
$ seqret @CASP7.txt -outseq=CASP7.fasta
$
>NP_203124 NP_203124.1 caspase-7 isoform delta [Homo sapiens].
mdcvgwppgrkwhlekntscggssgicasyvtqmaddqgcieeqgvedsanedsvdakpd
rssfvpslfskkkknvtmrsikttrdrvptyqynmnfeklgkciiinnknfdkvtgmgvr
ngtdkdaealfkcfrslgfdvivyndcscakmqdllkkaseedhtnaacfacillshgee
nviygkdgvtpikdltahfrgdrcktllekpklffiqacrgtelddgiqadsgpindtda
nprykipveadflfaystvpgyyswrspgrgswfvqalcsileehgkdleimqiltrvnd
rvarhfesqsddphfhekkqipcvvsmltkelyfsq
>XP_508043 XP_508043.2 PREDICTED: similar to Lice2 beta cysteine protease isoform 2 [Pan troglodytes].
mdcvgwppgrkwhlekntscggssgicasyvtqmaddqgcieeqgvedsanedsvdakpd
rssfvpslfskkkknvtmrsikttrdrvptyqynmnfeklgkciiinnknfdkvtgmgvr
ngtdkdaealfkcfrslgfdvivyndcscakmqdllkkaseedhtnaacfacillshgee
nviygkdgvtpikdltahfrgdrcktllekpklffiqacrgtelddgiqadsgpindtda
nprykipveadflfaystvpgyyswrspgrgswfvqalcsileehgkdleimqiltrvnd
rvarhfesqsddprfhekkqipcvvsmltkelyfsq
(省略)
q
$ clustalw CASP7.fasta
$ ls
CASP7.aln CASP7.dnd CASP7.fasta CASP7.txt nm_000546.fasta nm_033338.fasta
$ less CASP7.aln
CLUSTAL 2.0.10 multiple sequence alignment
XP_421764 ----------------------MALSTRQNSNGQAC---------------------IMS
NP_001018443 ----------------------MCRVDKEALTSENE---------------------VLE
NP_031637 ----------------------------------------------------------MT
NP_071596 ----------------------------------------------------------MT
NP_203124 ----------MDCVGWPPGR--KWHLEKNTSCGGSSGICAS-------------YVTQMA
XP_508043 ----------MDCVGWPPGR--KWHLEKNTSCGGSSGICAS-------------YVTQMA
XP_544026 MLTSWAPPLKLDSFSFPPPEVLTCALGHNVFCQVDAFLFSDSENIPGYQCSGLFFSFQMA
XP_604643 ----------------------------------------------------------MA
NP_524551 ------MDATNNGESADQ-VGIRVGNPEQPNDHTDA---------------------LGS
NP_476974 --------MTDECVTRNYGVGIRSPNGSENRGSFIM---------------------ADN
XP_316795 -------------MSMPDVVDIRNEINKEHVRESLD---------------------RNG
(省略)
$ R
library(ape)
library(seqinr)
png("110206_casp7.png")
casp7 <- read.alignment(file="CASP7.aln", format="clustal")
plot(nj(dist.alignment(casp7)), type="unrooted")
dev.off()