Fetching per-species RefSeq mRNA from the NCBI FTP server, then building a database for local BLAST (not human)

# Connect to the NCBI FTP server (log in as anonymous)
$ ftp ftp.ncbi.nlm.nih.gov

> cd ./refseq

> ls

dr-xr-xr-x   3 ftp      anonymous     4096 Mar 10  2008 B_taurus
dr-xr-xr-x   3 ftp      anonymous     4096 Feb 19  2004 D_rerio
-r--r--r--   1 ftp      anonymous    10585 Feb 28  2005 FTP_CHANGE_NOTICE
dr-xr-xr-x   5 ftp      anonymous     4096 Mar 13  2008 H_sapiens
dr-xr-xr-x   3 ftp      anonymous     4096 Apr 27  2007 LocusLink
dr-xr-xr-x   3 ftp      anonymous     4096 Dec 20  2001 M_musculus
-r--r--r--   1 ftp      anonymous     9692 Jun 11  2010 README
dr-xr-xr-x   3 ftp      anonymous     4096 Jun 23  2000 R_norvegicus
dr-xr-xr-x   3 ftp      anonymous     4096 Mar 10  2008 X_tropicalis
dr-xr-xr-x   3 ftp      anonymous    28672 Jul 29 06:19 daily
dr-xr-xr-x  17 ftp      anonymous     4096 Jul  8 13:18 release
dr-xr-xr-x   3 ftp      anonymous    86016 Jul 29 07:00 removed
dr-xr-xr-x   5 ftp      anonymous     8192 Jul 27 03:06 special_requests
dr-xr-xr-x   2 ftp      anonymous     4096 Oct  2  2007 uniprotkb
dr-xr-xr-x   3 ftp      anonymous    98304 Jul 28 05:58 wgs

> cd D_rerio

> ls

-r--r--r--   1 ftp      anonymous     2762 Mar 10  2008 README
dr-xr-xr-x   3 ftp      anonymous     4096 Jul 25 15:29 mRNA_Prot

> cd mRNA_Prot

> ls

dr-xr-xr-x   2 ftp      anonymous     4096 Jul 12  2007 tmpold
-r--r--r--   1 ftp      anonymous  8947160 Jul 25 15:26 zebrafish.protein.faa.gz
-r--r--r--   1 ftp      anonymous 22553087 Jul 25 15:26 zebrafish.protein.gpff.gz
-r--r--r--   1 ftp      anonymous 18512295 Jul 25 15:26 zebrafish.rna.fna.gz
-r--r--r--   1 ftp      anonymous 53048114 Jul 25 15:26 zebrafish.rna.gbff.gz

> get zebrafish.rna.fna.gz

# Leave the FTP session
> exit
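
The same file can also be fetched without an interactive session. The following is a minimal sketch, assuming the directory layout shown in the listing above is still current on the server; refseq_url is a hypothetical helper name, not an NCBI tool.

```shell
# Hypothetical helper: build the download URL from the directory names
# seen in the FTP listing (species directory, then mRNA_Prot).
refseq_url() {
    # $1 = species directory (e.g. D_rerio), $2 = file name
    printf 'ftp://ftp.ncbi.nlm.nih.gov/refseq/%s/mRNA_Prot/%s\n' "$1" "$2"
}

# Print the URL first to check it.
refseq_url D_rerio zebrafish.rna.fna.gz

# Then download with either tool (uncomment one):
# wget "$(refseq_url D_rerio zebrafish.rna.fna.gz)"
# curl -O "$(refseq_url D_rerio zebrafish.rna.fna.gz)"
```

Swapping D_rerio and the file name for another species directory from the listing gives the URL for that species, provided its directory follows the same layout.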

# Decompress the .gz file
$ gunzip zebrafish.rna.fna.gz

# Check the contents of the decompressed file

$ less zebrafish.rna.fna


>gi|68369925|ref|XM_696518.1| PREDICTED: Danio rerio si:rp71-1g13.2 (si:rp71-1g13.2), mRNA
ATGGACTCTTTTCAGAAAGAAATAGAGAAGTATGAAGTAGTGATAAGGTTTAAAGAAACAAACCAAGAAATTATAAAGAA
AGCAAACCCATTTGGGTTAACAACTAGCCTGGCAAATAAAATAGGACAGATAGAGTACGCAAAGATCCTTAATGATGGTA
ACCTACTAATAAGATGTGCTGACGCTGGGCAAATGGAAAAAGCCCTAAAAATTAAGGATGTGGTCAAATGTAAGGTGGAG
AATACAGCTAGGGTGGGAATGGGAAGGAAATGTGTAGCTAAAGGGGTGATCACAGGAGTATCATTAAGTATAACAGAAGA
AGAAATGAAAAAGAATATAAAAGGAGCAAAAGTAGTGAATGTTACAAGAATGAAAACAACTAGAGATGGAGAAGCTAAAG
ACAGTAAAACCGTGCTATTAGAATTCGATGAAGTGGTAGTGCCAAAGAAAGTATTTCTTGAATTTGTAAATTATCCAGTG
AGATTGTATGTACCAAAACCATTGAGGTGCTATAACTGCCAAAGATTTGACCACACAGCAAAAATCTGTAATAGGCAAAG
AAGGTGTGCAAGGTGTGGAGGGGATCATGATTACGAAAACTGTGGAGCAGGCGTTCAACCAAAATGTTGCAATTGCGGAG
GTGCTCACAATGTGGCATTCAGTGGATGTGAAGTCATGCAGAGGGAGACAAATATACAAAAGATAAGAGTGGAGAAAAAA
ATCACATACGCTGAAGCGGTTAAAGTGTCAAGAGAAAAGAAAACCAAAGAAAATGAAGTGGTTATGGATTCTCAACAGCA
AGACAATTCAGAGAAAATCTACGTCAAAAATAAAAGAACTAGTAACGTTTATAGCAGGTGTGATAAATAG
>gi|18859030|ref|NM_131477.1| Danio rerio major histocompatibility complex class II DEB gene (mhc2deb), mRNA
ATGTCTTTGCAAAACCTTTTTATTTTTCATCTCCTGTTGTTTCTATTTCCTGACGGGTATTATCACAGTAGGCTTACAAA
ATGCATCTTCCAGCTCCAGGATCTCAGTGACATAGGTGTTCATGATAATTATATCTTCAATAAAGATGTGTACATACGAT
TCAACAGCACTTTGGGGTACTTTGTTGGGTACACTGAACATGGAGTATATAATGCACAATTATGGAAGCAACGATACCAG
CTTCTCGAGCAAGAGAGAGCTCACGAGGATCGATTCTGCAAATACAATGCTGAGATTGACTACAACAACATTCTAGGAAA
AACAGTAAAACCACAGGTTAAGCTTAATTCAGTGAAGCAGGCTGGTGGCAGACAGCCAGCTGTGTTGGTGTGCAGTGCAT
ATGACTTCTATCCCAAAAGAATCAAAGTCACCTGGTTAAGAAATGGTAAACCAGTGACCACTGATGTAACCTCCACTGAG
GAGCTGGCTGATGGGGACTGGTACTACCAAATTCATTCCCACCTGGAATACACCCCCAAATCTGGAGAAAAGATTTCCTG
TATGGTGGATCATGCCAGCTCAACTGAACCCATCATCATAGCCTGGGATTCATCTCTCTCTGAGCCTGAGAGGAATAAAA
TTGCTATTGGAGCATCTGGTTTGGTGCTGGGAATCATCATTGCCACTGCTGGACTCATTTATTACAAGAAGAAATCATCA
GGTCAGTTTAAATAA
>gi|18859028|ref|NM_131706.1| Danio rerio major histocompatibility complex class II DCB gene (mhc2dcb), mRNA
ATGATTTTGTCTGCTTTATTGGAAAAAGTATGTGGAAATTACGGCTATCTTCAAAGTCAATGTCGAGTACTGAGCTCTAC
AAAGAAAGTTGAGCTCATCTTCTCATTCATCTTCAACAAGATTGAATACATTAGATACAACAGTACTGATCAGAAAATTG
TTGGCTACACTGAATTTGGAGAGAAATTTGTTGAAAACTATAAAAATAACACATTTGTGCTAGTCCTGGCTGAGTTTGGG
ATTTACAACTGCAAAAAAATTGCAAAGGCACTAATCTCTGATGGAATGCTGAATCATGTGACAGTGAAACCAGAAGTCAT
TATTCGGTCAGTTACTGAAGCTAAAGGCAATCAGAAAGCTGTCCTGGTGTGCAGTGAATATGACTTTTACCCCAAAGCCA
TTAAACTGACGTGGATGAGGAATGATAAAAGGGTTACAGCTGATGTGACGTCCATTGAGGAGATGGCTGATGGAGACTGG
TATTATCAGATTCACTCCCACCTGGAATATTTTCCTCAACCTGGAGAGAAGATCTCCTGTGTGGTGGATCATGCCAGCTT
CCATAAACCCATGATCTATTACTGGGATCCCTCTCTCCCCGAGACTGAAAGATCTAAGATCATTCTTGGGGCTGTGGGGC
TGCTGATGGGGATCTTTACAGCAGCTGCAGGAGTGATCTATTATAAAAGAAATCAAACAGGTTAG
>gi|18858320|ref|NM_131670.1| Danio rerio ATPase, Na+/K+ transporting, beta 3b polypeptide (atp1b3b), mRNA
TGGCAAGCCCGAGCCGACGCTTTCTTTGATTTGTCCTCATCCATCGCTCTCAAACTGGTTTATCTATCCTCTCCACACTA
TGGCCAACAAAGAGGAGAAAGCTGACGAGAAGCAGTCGAGTTGGAAAGATTTTATCTACAACCCGCGGACAGGGGAATTC
ATCGGGCGCACGGCGAGCAGTTGGGCTCTTATATTCCTCTTTTATTTGGTCTTCTATGGCTTTCTGGCGGGAATGTTCAC
GCTTACCATGTGGGTGATGCTACAGACACTGGATGACCATACTCCCAAATACAGGGACCGAGTGGCCAATCCAGGGCTGA
TGATCAGACCAAGGTCCTTGGATATTGCATTTAACCGGTCTATTCCTCAGCAATACAGCAAGTATGTGCAGCATCTGGAG


################# output continues at length ######################
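
Paging through the file with less shows the format, but not how many records it holds. A quick check is to count the header lines, since each FASTA record starts with ">". The sketch below uses a small made-up file (example.fna) so that it runs without the real download; count_fasta_records is a hypothetical helper name.

```shell
# Hypothetical helper: count FASTA records by counting lines that start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Made-up two-record file, only for illustration.
printf '>seq1 example\nACGTACGT\nACGT\n>seq2 example\nTTGACCA\n' > example.fna

count_fasta_records example.fna

# On the real data:
# count_fasta_records zebrafish.rna.fna
```

For the real file, the count should agree with the number of sequences that BLAST reports for the database built from it.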


# Install BLAST
$ sudo apt-get install blast

# Build the BLAST database (-p F: nucleotide input, -n: database base name)
$ formatdb -p F -i zebrafish.rna.fna -n zebrafish

$ ls
formatdb.log  zebrafish.nhr  zebrafish.nin  zebrafish.nsq  zebrafish.rna.fna
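
formatdb belongs to the legacy NCBI BLAST package. In the newer BLAST+ suite the corresponding tool is makeblastdb. The following is a minimal sketch, assuming BLAST+ is installed; build_nucl_db is a hypothetical wrapper name, and the set of files BLAST+ writes may differ from the listing above.

```shell
# Hypothetical wrapper: build a nucleotide BLAST database with BLAST+.
build_nucl_db() {
    # $1 = input FASTA file, $2 = database base name
    if ! command -v makeblastdb >/dev/null 2>&1; then
        echo "makeblastdb not found (is BLAST+ installed?)" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "input file not found: $1" >&2
        return 1
    fi
    # -dbtype nucl corresponds to formatdb's -p F; -out to formatdb's -n
    makeblastdb -in "$1" -dbtype nucl -out "$2"
}

# Usage on the file from this walkthrough:
# build_nucl_db zebrafish.rna.fna zebrafish
```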


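Next, as a positive control, place the tp53 mRNA sequence in FASTA format in the current directory under the name tp53.fasta. Replace the gene description on the first line of the file with some arbitrary text, keeping the leading ">".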
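
Rewriting the first line can be done in an editor, or scripted. The sketch below replaces the header line and leaves the sequence lines untouched; rename_fasta_header, example_in.fasta and example_out.fasta are hypothetical names, and the short sequence is a placeholder rather than real tp53 data.

```shell
# Hypothetical helper: write a copy of a FASTA file with a new header line.
rename_fasta_header() {
    # $1 = input FASTA, $2 = new label (without '>'), $3 = output FASTA
    { printf '>%s\n' "$2"; tail -n +2 "$1"; } > "$3"
}

# Placeholder input, only for illustration.
printf '>original description line\nACGTACGT\nTTGACCA\n' > example_in.fasta

rename_fasta_header example_in.fasta test example_out.fasta
head -n 1 example_out.fasta

# On the real data (input file name is an assumption):
# rename_fasta_header tp53_original.fasta test tp53.fasta
```

The helper assumes a single-record file, which is the case for one mRNA sequence saved on its own.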

Now run BLAST.

$ blastall -p blastn -i tp53.fasta  -d zebrafish -e 1e-90  | less
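
With BLAST+, blastall -p blastn corresponds to the blastn program. The following is a minimal sketch, assuming BLAST+ is installed and a database named zebrafish exists in the current directory; run_blastn is a hypothetical wrapper name. BLAST+ blastn defaults to the megablast task, so -task blastn is passed here to stay closer to the legacy search.

```shell
# Hypothetical wrapper: nucleotide search with BLAST+ blastn.
run_blastn() {
    # $1 = query FASTA, $2 = database name, $3 = E-value cutoff
    if ! command -v blastn >/dev/null 2>&1; then
        echo "blastn not found (is BLAST+ installed?)" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "query file not found: $1" >&2
        return 1
    fi
    # -query, -db and -evalue correspond to blastall's -i, -d and -e
    blastn -task blastn -query "$1" -db "$2" -evalue "$3"
}

# Usage matching the blastall call above:
# run_blastn tp53.fasta zebrafish 1e-90 | less
```

Results from the two suites may not match exactly, since their default parameters differ.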

Finally, here is part of the result.


############### output starts here ###########################
BLASTN 2.2.21 [Jun-14-2009]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= test
         (2105 letters)

Database: zebrafish
           28,321 sequences; 57,848,304 total letters

Searching..................................................done



                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

gi|18859502|ref|NM_131327.1| Danio rerio tumor protein p53 (tp53...  2218   0.0 

>gi|18859502|ref|NM_131327.1| Danio rerio tumor protein p53 (tp53),
            mRNA
          Length = 2105

 Score = 2218 bits (1119), Expect = 0.0
 Identities = 1173/1200 (97%)
 Strand = Plus / Plus

                                                                       
Query: 1    gtttagtggagaggaggtcggcaaaatcaattcttgcaaagcaatggcgcaaaacgacag 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1    gtttagtggagaggaggtcggcaaaatcaattcttgcaaagcaatggcgcaaaacgacag 60

                                                                       
Query: 61   ccaagagttcgcggagctctgggagaagaatttgattattcagcccccaggtggtggctc 120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 61   ccaagagttcgcggagctctgggagaagaatttgattattcagcccccaggtggtggctc 120

                                                                       
Query: 121  ttgctgggacatcattaatgatgaggagtacttgccgggatcgtttgaccccaannnnnn 180


################## output continues ##########################