Studying Bioinformatics

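I work in the medical field. I am interested in the mechanisms that give rise to the preciousness and beauty of life. Wanting to answer such questions with the methods of science, I trained in basic biomedical research. In trying to understand life scientifically, I came to realize that processing large volumes of data of many kinds, genome data first among them, is indispensable. I also came to appreciate how deeply the life sciences are tied to other disciplines such as physics, mathematics, statistics and organic chemistry. For that reason this blog carries posts on a wide range of fields. My goal is to build systematic knowledge by writing down what I study each day and rereading it. Thank you for reading.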

open, read, write, close

Using open, read, write and close, the most basic elements of the UNIX API, we build a prototype of the "cat" command.

/* mycat.c */
/* concatenation */
#include <unistd.h>    /* read, write, close */
#include <sys/types.h> /* open */
#include <sys/stat.h>  /* open */
#include <fcntl.h>     /* open */
#include <stdio.h>     /* fprintf, perror */
#include <stdlib.h>    /* exit */

/* prototype of the function that opens a file, reads its contents, writes them out and finally closes it */
static void do_cat(const char *path);
/* prototype of the error-handling function */
static void die(const char *s);

int main(int argc, char *argv[])
{
    int i;

    /* report an error when no argument is given */
    if(argc < 2){
        fprintf(stderr, "%s: file name not given\n", argv[0]);
        exit(1);
    }

    for(i=1; i<argc; i++){
        do_cat(argv[i]);
    }
    /* normal termination */
    exit(0);
}

#define BUFFER_SIZE 2048

static void do_cat(const char *path)
{
    /* file descriptor */
    int fd;

    unsigned char buf[BUFFER_SIZE];
    ssize_t n; /* read returns ssize_t */

    /* on success, open creates a stream and returns its file descriptor */
    /* on failure, open returns -1 */
    /* open in read-only mode */
    fd = open(path, O_RDONLY);
    /* terminate the program if the file could not be opened */
    if(fd < 0) {die(path);}

    for(;;){
        /* read a byte sequence from the stream with descriptor fd */
        /* read returns the number of bytes read, 0 at end of file, and -1 on error */
        n = read(fd, buf, sizeof(buf));
        if(n<0){ die(path);}
        if(n==0){ break;}
        /* write writes n bytes from buf to the stream with the given file descriptor */
        /* (here standard output, i.e. STDOUT_FILENO) */
        /* on success it returns the number of bytes written, which can be fewer than n */
        if(write(STDOUT_FILENO, buf, (size_t)n) < 0){ die(path);}
    }
    if(close(fd) < 0){ die(path); }
}

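/* body of the error-handling function die */
static void die(const char *s)
{
    /* when a system call fails, an error code is stored in the global variable errno */
    /* perror reads it and prints the corresponding error message */
    perror(s);
    exit(1);
}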

/* end of mycat.c */
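One detail the listing above glosses over is that write may write fewer bytes than requested and still report success. The following is a minimal sketch of a retry loop, not part of the original mycat.c; the helper name write_all is my own assumption, and retrying on EINTR is one common choice rather than the only one.

```c
#include <errno.h>   /* errno, EINTR */
#include <unistd.h>  /* write, ssize_t */

/* Hypothetical helper (not in the original post): keep calling write()
 * until all len bytes of buf have been written to fd.
 * Returns 0 on success, -1 on error (errno is left as set by write). */
static int write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t w = write(fd, p + done, len - done);
        if (w < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted by a signal: retry */
            }
            return -1;          /* real error: let the caller report it */
        }
        done += (size_t)w;      /* a short write just advances the offset */
    }
    return 0;
}
```

In do_cat, the write call could then be replaced by something like `if(write_all(STDOUT_FILENO, buf, (size_t)n) < 0){ die(path);}`.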

# Compile and run it.
$ gcc -o mycat mycat.c

# Create two test files.
$ echo "I love you" > test1
$ echo "She loves you" > test2

# Run

$ ./mycat test1
I love you

$ ./mycat test2
She loves you

$ ./mycat test1 test2
I love you
She loves you

$ ./mycat test*
I love you
She loves you

# Test the error output when no argument is given

$ ./mycat

./mycat: file name not given

# Give the name of a file that does not exist

$ ./mycat TEST

TEST: No such file or directory
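The message printed for the missing file comes from the error code that the failed open leaves in errno. As a rough sketch of that mechanism (the helper name open_error_code is hypothetical and does not appear in mycat.c), the code can be captured directly:

```c
#include <errno.h>   /* errno */
#include <fcntl.h>   /* open, O_RDONLY */
#include <unistd.h>  /* close */

/* Hypothetical helper: try to open path read-only and return the errno
 * value of the failure, or 0 if the open succeeded. */
static int open_error_code(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        return errno;   /* e.g. ENOENT when the file does not exist */
    }
    close(fd);          /* opened fine: nothing to report */
    return 0;
}
```

Passing the returned code to strerror (declared in string.h) gives the same kind of text that perror appends after the file name.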

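Is the mean systolic blood pressure of a population (one sample) higher than 130 mmHg!?

Comparing the mean systolic blood pressure of a population whose population standard deviation is unknown with the adult mean blood pressure

We randomly sample 27 men in their twenties living in a certain area and measure their systolic blood pressure.

From this sample, we want to test whether the mean systolic blood pressure of men in their twenties living in this area exceeds 130 mmHg, which is generally taken as the upper limit of normal.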

# Start R

$ R

# Create an object BP (Blood Pressure) holding the 27 measurements

BP <-c(134,123,145,134,130,124,156,128,129,

118,120,142,139,127,134,145,132,132,128,139,132,134,145,120,125,129,136)

# Visualize with a scatter plot, dot plot, histogram and boxplot

png("120527_blood_pressure.png")

par(mfrow=c(2,2))

plot(BP, main="Blood Pressure", ylab="Blood Pressure(mmHg)", xlab="sample ID")

abline(a=130,b=0)

stripchart(BP, method="stack", pch=1,xlab="Blood pressure(mmHg)", main="Blood pressure")

hist(BP, main="Blood Pressure", xlab="Blood Pressure(mmHg)", ylab="Frequency")

boxplot(BP, main="Blood Pressure", ylab="Blood Pressure(mmHg)")

dev.off()

# Look at the help for the t.test function here

help(t.test)

/* excerpt follows */

Description:

Performs one and two sample t-tests on vectors of data.

Usage:

t.test(x, y = NULL,

alternative = c("two.sided", "less", "greater"),

mu = 0, paired = FALSE, var.equal = FALSE,

conf.level = 0.95, ...)

Arguments:

x: a (non-empty) numeric vector of data values.

y: an optional (non-empty) numeric vector of data values.

alternative: a character string specifying the alternative hypothesis,

must be one of ‘"two.sided"’ (default), ‘"greater"’ or

‘"less"’. You can specify just the initial letter.

mu: a number indicating the true value of the mean (or difference

in means if you are performing a two sample test).

paired: a logical indicating whether you want a paired t-test.

var.equal: a logical variable indicating whether to treat the two

variances as being equal. If ‘TRUE’ then the pooled

variance is used to estimate the variance otherwise the Welch

(or Satterthwaite) approximation to the degrees of freedom is

used.

conf.level: confidence level of the interval.

/* end of excerpt */

# Run the one-sample t-test

t.test(BP, mu=130, alternative="greater")

/* result */

One Sample t-test

data: BP

t = 1.5133, df = 26, p-value = 0.07114

alternative hypothesis: true mean is greater than 130

95 percent confidence interval:

129.6704 Inf

sample estimates:

mean of x

132.5926

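/* end of output */

Null hypothesis H0: μ = 130 (the population mean is 130)

Alternative hypothesis H1: μ > 130 (the population mean is greater than 130)

Significance level: α = 0.05

The test was carried out under these settings, with the result

p-value = 0.07114 > significance level = 0.05

so the null hypothesis H0 is not rejected, and we cannot conclude that the population mean is greater than 130. (This does not positively establish that the population mean equals 130, either.)

The confidence interval is reported as

95 percent confidence interval: 129.6704 Inf

which means that the one-sided 95% confidence interval for the population mean has a lower bound of 129.6704 and no upper bound (infinity).

Running the two-sided test gives: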

t.test(BP, mu=130, alternative="two.sided")

One Sample t-test

data: BP

t = 1.5133, df = 26, p-value = 0.1423

alternative hypothesis: true mean is not equal to 130

95 percent confidence interval:

129.0710 136.1142

sample estimates:

mean of x

132.5926

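Again the p-value exceeds 0.05, so we conclude that the population mean cannot be said to be either lower or higher than 130.

The 95% confidence interval runs

from 129.0710 to 136.1142.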

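As for where t = 1.5133 comes from,

the t value is calculated with the following formula:

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

\bar{x} (x bar): sample mean

\mu_0: the specific value to compare against

s: standard deviation (computed from the unbiased variance)

n: sample size

Computing the t value from this definition gives the following R script.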

t.value <- (mean(BP)-130 ) / ( sd(BP) / sqrt(length(BP)) )

t.value

[1] 1.513264

In the t distribution with length(BP) - 1 = 26 degrees of freedom,

the two-sided p-value, i.e. the probability that |t| exceeds 1.513264, is computed as follows.

(1-pt(1.513264, length(BP)-1))*2

[1] 0.1422746

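This matches the value computed by the t.test function for the two-sided test. Without the factor of 2, the one-sided probability 0.0711 likewise matches the p-value 0.07114 of the one-sided test.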

For the equations uploaded to the blog, I used Web Equation to turn handwritten input into TeX code

and then converted it to a gif file with CODECOGS.

Web Equation (http://webdemo.visionobjects.com/equation.html)

CODECOGS (http://www.codecogs.com/latex/eqneditor.php)

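Is a string literal a pointer!?

"abcd\0" appears to behave like a pointer holding the address of its first character, a. (Strictly speaking, a string literal is an array of char that is converted to a pointer to its first element in most expressions.)

Below is a program that persistently compares string literals and pointers side by side.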

/* pointer_study.c */
#include <stdio.h> /* printf */

int main(void)
{
    const char *ptr;

    // assign the address of "abcd" to ptr
    ptr = "abcd\0";

    // the following line is a bug: it would assign a pointer to a single char
    //*ptr = "abcd";

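    // print the address that the literal "abcd\0" evaluates to
    printf("%p\n", (void *)"abcd\0");
    // print the address stored in the pointer ptr
    printf("%p\n\n", (void *)ptr);

    // print the string that the literal "abcd\0" refers to
    printf("%s\n", "abcd\0");
    // print the string that the pointer ptr refers to
    printf("%s\n\n", ptr);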

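    /* the following would be a compile error: a string literal is not a
       modifiable lvalue, so ++ cannot be applied to it

    // print the characters of the literal "abcd\0" one at a time
    while(*"abcd" != '\0'){
        printf("%c", *"abcd\0");
        "abcd\0"++;
    }
    */

    printf("%s\n", ptr);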

    // using a counter i avoids the error
    int i=0;
    while(*("abcd\0"+i) != '\0'){
        printf("%c", *("abcd\0"+i));
        i++;
    }
    printf("%c\n",*("abcd\0"+i));

    // print the characters of the string ptr refers to, one at a time
    while(*ptr != '\0'){
        printf("%c", *ptr);
        ptr++;
    }
    printf("%c\n\n", *ptr);

    // using the counter i, print the address of each character of the literal "abcd\0"
    i=0;
    while(*("abcd\0"+i) != '\0'){
        printf("%p\n", (void *)("abcd\0"+i));
        i++;
    }
    printf("%p\n\n", (void *)("abcd\0"+i));

    // print the address held by ptr for each character, one at a time
    ptr ="abcd\0";
    while(*ptr != '\0'){
        printf("%p\n", (void *)ptr);
        ptr++;
    }
    printf("%p\n\n", (void *)ptr);

    return 0;
}

/* end of pointer_study.c */

# Compile and run!!

$ ./pointer_study

0x8048680

abcd

0x8048680

0x8048681

0x8048682

0x8048683

0x8048684

0x8048680

0x8048681

0x8048682

0x8048683

0x8048684
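One way to see that a string literal is really an array, and only behaves like a pointer in most expressions, is to apply sizeof to it. The following is a small sketch with function names of my own choosing; none of them appear in pointer_study.c.

```c
#include <stddef.h>  /* size_t */

/* sizeof sees the literal as an array, terminating '\0' included. */
static size_t literal_bytes(void)
{
    return sizeof("abcd");      /* array of 5 chars: a b c d \0 */
}

/* An explicit "\0" adds one more element before the implicit terminator. */
static size_t literal_bytes_extra(void)
{
    return sizeof("abcd\0");    /* array of 6 chars */
}

/* An array initialised from a literal is a writable copy; writing through
 * a pointer to the literal itself would be undefined behaviour. */
static char first_after_edit(void)
{
    char copy[] = "abcd";

    copy[0] = 'A';
    return copy[0];
}
```

If the literal were a plain pointer, both sizeof expressions would yield the pointer size rather than 5 and 6.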

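Arguments of the main function

In the ANSI C standard, main takes either zero or two arguments.
In the former case it is declared as
int main(void);
and in the latter case as
int main(int argc, char *argv[]);
The names argc and argv are said to stand for
argc : argument count
argv : argument vector
argc holds the number of arguments passed when the program is run, counting the program name itself (argv[0]).
argv is an array of pointers to the argument strings passed when the program is run.

The following program prints the string (each argument) that each element of argv[] refers to.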

/* argc_argv.c */

#include <stdio.h> /* printf */

/* argc = argument count, argv = argument vector */
int main(int argc, char *argv[])
{
    int i;
    for(i=0;i<argc;i++){
        /* print the strings referred to by argv[] one at a time */
        printf("argv[%d] = %s\n", i, argv[i]);
    }
    return 0;
}

/* end of argc_argv.c */

# Example run

$ ./argc_argv We are the world
argv[0] = ./argc_argv
argv[1] = We
argv[2] = are
argv[3] = the
argv[4] = world

Since argc and argv are just parameter names, any identifiers can be used in their place.
Below is an example rewritten with argc -> a and argv -> b.

/* argc_argv2.c */
#include <stdio.h> /* printf */

/* a plays the role of argc, b the role of argv */
int main(int a, char *b[])
{
    int i;
    for(i=0;i<a;i++){
        /* print the strings referred to by b[] one at a time */
        printf("b[%d] = %s\n", i, b[i]);
    }
    return 0;
}

# Example run

$ ./argc_argv2 Come Together
b[0] = ./argc_argv2
b[1] = Come
b[2] = Together
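The C standard also guarantees that argv[argc] is a null pointer, so the argument vector can be walked without looking at argc at all. Here is a minimal sketch of that idea; the helper name count_args is my own and not from the programs above.

```c
#include <stddef.h>  /* NULL */

/* Count the entries of an argv-style vector by walking to the NULL
 * terminator; for the argv passed to main the result equals argc. */
static int count_args(char *const *argv)
{
    int n = 0;

    while (argv[n] != NULL) {
        n++;
    }
    return n;
}
```

Inside main, `count_args(b)` would return the same number as `a` in argc_argv2.c.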

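Fast file search with the locate command

The locate command finds the absolute paths of files quickly by searching a database of file names prepared in advance (by updatedb) instead of scanning the file system itself.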

$ man -a locate

#################################################################

locate(1) locate(1)

NAME
locate - find files by name

SYNOPSIS
locate [OPTION]... PATTERN...

DESCRIPTION
locate reads one or more databases prepared by updatedb(8) and writes file names matching at least
one of the PATTERNs to standard output, one per line.

If --regex is not specified, PATTERNs can contain globbing characters. If any PATTERN contains no
globbing characters, locate behaves as if the pattern were *PATTERN*.

By default, locate does not check whether files found in database still exist. locate can never
report files created after the most recent update of the relevant database.

#################################################################

# Look for coreutils-8.4.tar.gz, which I put somewhere a while ago
$ locate coreutils-8.4.tar.gz
/home/kappa/Documents/1203/0301/coreutils-8.5/lib/coreutils-8.4.tar.gz

# Even with just coreutils-8.4.ta, locate supplies the * on both sides and finds it
$ locate coreutils-8.4.ta
/home/kappa/Documents/1203/0301/coreutils-8.5/lib/coreutils-8.4.tar.gz

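# Supplying the wildcard ourselves seems to be unwelcome: once the pattern contains a globbing character, locate no longer wraps it in *...*, so it has to match the whole path.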

$ locate coreutils-8.4.ta*

$

The locate command is fast and convenient.
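The empty result for coreutils-8.4.ta* can be imitated with the POSIX function fnmatch, which also matches a glob pattern against a whole string. This is only a sketch of the matching rule described in the man page excerpt; whether locate itself calls fnmatch internally is an assumption on my part, and the helper name glob_matches is hypothetical.

```c
#include <fnmatch.h>  /* fnmatch (POSIX) */

/* Returns 1 if the glob pattern matches the WHOLE path, 0 otherwise. */
static int glob_matches(const char *pattern, const char *path)
{
    return fnmatch(pattern, path, 0) == 0;
}
```

With the path from the example, the pattern "coreutils-8.4.ta*" does not match because the path starts with /home/..., while "*coreutils-8.4.ta*" does. Quoting the pattern on the command line (for example `locate '*coreutils-8.4.ta*'`) also keeps the shell from expanding the * first.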
