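BLAST: a sequence alignment tool

xmubingo · posted 2011-3-2 19:41:53

BLAST stands for Basic Local Alignment Search Tool. It compares protein amino-acid sequences or DNA sequences against a database, searching for identical or similar sequences.

NCBI online service: http://blast.ncbi.nlm.nih.gov/Blast.cgi

xmubingo · posted 2011-9-1 10:07:06

blast+ is the newer version of BLAST, and its parameters differ considerably from those of the legacy blast. A Chinese-language explanation found online is attached.

xmubingo · posted 2011-9-12 15:01:57

An important feature of blast+ is that the output format can be customized: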

*** Formatting options
-outfmt <String>
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
Options 6, 7, and 10 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
When not provided, the default value is:
'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
evalue bitscore', which is equivalent to the keyword 'std'
Default = `0'


Version 2.2.24 does not offer the qlen and slen specifiers.
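With -outfmt 6 and no extra specifiers, each hit is one tab-separated line carrying the twelve 'std' columns listed in the help text. The following is a minimal Python sketch for reading such a line; the function name parse_std_line and the sample line are made up for illustration and are not part of BLAST.

```python
# Column names of the default 'std' tabular layout, in the order given by the help text.
STD_FIELDS = ("qseqid sseqid pident length mismatch gapopen "
              "qstart qend sstart send evalue bitscore").split()

INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_std_line(line):
    """Turn one line of -outfmt 6 output into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(STD_FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(STD_FIELDS), len(values)))
    hit = {}
    for name, value in zip(STD_FIELDS, values):
        if name in INT_FIELDS:
            hit[name] = int(value)
        elif name in FLOAT_FIELDS:
            hit[name] = float(value)
        else:
            hit[name] = value  # identifiers stay as strings
    return hit


# Invented sample line, only to show the shape of the data.
sample = "q1\ts1\t98.50\t200\t3\t0\t1\t200\t301\t500\t1e-100\t360"
hit = parse_std_line(sample)
print(hit["sseqid"], hit["pident"], hit["evalue"])
```

If a custom specifier list is passed to -outfmt, STD_FIELDS would have to be replaced by that list in the same order.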

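zouquan · posted 2011-10-13 19:20:24

Using BLAST locally:

Note: input and output file names must not contain spaces!

1. Run cmd and, at the DOS prompt, change into the bin folder of the BLAST installation.

2. Put the query sequence file (e.g. me.fasta) and the database file (e.g. human.fasta) into the bin folder as well. (Note: sequences must be in FASTA format.)

3. Preprocess the database file. (This step is required. Anyone familiar with how BLAST works will know that it builds a hash/index table over the database sequences, so that the search in the next step runs quickly without scanning the whole database.) Preprocessing uses the formatdb command, in the following form: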

formatdb -i ecoli.nt -p F -o T

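where -i gives the name of the source database to format

-p file type: nucleotide or protein sequence database

T - protein, F - nucleotide [T/F] Optional

default = T

-a the input database is in ASN.1 format (otherwise FASTA)

T - True, F - False. [T/F] Optional

default = F

-o parse option

T - True: parse the sequence identifiers and build an index

F - False: do not

[T/F] Optional default = F

If this step succeeds, 7 new files appear in the folder.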
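When the formatting step is scripted, the options above can be assembled programmatically. This is only a sketch: formatdb_command is a hypothetical helper name, it merely builds the argument list (nothing is executed), and the check for spaces mirrors the file-name warning given earlier in this post.

```python
def formatdb_command(fasta_name, is_protein, parse_ids=True):
    """Build the formatdb argument list described above (nothing is run here)."""
    if " " in fasta_name:
        # The post warns that file names must not contain spaces.
        raise ValueError("file names must not contain spaces: %r" % fasta_name)
    args = ["formatdb", "-i", fasta_name, "-p", "T" if is_protein else "F"]
    if parse_ids:
        args += ["-o", "T"]  # parse sequence identifiers and build an index
    return args


command = formatdb_command("ecoli.nt", is_protein=False)
print(" ".join(command))
```

To actually run it, the list could be handed to subprocess.run(command, check=True), assuming formatdb is on the PATH or the script is started from the bin folder.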

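4. To run the alignment, type the following at the command line:
blastall -p blastn -i myRNA.fasta -d humanRNA.fasta -o myresult.blastout -a 2 -F F -T T -e 1e-10 (note: lowercase e, not E)
Explanation:
blastall: the program name for running BLAST locally from the command line. (Tip: typing blastall and pressing Enter prints help for all parameters, in English.)
-p: short for program. It selects the sub-program to use, according to the kind of comparison needed: nucleotide against nucleotide, protein against protein, or translated nucleotide against protein. blastn compares nucleotide with nucleotide and blastp protein with protein; there are 5 sub-programs in total. To search a protein sequence against nucleotide sequences use tblastn; to search a nucleotide sequence against protein sequences use blastx.
-i: short for input, the file holding your own sequences to be aligned (FASTA format).
-d: short for database, the target database to search; in this example humanRNA.fasta.
-o: short for output, the name of the result file. Name it as you like; it may include a path (so may the -i and -d arguments above).

*Note: the four parameters above are the essential ones and should always be given. The parameters below are tunable for better results; if you leave them out, blastall falls back to its defaults.
-a: number of CPUs to use. My machine has two CPUs, so I use -a 2 to run in parallel and speed things up. If your computer has a single CPU you can omit this parameter; the default is 1.
-F: short for filter. blastall filters out simple repeats and low-complexity regions; the default is T. (Several of the following parameters take only two values, T/F: T means true, i.e. the feature is on; F means false, i.e. the feature is off.)
-T: stands for HTML, i.e. whether the result file is written as HTML. The default is F. If you want to view the result in IE, I suggest -T T.
-e: the Expectation value (E-value) threshold. The default is 10; I used 1e-10 (that is, 10^-10).

You can also add -m 8 (tabular output).

e.g.: blastall -p blastn -i myRNA.fasta -d humanRNA.fasta -o myresult.blastout -m 8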
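The choice of sub-program and the parameters explained above can be captured in a few lines of Python. This is a sketch under stated assumptions: blastall_command and the "nucl"/"prot" labels are invented for illustration, the function only builds the argument list, and tblastx (the fifth sub-program, which translates both sides) is left out because it shares the nucleotide/nucleotide pair with blastn and has to be requested explicitly.

```python
# Which sub-program fits which (query type, database type) pair.
# "nucl" and "prot" are labels chosen for this sketch, not blastall options.
PROGRAMS = {
    ("nucl", "nucl"): "blastn",
    ("prot", "prot"): "blastp",
    ("nucl", "prot"): "blastx",   # nucleotide query against a protein database
    ("prot", "nucl"): "tblastn",  # protein query against a nucleotide database
}


def blastall_command(query, db, out, query_type, db_type,
                     evalue=None, tabular=False, cpus=None):
    """Build a blastall argument list; nothing is executed here."""
    program = PROGRAMS[(query_type, db_type)]
    args = ["blastall", "-p", program, "-i", query, "-d", db, "-o", out]
    if cpus is not None:
        args += ["-a", str(cpus)]
    if evalue is not None:
        args += ["-e", evalue]      # passed as text, e.g. "1e-10"
    if tabular:
        args += ["-m", "8"]         # tabular output
    return args


cmd = blastall_command("myRNA.fasta", "humanRNA.fasta", "myresult.blastout",
                       "nucl", "nucl", evalue="1e-10", tabular=True, cpus=2)
print(" ".join(cmd))
```

Keeping the E-value as a string avoids any reformatting of the number before it reaches the command line.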

zouquan · posted 2011-10-13 22:17:57

Local BLAST also includes a companion program, BLASTClust, which clusters sequences.

BLASTClust accepts a number of parameters that can be used to control the stringency of clustering including thresholds for score density, percent identity, and alignment length. The BLASTClust program has a number of applications, the simplest of which is to create a non-redundant set of sequences from a source database. As an example, one might have a library of a few thousand short nucleotide sequence reads and wish to replace these with a non-redundant set. To produce the non-redundant set, one might use:

blastclust -i infile -o outfile -p F -L .9 -b T -S 95

The sequences in "infile" will be clustered and the results will be written to "outfile". The input sequences are identified as nucleotide (-p F); "-p T", or protein, is the default. To register a pairwise match two sequences will need to be 95% identical (-S 95) over an area covering 90% of the length (-L .9) of each sequence (-b T) . Using "-b F" instead of "-b T" would enforce the alignment length threshold on only one member of a sequence pair. The parameter "S", used here to specify the percent identity, can also be used to specify, instead, a "score density." The latter is equivalent to the BLAST score divided by the alignment length. If "S" is given as a number between 0 and 3, it is interpreted as a score density threshold; otherwise it is interpreted as a percent identity threshold.

To create a stringent non-redundant protein sequence set, use the following command line:

blastclust -i infile -o outfile -p T -L 1 -b T -S 100

In this case, only sequences which are identical will be clustered together. The “blastclust.txt” file in the standalone BLAST package details the full range of BLASTClust parameters.
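The threshold rules quoted above can be written out as a small Python sketch. It is not BLASTClust's implementation: the function names are invented, coverage is approximated here as alignment length divided by sequence length, and treating an S value of exactly 3 as a percent identity is an assumption about the boundary.

```python
def interpret_S(s):
    """Classify the -S value the way the quoted text describes."""
    if 0 <= s < 3:
        return "score density"
    return "percent identity"


def is_match(identity_pct, aln_len, len_a, len_b, S=95, L=0.9, both=True):
    """Decide whether two sequences register as a pairwise match.

    both=True corresponds to -b T (coverage required on each sequence),
    both=False to -b F (coverage required on only one member of the pair).
    """
    if interpret_S(S) != "percent identity":
        raise ValueError("this sketch only handles percent-identity thresholds")
    cov_a = aln_len / len_a
    cov_b = aln_len / len_b
    if both:
        covered = cov_a >= L and cov_b >= L
    else:
        covered = cov_a >= L or cov_b >= L
    return identity_pct >= S and covered


# Invented numbers: a 95-residue alignment between sequences of length 100 and 200.
print(is_match(96, 95, 100, 200, both=True))
print(is_match(96, 95, 100, 200, both=False))
```

The two calls differ only in the -b setting, which is enough to change the outcome when one sequence is much longer than the other.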

For help:

Blastclust h

Note: not -h, and not just blastclust on its own.

zouquan · posted 2012-8-26 16:18:12

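Using BLAST on Linux

1. Get a server account (contact Prof. Zou).

2. Install two programs: WinSCP and PuTTY.

3. Log in with WinSCP. [screenshot: WinSCP login]

4. Once logged in, the left pane is your computer and the right pane is the server. [screenshot: WinSCP file panes]

First upload the BLAST archive and unpack it (command: tar zxvf blast2.2.20.tar.gz)

Change into the blast/bin directory (command: cd blast/bin)

Upload the FASTA query sequences (e.g. unigene data) and the database file (e.g. genome data) to the server by dragging them from left to right. (Note: both the query file and the database file must be in FASTA format.)

5. Launch PuTTY from within WinSCP. [screenshot: location of the PuTTY button]

After entering your user name and password you land in your home directory; use cd to get to the bin directory.

If you are not sure where you are, type pwd to print the current directory, or ls to list the files in it. See a reference on common Linux shell commands for details.

6. Type ls and make sure the two files you just uploaded (e.g. unigene.fa and genome.fa) are in the current directory.

7. Two commands complete the BLAST run:

7.1 Build an index for the database with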

./formatdb -i genome.fa -p F (press Enter)

Note: -i is followed by the database file name.

For a DNA/RNA file use -p F; for a protein file use -p T.

7.2 Search the database with

./blastall -p blastn -i unigene.fa -d genome.fa -o output.txt -m 8 -e 1e-10

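Note: -p blastn searches a DNA database with DNA queries; to search proteins against proteins use -p blastp, and so on.

-i is followed by the query file name

-d is followed by the database file name

-o is followed by the output file name

-m 8 gives tabular output; without -m 8 the usual pairwise format is written

-e is followed by the E-value cutoff; the larger it is, the more hits are reported and the less stringent the search.

When the run finishes, refresh the server pane in WinSCP; there should be a new file output.txt, which you can download and inspect (drag from right to left).

BTW: running BLAST against nr takes a very long time. I have already built the index, so just contact me and you can skip step 7.1.

If you have problems at any step, contact me.

Zou Quan (zouquan@xmu.edu.cn)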
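Once output.txt has been downloaded, a common next step is to keep only the best hit per query. The sketch below assumes the -m 8 layout (twelve tab-separated columns, with the E-value in column 11 and the bit score in column 12); best_hits and the sample rows are invented for illustration.

```python
def best_hits(lines, max_evalue=1e-10):
    """For every query keep the hit with the highest bit score,
    considering only hits whose E-value does not exceed max_evalue."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Invented rows standing in for the contents of output.txt.
rows = [
    "u1\tchr1\t99.0\t300\t3\t0\t1\t300\t1000\t1299\t1e-150\t540",
    "u1\tchr2\t90.0\t100\t10\t0\t1\t100\t50\t149\t1e-20\t120",
    "u2\tchr3\t85.0\t50\t7\t1\t5\t54\t10\t59\t0.5\t30.1",
]
result = best_hits(rows)
print(result)
```

With a real file, the same function can be fed an open file handle, for example with open("output.txt") as fh: result = best_hits(fh).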

zouquan · posted 2013-3-24 09:38:31

http://blog.163.com/yinjianrui19 ... 919720113634752102/

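zouquan · posted 2013-4-17 16:32:25

nt (nucleotide) and nr (protein) are NCBI's comprehensive nucleotide and protein databases. If you want to check whether your sequence has already been published by someone else, search nt or nr.

I have built the nt, nr and refseq genomic databases on server 70; they are located under /backup2/blastdb/ in the nt or nr directory.

You can run commands directly, for example: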
#blastn -query query.fa -db /backup2/blastdb/nt/nt -out query.blast6 -outfmt 6 -num_threads 4

#blastx -query query.fa -db /home/blastdb/nr/nr -out query.blast6 -outfmt 6 -num_threads 4

# blastn -query scaffold.fa -db /home/blastdb/nt/nt -out scaffold.blast7 -num_threads 4 -outfmt "7 qseqid qlen sseqid slen qstart qend sstart send evalue length"
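Output written with the custom -outfmt "7 ..." string above has comment lines starting with # and then one tab-separated row per hit, in the order of the specifiers. A minimal Python sketch for reading it and computing how much of each query is covered follows; read_hits, query_coverage and the sample text are assumptions made for illustration.

```python
# Specifiers in the same order as in the -outfmt string above.
FIELDS = "qseqid qlen sseqid slen qstart qend sstart send evalue length".split()


def read_hits(lines, fields=FIELDS):
    """Yield one dict per hit, skipping the comment lines of -outfmt 7."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield dict(zip(fields, line.split("\t")))


def query_coverage(hit):
    """Fraction of the query sequence spanned by this alignment."""
    qstart, qend = int(hit["qstart"]), int(hit["qend"])
    return (abs(qend - qstart) + 1) / int(hit["qlen"])


# Invented sample standing in for scaffold.blast7.
sample = [
    "# example comment line\n",
    "scaf1\t1000\tsubj1\t5000\t101\t600\t2000\t2499\t1e-80\t500\n",
]
hits = list(read_hits(sample))
print(hits[0]["sseqid"], query_coverage(hits[0]))
```

This is where the qlen specifier pays off: without it the query length would have to be looked up from the FASTA file.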

zouquan · posted 2013-4-26 13:34:13

Computing PSSM matrices for protein sequences with BLAST+
http://datamining.xmu.edu.cn/bbs ... wthread&tid=896

zouquan · posted 2013-4-29 21:29:35

An example protein blast+ run:
#makeblastdb -dbtype prot -in 16245.fasta
# blastp -query 4151.fasta -db 16245.fasta -out zouout.blast6 -outfmt "6 qseqid qlen sseqid slen qstart qend sstart send evalue length"

PSIBlast

#makeblastdb -in prefix_TATA90.fasta -dbtype prot
#psiblast -query prefix_TATA90.fasta -db prefix_TATA90.fasta -out zz.out -evalue 10 -outfmt 6
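When the blast+ commands above are launched from a script, the quoted -outfmt string has to reach the program as a single argument. A hedged Python sketch of the makeblastdb and blastp pair follows; blastp_commands is a hypothetical helper, and it only builds and prints the commands without running them.

```python
import shlex

# The whole format string is one argument, exactly as inside the quotes above.
OUTFMT = "6 qseqid qlen sseqid slen qstart qend sstart send evalue length"


def blastp_commands(query, db_fasta, out):
    """Return the argument lists for building the database and searching it."""
    make_db = ["makeblastdb", "-dbtype", "prot", "-in", db_fasta]
    search = ["blastp", "-query", query, "-db", db_fasta,
              "-out", out, "-outfmt", OUTFMT]
    return make_db, search


make_db, search = blastp_commands("4151.fasta", "16245.fasta", "zouout.blast6")
print(shlex.join(make_db))   # shlex.join needs Python 3.8 or later
print(shlex.join(search))
```

To execute them, each list could be passed to subprocess.run(args, check=True), assuming the BLAST+ binaries are on the PATH; passing a list rather than a single string keeps the format string from being split at its spaces.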

