Machine Learning and Bioinformatics Laboratory Alliance

Title: GFF3 annotation files

Author: zouquan Time: 2012-4-23 09:05
Title: GFF3 annotation files
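GFF3 is the current standard of the GFF annotation format. Each line of the file describes one feature of the genome in 9 TAB-separated columns. In order:
1. reference sequence: the sequence that the feature is annotated on, such as a chromosome, a clone, or a fragment. A file may contain several reference sequences.
2. source: where the annotation comes from. Use a dot (.) if it is unknown.
3. type: the type of the feature. Names that follow Sequence Ontology (SO) conventions are recommended (see the Sequence Ontology Project), such as gene, repeat_region, exon, CDS.
4. start position: the start of the region covered by the feature. Counting starts at 1.
5. end position: the end of the region covered by the feature. It must be greater than or equal to the start position, and both positions are inclusive.
6. score: for features that can be quantified, a number that expresses the degree. Use a dot (.) if it is empty.
7. strand: "+" for the forward strand, "-" for the reverse strand, "." when no strand needs to be specified.
8. phase: for protein-coding CDS features, this column says where the next codon begins. It is 0, 1, or 2: the number of bases to skip from the start of the feature to reach the first base of the next codon. For all other features, use a dot (.).
9. attributes: a list of attributes in the form "tag=value", separated by semicolons. Spaces are allowed, but any of ",", "=", ";" inside a tag or value must be URL-escaped, and a TAB must be written as "%09". The following tags are predefined:
   ID: a unique identifier for the feature. Other features can refer to it (through the Parent attribute), which makes it very useful for grouping features, for example to find all the exons in a transcription unit.
   Name: the name of the feature. This is what is displayed to the user.
   Alias: an alternative name for the feature. Use it when the feature is known by other names.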
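
To make the nine-column layout concrete, here is a minimal Python sketch that splits one GFF3 line into its columns and decodes the attributes column. The function names, the dictionary keys, and the example line are made up for illustration; the only behaviour beyond the description above is that a comma separates multiple values of one tag, which is why commas inside a value have to be escaped.

```python
from urllib.parse import unquote

# Illustrative key names for the nine columns (not mandated by the format).
COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_attributes(text):
    """Split column 9 into a dict of tag -> list of values."""
    attributes = {}
    if text == ".":
        return attributes
    for pair in text.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Split on "," before unescaping so an escaped "%2C" stays in its value.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def parse_gff3_line(line):
    """Return one feature as a dict, or None for blank and comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    feature = dict(zip(COLUMNS, fields))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    feature["attributes"] = parse_attributes(feature["attributes"])
    return feature


# Hypothetical example line; "%3B" is an escaped ";" inside the Name value.
example = "chr1\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001;Name=first%3Bexon;Alias=e1,ex1"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```

The score, strand, and phase columns are left as strings here so that a dot (.) can be kept as it is.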

Source: http://bio-spring.info/wp/?tag=gff3
Reference: http://www.sequenceontology.org/gff3.shtml

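Author: zouquan Time: 2012-8-27 15:57
How do you convert BLAST results to GFF3 format?

Use the following Perl script: [attach]964[/attach]

Note that the BLAST results must be written in tabular format with -m 8.

Then run it as shown in the figure below:
[attach]965[/attach]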
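
The attached Perl script is not reproduced in this thread, so the following is only a rough Python sketch of the same idea, not that script. It assumes the standard 12-column tabular layout (query, subject, % identity, length, mismatches, gap openings, q.start, q.end, s.start, s.end, e-value, bit score). The choices of "BLAST" as the source, "match" as the type, the bit score as the score column, the ID numbering scheme, and the use of the Target attribute are all assumptions made for this example.

```python
from urllib.parse import quote


def blast_tab_to_gff3(line, number, source="BLAST"):
    """Convert one 12-column BLAST tabular line to one GFF3 "match" line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError("expected 12 tab-separated columns, got %d" % len(fields))
    query, subject = fields[0], fields[1]
    q_start, q_end, s_start, s_end = (int(x) for x in fields[6:10])
    bit_score = fields[11].strip()
    # A hit on the minus strand is reported with subject start > subject end;
    # GFF3 needs start <= end, with the direction moved to the strand column.
    strand = "+" if s_start <= s_end else "-"
    start, end = min(s_start, s_end), max(s_start, s_end)
    # Escape reserved characters (";", "=", ",", spaces in the Target ID, ...).
    safe_query = quote(query, safe=":|")
    attributes = "ID=match%05d;Name=%s;Target=%s %d %d" % (
        number, safe_query, safe_query,
        min(q_start, q_end), max(q_start, q_end))
    return "\t".join([subject, source, "match", str(start), str(end),
                      bit_score, strand, ".", attributes])


# Hypothetical hit on the minus strand of chr1.
hit = "query1\tchr1\t98.50\t200\t3\t0\t1\t200\t5200\t5001\t1e-100\t361"
print("##gff-version 3")
print(blast_tab_to_gff3(hit, 1))
```

Here the subject sequence is treated as the reference sequence, so the hit is placed in subject coordinates and the query coordinates go into the Target attribute.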

Author: cwc Time: 2013-3-19 14:56
Last edited by cwc on 2013-6-18 15:59

With BLAST+, the option to use is -outfmt 6 or -outfmt 7 instead. Both produce tabular output; 7 adds comment lines.
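
One practical difference between the two: -outfmt 7 output contains comment lines starting with "#", which have to be skipped before the hit lines are converted. A small sketch of such a filter follows; the function name and the sample report lines are invented for illustration.

```python
def iter_blast_tabular(lines):
    """Yield the tab-split fields of each hit line, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield line.split("\t")


# Hypothetical -outfmt 7 style report: comment lines around one hit line.
report = [
    "# BLASTN 2.2.28+\n",
    "# Query: query1\n",
    "query1\tchr1\t98.50\t200\t3\t0\t1\t200\t5001\t5200\t1e-100\t361\n",
    "# BLAST processed 1 queries\n",
]
hits = list(iter_blast_tabular(report))
print(len(hits), hits[0][1])
```

The same function works unchanged on -outfmt 6 output, which has no comment lines, and on an open file handle in place of the list.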

Author: zouquan Time: 2013-3-27 07:21
http://www.sequenceontology.org/

Please keep an eye on this website.

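Author: xmubingo Time: 2013-6-18 14:18
Last edited by xmubingo on 2013-6-18 14:26

GFF conversion tools: a roundup
http://boyun.sh.cn/bio/?p=1827

A roundup of scripts that convert various formats to GFF. The scripts are scattered across different software packages and can be downloaded as needed.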

BioPerl

search2gff: This script will turn a protein Search report (BLASTP, FASTP, SSEARCH, AXT, WABA) into a GFF File.
genbank2gff3.pl: Genbank->gbrowse-friendly GFF3
gff2ps: This script provides GFF to postscript handling.

gbrowse

ucsc_genes2gff: Convert UCSC Genome Browser-format gene files into GFF files suitable for loading into gbrowse
http://search.cpan.org/~lds/GBrowse-2.39/bin/bed2gff3.pl
blast92gff3.pl: BLAST tabular output (-m 9 or 8) conversion to GFF version 3 format,
http://eugenes.org:7072/gmod/genogrid/scripts/

DAWGPAWS

http://dawgpaws.sourceforge.net/man/cnv_blast2gff.html

cnv_blast2gff.pl: This program will translate a blast report for a single query sequence into the GFF format.

ubuntu

sim2gff
ali2gff
blat2gff
gff2aplot
parseblast

Tandy software

http://eugenes.org/gmod/tandy/ ; http://iubio.bio.indiana.edu:7122/gmod/tandy/

gff2aplot: a program to visualize the alignment of two genomic sequences together with their annotations. From GFF-format input files it produces PostScript figures for that alignment.
blat2gff: Converts BLAT output files to GFF formatted files:
blat2gff < inputfile > outputfile

BioWiki also has an article that summarizes more GFF tools; see the link below:
http://biowiki.org/GffTools

GFF3
Explanation in English: http://gmod.org/wiki/GFF3

Explanation in Chinese: http://bio-spring.info/wp/?tag=gff3

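Author: hsc Time: 2013-9-11 21:23
To convert BLAST XML output to GFF3, you can first convert it to tabular format and then use blast92gff3.pl to turn that into GFF3. For the XML-to-tabular step you can use the open-source toolkit Biopython: Bio/Blast/NCBIXML.py can parse the XML document, and then you write a small program to do the conversion yourself. A reference program: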

import sys
from Bio.Blast import NCBIXML

# Parse the BLAST XML report named by the first command-line argument.
with open(sys.argv[1]) as file_handle:
    for record in NCBIXML.parse(file_handle):
        # Skip queries without any hit.
        if len(record.alignments) == 0:
            continue
        for align in record.alignments:
            for hsp in align.hsps:
                # One tab-separated line per HSP, in -m 8 column order.
                # Column 5 counts every non-identical column (gaps included)
                # and column 6 is the total gap count, not gap openings.
                print("%s\t%s\t%f\t%s\t%d\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % (
                    record.query_id, align.hit_id,
                    hsp.identities * 100.0 / hsp.align_length,
                    hsp.align_length, hsp.align_length - hsp.identities,
                    hsp.gaps, hsp.query_start, hsp.query_end,
                    hsp.sbjct_start, hsp.sbjct_end, hsp.expect, hsp.bits))


Of course, this program requires Biopython to be installed; there is an instruction document: [attach]1618[/attach]

Welcome to the Machine Learning and Bioinformatics Laboratory Alliance (http://123.57.240.48/)
