2.2.2 PseAAC-Builder, [attach]1586[/attach]: unpack it on the server, go straight to the prebin folder, and copy in the protein FASTA file, for example 2.fasta [attach]1587[/attach]. (Note the format: each FASTA title line should look like >Q9ZW3|1, where 1 is the class label. INPUT_FILE should be a valid FASTA format file. The additional restriction is on the annotation line, which always starts with ">". Every annotation line in the input FASTA file must contain two and only two fields, separated by "|". The first field is the id of the sequence, which can be a gi number, a UniProt id, an accession number, or anything else that identifies the sequence uniquely in the FASTA file. The second field is an integer, which is used as the class label when you export the result. If you do not need a class label, simply put a "0" in this field.)
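The annotation-line rule above can be checked before the file is handed to PseAAC-Builder. The following is only a minimal Python sketch of such a check: `parse_header` and `check_fasta` are hypothetical helper names, not part of PseAAC-Builder, and accepting a signed integer as the label is an assumption (the text only says "an integer").

```python
import re

# ">" + id (no "|") + exactly one "|" + integer class label.
# Assumption: a leading minus sign is tolerated; the tool itself may not accept it.
HEADER_RE = re.compile(r"^>([^|\s][^|]*)\|(-?\d+)\s*$")


def parse_header(line):
    """Return (sequence_id, class_label) or raise ValueError."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError("bad annotation line: %r" % line)
    return match.group(1).strip(), int(match.group(2))


def check_fasta(lines):
    """Validate every annotation line and reject duplicate ids."""
    seen = {}
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, not checked in this sketch
        seq_id, label = parse_header(line.rstrip("\n"))
        if seq_id in seen:
            raise ValueError("duplicate id %r on line %d" % (seq_id, number))
        seen[seq_id] = label
    return seen
```

Calling `check_fasta(open("2.fasta"))` would return a dictionary mapping each id to its class label, or stop at the first line that does not have exactly two fields.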
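Note: the ROC calculation in this code does not handle the case where several samples receive the same predicted value. For classifiers whose predicted values are coarse (for example a random forest with 10 trees), the result can therefore deviate considerably from the value computed by Weka; treat Weka's value as authoritative.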
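For reference, one common way to handle tied predictions is to treat every group of samples with the same score as a single diagonal segment of the ROC curve and integrate with the trapezoid rule. The sketch below is an illustration in Python, not the code discussed in this thread and not Weka's implementation; `auc_with_ties` is a hypothetical name, and it assumes label 1 means positive and any other label means negative.

```python
def auc_with_ties(scores, labels):
    """Area under the ROC curve; equal scores form one diagonal segment."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    # Sort by descending score so the threshold sweeps from strict to lenient.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    area = 0.0
    tp = 0
    i = 0
    while i < len(ranked):
        # Collect the whole group of samples that share this score.
        j = i
        group_tp = group_fp = 0
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                group_tp += 1
            else:
                group_fp += 1
            j += 1
        # Trapezoid over the group: width = new false positives,
        # height = mean true-positive count before and after the group.
        area += group_fp * (tp + group_tp / 2.0)
        tp += group_tp
        i = j
    return area / float(positives * negatives)
```

With this treatment a tied positive/negative pair contributes one half to the area, so the result does not depend on the order in which tied samples happen to be listed.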
The ROC50 score is the area under the ROC curve, up to the first 50 false positives.
Author: Snow_Bubble Time: 2013-8-2 21:30
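The method of Dong Qiwen (Fudan University):
It is built on the PSSM matrix and comes in two variants, AC and ACC.
The AC method measures the correlation within a single property. That is, in each row of the PSSM matrix, every value is correlated with the values at distances 1, 2, ..., LG from it, with some special handling near the two ends; this yields 20*LG dimensions.
The ACC method measures the correlation between different properties and yields 380*LG dimensions; adding the 20*LG dimensions of the AC method gives 400*LG dimensions in total.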
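To make the dimension counts concrete, here is a minimal Python sketch of the two transforms. It assumes the usual covariance form (product of deviations from the row mean, averaged over the L - lag valid positions) and a matrix stored as 20 rows, one per property, each of length L. The special handling near the ends mentioned above is not reproduced, so the numbers will not necessarily match the output of Dong Qiwen's program; `ac_features` and `acc_features` are hypothetical names.

```python
def _covariance(row_a, mean_a, row_b, mean_b, lag):
    """Mean product of deviations between row_a[j] and row_b[j + lag]."""
    n = len(row_a) - lag
    total = sum((row_a[j] - mean_a) * (row_b[j + lag] - mean_b) for j in range(n))
    return total / float(n)


def _check(rows, max_lag):
    if any(len(row) <= max_lag for row in rows):
        raise ValueError("sequence length must exceed LG")


def ac_features(rows, max_lag):
    """Same-property covariance: len(rows) * max_lag values."""
    _check(rows, max_lag)
    features = []
    for row in rows:
        mean = sum(row) / float(len(row))
        for lag in range(1, max_lag + 1):
            features.append(_covariance(row, mean, row, mean, lag))
    return features


def acc_features(rows, max_lag):
    """Covariance for every ordered pair of different properties."""
    _check(rows, max_lag)
    means = [sum(row) / float(len(row)) for row in rows]
    features = []
    for a, row_a in enumerate(rows):
        for b, row_b in enumerate(rows):
            if a == b:
                continue  # same-property terms belong to the AC part
            for lag in range(1, max_lag + 1):
                features.append(_covariance(row_a, means[a], row_b, means[b], lag))
    return features
```

With 20 rows there are 20 same-property terms and 20*19 = 380 ordered pairs of different properties per lag, which is where the 20*LG and 380*LG counts come from; concatenating both lists gives the 400*LG vector.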
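Experiments with Dong Qiwen's code on different datasets suggest that LG between 8 and 10 works well, with 10 the best choice, though this may change with the dataset, the classifier, and other factors.
Dong Qiwen's code is at http://www.iipl.fudan.edu.cn/demo/accpkg.html. It includes a readme (in English); some usage notes follow.
Open main.cpp in VS2008 or similar software. Taking VS2008 as the example: create a new Win32 project, add main.cpp to the source files as an existing item, and build it; the executable can be found in the project's Debug folder. Rename it to AC.exe.
For the ACC method, first remove the line #define AC from the source code; the resulting executable can be named ACC.exe.
Create a new Java project and carry out the following steps: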
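1. Write the bat files. There will be many PSSM matrices, because one file is generated per sequence. For generating PSSM matrices, see 2.1. The content of a bat file is:
AC.exe LG pssm_matrix_filename out_filename
or
ACC.exe LG pssm_matrix_filename out_filename
****Suggestion****: give both the matrix filename and the out filename one level of folder path, using a separate folder for each; otherwise things get very messy.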
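The command lines of step 1 can be generated rather than typed by hand. The thread uses a Java project for this; the sketch below uses Python purely for brevity. The folder names `pssm` and `out`, the `.out` extension, and the helper names `build_commands` and `write_bat` are assumptions for illustration; only the argument order (executable, LG, PSSM file, output file) comes from the description above. It also writes all commands into a single bat file, which is a simplification.

```python
import os


def build_commands(exe, lg, pssm_dir, out_dir, pssm_names):
    """Return one command line per PSSM file, in sorted order."""
    commands = []
    for name in sorted(pssm_names):
        stem = os.path.splitext(name)[0]
        # Backslashes are written explicitly because the bat file runs on Windows.
        pssm_path = pssm_dir + "\\" + name
        out_path = out_dir + "\\" + stem + ".out"
        commands.append("%s %d %s %s" % (exe, lg, pssm_path, out_path))
    return commands


def write_bat(bat_path, commands):
    """Write the commands to a batch file, one per line."""
    with open(bat_path, "w") as handle:
        handle.write("\n".join(commands) + "\n")
```

For example, `write_bat("run_ac.bat", build_commands("AC.exe", 10, "pssm", "out", os.listdir("pssm")))` would produce one line per matrix file in the `pssm` folder.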
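2. Put AC.exe or ACC.exe and all the PSSM matrices into this project and run all the bat files; each file yields one feature record.
3. Merge all the files in the output folder to obtain the feature output.
4. After the previous step you can write one more method to delete those very large folders, or simply leave them.
5. Using the arff file obtained by another method as a reference (without the file header, that is, the @ part), write the class of each feature record.
6. Finally, add the arff file header and the result is a standard arff file.
Author: zouquan Time: 2013-8-11 23:51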
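Generally speaking, in the experimental section of a paper, besides using cross-validation to show that the proposed classification method is effective, you should also do some deeper analysis. See the following letter from the editor-in-chief of a journal (Protein & Peptide Letters, PPL) to a guest editor.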
This higher standard is the expectation that such papers will contain a correlation of computational predictions with observations in experimental studies of proteins. I'm sure you will agree that for a computational method to be considered valid it should be able to demonstrate that it works for a known protein. For example, if you use a set of N known proteins to derive information about where phosphate is likely to be added and develop a program to predict the same property for unknown proteins, the program should first be shown to make correct predictions for several other examples of structurally known proteins that were not part of the first set from which the rules were developed. Unfortunately, I have seen many manuscripts submitted to PPL in the past few years that leave out this important correlation and have had to reject them with advice to the authors to consider revising and re-submitting with the inclusion of the additional information. I request that you include some discussion of this in the letter that you sent to all potential authors so that they prepare their manuscripts while following this expectation.
Author: RockRabbit Time: 2013-8-27 21:55
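Secondary-structure features are often used during feature extraction in protein classification problems, so here I describe how to install PSIPRED, a commonly used secondary-structure prediction program. I hope it is useful to fellow students.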