Abstract: Currently available gene annotation approaches rely on features that are unavailable in the short reads generated by next-generation sequencing, which results in substandard performance on metagenomic samples. Newer programs have been developed that improve performance on short reads. Their main deficits lie in predicting non-coding regions and gene boundaries, which gives rise to more false positives and false negatives than expected. By combining the predictions of these programs, specificity can be improved significantly at minimal cost to sensitivity, and the combined approach can be applied to a real dataset to demonstrate its use. Here, the plan is to benchmark ten metagenomic gene prediction programs and combine their predictions to improve gene annotation of metagenomic reads.
Goal: To benchmark and optimize currently available metagenomic gene prediction programs in order to improve both prediction and annotation accuracy. The plan is to analyze ten leading metagenomic gene predictors, namely Orphelia, MetaGeneAnnotator (MGA), GeneMark, FragGeneScan, Glimmer-MG, MetaGUN, HMMgene, Prodigal, Phymm and FGENESH, with respect to their sensitivity and specificity. To do so, a simulation would be performed on an artificial dataset composed of coding, non-coding, and partially coding metagenomic reads. Several metrics would then be introduced to compare the predictors. Finally, different combinations of the prediction programs would be evaluated to determine whether combining their predictions improves accuracy.
Metagenomic analysis can be defined as the classification of microbial genomes through the direct isolation of genomic sequences from the environment without prior cultivation. Samples from an environment are sequenced using next-generation sequencing (NGS), also known as high-throughput sequencing, which yields short reads. Accurate gene annotation of environmental samples is necessary for the correct functional classification of genes, and it paves the way for efficient functional studies in metagenomics.
Currently available gene predictors fall into two groups. The first group consists of the ab initio predictors, which train model parameters on known annotations and then use them to predict unknown ones; they are widely used in gene prediction. Many ab initio gene-finding programs exist, e.g. GLIMMER and GeneMark. The second group consists of homology-based programs, which predict genes by aligning input sequences to the closest homologous sequences in a database. Popular programs of this type include GENEWISE and AGenDA.
These traditional gene prediction methods can hardly be used in metagenomics. The conventional ab initio approaches rely on recognizing open reading frames (ORFs), which begin with a start codon and end with an in-frame stop codon. Metagenomic reads, however, are very short and often contain incomplete ORFs that lack a start codon, a stop codon, or both, so these programs cannot be applied to them directly. Similarly, homology-based approaches depend heavily on databases, which contain only known sequences and thus a limited set of genes. Newer tools have therefore emerged to address these problems for metagenomic reads. Widely used programs for this purpose are Orphelia, MetaGeneAnnotator (MGA), GeneMark, FragGeneScan, Glimmer-MG, MetaGUN, HMMgene, Prodigal, Phymm and FGENESH.
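The ORF limitation can be illustrated with a minimal sketch. It assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, scans the forward strand only, and uses a made-up toy sequence; the function name complete_orfs is hypothetical and is not taken from any of the programs named here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def complete_orfs(seq):
    """Return (start, end) positions of forward-strand ORFs that have
    both a start codon and an in-frame stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open start codon, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3))
                start = None
    return orfs


# Toy gene (an assumption for illustration): ATG AAA CCC GGG TTT TAA
gene = "ATGAAACCCGGGTTTTAA"
# A short "read" taken from the middle of the gene has neither codon.
read = gene[3:15]

print(complete_orfs(gene))  # [(0, 18)]
print(complete_orfs(read))  # []
```

The read is entirely coding, yet a finder that insists on complete ORFs reports nothing for it, which is the situation the metagenomic predictors are designed to handle.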
Summary of the commonly used metagenomic gene prediction programs:
Orphelia: It is a metagenomic gene prediction tool for short environmental DNA sequences of unknown phylogenetic origin. Orphelia is built on a two-stage machine learning approach. An artificial neural network combines the sequence features and calculates, for each ORF in a fragment, the probability that it is coding. A greedy strategy then selects a likely combination of high-scoring ORFs subject to an overlap constraint.
MetaGeneAnnotator: MetaGeneAnnotator (MGA) is an upgraded version of MetaGene (MG), a software package for gene prediction in metagenomic sequence data. MetaGene predicts genes in two stages. First, all possible ORFs are extracted from the input sequences; next, each ORF is scored by its base composition and length using a scoring scheme.
GeneMark: It uses a heuristic approach that constructs a set of Markov models from a minimal amount of sequence information. This approach is used to find genes in small fragments of anonymous prokaryotic genomes, in the genomes of viruses and phages, and in highly inhomogeneous genomes where the models must be adapted to the local DNA composition.
FragGeneScan: It is a hidden Markov model (HMM)-based predictor of incomplete and complete genes in short reads or complete genomes. It incorporates sequencing error models and codon usage into an HMM framework.
Glimmer-MG: It clusters the input sequences and retrains its predictive models within each cluster on an initial set of gene predictions before making a final set of predictions. To account for fragmented genes, it incorporates a model for gene length in which partial genes are handled carefully.
MetaGUN: It predicts genes in metagenomic fragments using a machine learning approach based on support vector machines (SVMs). First, it classifies input fragments into phylogenetic groups using a k-mer based method. Then, the coding sequences are identified for each group independently with SVM-based classifiers. Finally, the translation initiation sites (TISs) are adjusted using a modified version of MetaTISA.
HMMgene: The program is based on a hidden Markov model, which is a probabilistic model. HMMgene can also report the N best gene predictions for a sequence. This is useful when there are several equally probable gene structures, and it may even indicate alternative splicing.
Prodigal (metagenomic version): The name stands for Prokaryotic Dynamic Programming Genefinding Algorithm. It was developed by a “trial and error” approach on a set of curated genomes that had already been analyzed using the JGI ORNL pipeline. This pipeline combined Critica and Glimmer to locate missing genes and to correct errors, followed by a round of manual curation of the genome sets.
Phymm/PhymmBL: It uses interpolated Markov models (IMMs) to classify DNA sequences taxonomically and can accurately classify reads as short as about 100 bp. PhymmBL, the hybrid classifier included in the distribution, combines the analyses of Phymm and BLAST and gives highly accurate results.
FGENESH: It is an HMM-based gene finder. Its variants use similarity information, namely FGENESH+ (similar protein), FGENESH_C (similar cDNA) and FGENESH-2 (homologous genomic sequences), to greatly improve the accuracy of gene prediction when such information is available.
To improve gene prediction and annotation in metagenomes, the proposed approach is first to analyze the ten leading metagenomic gene prediction programs with respect to their sensitivity and specificity in predicting whether a read contains a gene. Different ways of combining the prediction programs would then be analyzed to improve sensitivity, specificity, and prediction and annotation accuracy.
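One simple way of combining per-read predictions is a vote threshold: a read is called gene-containing when at least a chosen number of programs agree. The sketch below is only an illustration of that idea, not the combination rule fixed by this plan; the predictor names and the calls are invented placeholders, and consensus_calls is a hypothetical helper.

```python
def consensus_calls(predictions, min_votes):
    """Combine per-read boolean calls from several programs.

    predictions: dict mapping a program name to a list of booleans,
                 one per read, in the same read order for every program.
    min_votes:   number of programs that must call a read gene-containing.
    """
    per_read_votes = zip(*predictions.values())
    return [sum(votes) >= min_votes for votes in per_read_votes]


# Invented calls for four reads from three placeholder predictors.
predictions = {
    "predictor_a": [True, True, False, False],
    "predictor_b": [True, False, True, False],
    "predictor_c": [True, True, False, True],
}

print(consensus_calls(predictions, min_votes=1))  # [True, True, True, True]
print(consensus_calls(predictions, min_votes=2))  # [True, True, False, False]
print(consensus_calls(predictions, min_votes=3))  # [True, False, False, False]
```

Raising min_votes makes the combined call stricter, which tends to trade sensitivity for specificity; the benchmark would show where the useful threshold lies.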
To combine these methods, receiver operating characteristic (ROC) curves would be used. ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently of the context. To evaluate which method or combination of methods performs best regardless of read length, the read length would be varied and the ROC curve evaluated for each setting.
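A ROC curve can be computed from any per-read score by sweeping a decision threshold from high to low and recording the false-positive and true-positive rates at each step. The following is a minimal sketch in plain Python; using the number of agreeing programs as the score is an assumption made for illustration, and the scores and labels are invented.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) points obtained by
    lowering the threshold over the scores; tied scores are taken together."""
    pos = sum(1 for label in labels if label)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative read")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented example: score = number of programs calling the read a gene,
# label = 1 if the read truly contains a gene.
vote_counts = [3, 2, 2, 0]
truth = [1, 1, 0, 0]
points = roc_points(vote_counts, truth)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.875
```

Repeating this for each read length gives one curve per setting, so methods can be compared by the shape of the curve or by the area under it.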
Next, an artificial dataset composed of coding, non-coding, and partially coding metagenomic reads would be simulated. This dataset can be created from available microbiome entries in databases, e.g. the human endogenous intestinal microflora.
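The read simulation can be sketched as sampling fixed-length windows from an annotated genome and labelling each window by how much of it falls inside annotated genes. This is a simplified illustration under stated assumptions: gene annotations are given as non-overlapping half-open (start, end) intervals, reads are error-free and taken from the forward strand, and the names label_read and simulate_reads, the toy genome and the gene coordinates are all hypothetical.

```python
import random


def label_read(start, end, genes):
    """Classify the read [start, end) by its overlap with gene intervals."""
    covered = 0
    for gene_start, gene_end in genes:
        covered += max(0, min(end, gene_end) - max(start, gene_start))
    if covered == 0:
        return "non-coding"
    if covered == end - start:
        return "coding"
    return "partially-coding"


def simulate_reads(genome, genes, read_length, n_reads, seed=0):
    """Sample n_reads windows of read_length bases at random positions."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(genome) - read_length + 1)
        end = start + read_length
        reads.append((genome[start:end], label_read(start, end, genes)))
    return reads


# Toy genome of 200 bases with one annotated gene at positions 50..150.
genome = "ACGT" * 50
genes = [(50, 150)]
for sequence, label in simulate_reads(genome, genes, read_length=30, n_reads=5):
    print(label, sequence)
```

Varying read_length in such a simulation produces the datasets needed to study how each predictor behaves as reads get shorter.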
Several metrics would then be used to compare the predictors and their parameter settings. The predictions would be analyzed using sensitivity, specificity, their harmonic mean (F-measure), prediction accuracy and annotation error.
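The read-level metrics can be written down from the confusion counts. The sketch below assumes the usual definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), and takes the F-measure to be the harmonic mean of these two quantities as the text suggests; some of the gene prediction literature instead uses TP / (TP + FP) for specificity, so the definition should be fixed before results are compared. Annotation error is left out because its definition is not given here. The function names and the example calls are illustrative only.

```python
def confusion_counts(predicted, truth):
    """Count true/false positives and negatives over paired boolean calls."""
    tp = fp = tn = fn = 0
    for p, t in zip(predicted, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(predicted, truth):
    tp, fp, tn, fn = confusion_counts(predicted, truth)
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Harmonic mean of sensitivity and specificity (assumed definition).
        "f_measure": ratio(2 * sensitivity * specificity,
                           sensitivity + specificity),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Invented calls for five reads: one false positive, no false negatives.
predicted = [True, True, False, False, True]
truth = [True, False, False, False, True]
print(evaluate(predicted, truth))
```

For this toy input the counts are TP = 2, FP = 1, TN = 2 and FN = 0, giving a sensitivity of 1, a specificity of 2/3, and an F-measure and accuracy of 0.8 each.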