Electronically inferred events in Arabidopsis Reactome

Electronic inference from the Arabidopsis event set to a predicted event set in another fully-sequenced plant proceeds in four steps.

1) Protein orthology data were obtained from the OrthoMCL DB, Version 1.4. Briefly, the OrthoMCL clustering procedure, by Li et al. (2003), works as follows: An all-against-all BLASTP is performed on all proteins from Arabidopsis and the other fully-sequenced plant. Reciprocal best similarity pairs between species, and reciprocal better similarity pairs within species (i.e., recently arisen paralogues, proteins that are more similar to each other within one species than to any protein in the other species) are entered into a similarity matrix. The matrix is normalized by species and subjected to Markov clustering to generate orthologue groups including recent paralogues.

2) All Arabidopsis reactions in the Arabidopsis Reactome knowledgebase involving one or more proteins are eligible for electronic inference, with two exceptions. Reactions that were themselves inferred based on data from the other fully-sequenced plant, and reactions involving species in addition to Arabidopsis (e.g., viral infection of Arabidopsis cells) are excluded from electronic inference. Eligible reactions are checked to determine whether each involved protein has at least one OrthoMCL orthologue or recent paralogue (OP) in the other fully-sequenced plant. If an Arabidopsis reaction involves a complex, at least 75% of the accessioned protein components of the Arabidopsis complex must have OPs in the other fully-sequenced plant.

3) For each reaction that meets these criteria, an equivalent reaction is created for the other fully-sequenced plant by replacing each Arabidopsis protein with its OP from that plant. If an Arabidopsis protein corresponds to more than one OP from the other fully-sequenced plant, a DefinedSet called 'Homologues of ...' is created, with the other fully-sequenced plant OPs as members. For Arabidopsis proteins that lack an OP in the other fully-sequenced plant but that are included in complexes inferred due to the 75% threshold rule, placeholder entities for the other fully-sequenced plant (called 'Ghost homologue of ...') are created.

4) If this analysis generates reactions in the other fully-sequenced plant corresponding to any of the steps of an Arabidopsis pathway, then the pathway event is also inferred for the other fully-sequenced plant.
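
Steps 2 to 4 can be illustrated with a minimal Python sketch. It is not the Reactome implementation: the Reaction class, the function names, the dictionary layout of the OP lookup table and the protein identifiers are all hypothetical and chosen only for illustration. The sketch assumes that every complex has at least one accessioned protein component, and that the exclusions in step 2 (reactions already inferred from the other plant, multi-species reactions) have been applied beforehand.

```python
from dataclasses import dataclass, field

COMPLEX_THRESHOLD = 0.75  # the 75% rule from step 2


@dataclass
class Reaction:
    """Hypothetical, simplified reaction record (not the Reactome data model)."""
    name: str
    proteins: list                                 # free Arabidopsis protein ids
    complexes: list = field(default_factory=list)  # each complex: list of protein ids


def complex_is_inferable(components, ops):
    """True if at least 75% of the complex components have an OP."""
    with_op = sum(1 for p in components if ops.get(p))
    return with_op / len(components) >= COMPLEX_THRESHOLD


def is_inferable(reaction, ops):
    """Step 2: each free protein needs an OP; each complex must pass the 75% rule."""
    if not all(ops.get(p) for p in reaction.proteins):
        return False
    return all(complex_is_inferable(c, ops) for c in reaction.complexes)


def replace_protein(protein, ops):
    """Step 3: map one Arabidopsis protein to its counterpart in the other plant."""
    hits = ops.get(protein, [])
    if not hits:
        # Only reachable for complex components admitted by the 75% rule.
        return f"Ghost homologue of {protein}"
    if len(hits) == 1:
        return hits[0]
    return {"DefinedSet": f"Homologues of {protein}", "members": sorted(hits)}


def infer_reaction(reaction, ops):
    """Steps 2 and 3 combined: return the inferred reaction, or None."""
    if not is_inferable(reaction, ops):
        return None
    return {
        "name": reaction.name,
        "proteins": [replace_protein(p, ops) for p in reaction.proteins],
        "complexes": [[replace_protein(p, ops) for p in c]
                      for c in reaction.complexes],
    }


def infer_pathway(reactions, ops):
    """Step 4: the pathway is inferred if any of its reactions is inferred."""
    inferred = [r for r in (infer_reaction(x, ops) for x in reactions)
                if r is not None]
    return inferred or None


# Made-up identifiers: AT* are Arabidopsis proteins, OS* are OPs in the other plant.
ops = {"AT1": ["OS1"], "AT2": ["OS2a", "OS2b"], "AT3": ["OS3"], "AT4": []}
r1 = Reaction("r1", ["AT1"], [["AT1", "AT2", "AT3", "AT4"]])  # 3 of 4 components have OPs
r2 = Reaction("r2", ["AT4"])                                  # free protein without an OP
pathway = infer_pathway([r1, r2], ops)
```

In this toy input, r1 passes because three of the four complex components (exactly 75%) have OPs, so AT4 is replaced by a ghost homologue and AT2 by a DefinedSet, whereas r2 is rejected because its only protein has no OP. The pathway is still inferred, since one of its reactions was.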

These electronically inferred reactions are predictions based on a number of assumptions. Most basically, we assume that if we can find other fully-sequenced plant OPs corresponding to all proteins involved in an Arabidopsis reaction, then the proteins mediate the same reaction in the other fully-sequenced plant. This may not be true. On the other hand, we may miss a truly orthologous reaction in the other fully-sequenced plant because it is mediated by structurally divergent proteins that the OrthoMCL strategy fails to identify. Similarly, complexes sharing less than 75% orthologous subunits between species may nevertheless continue to perform the same function. The electronically inferred reactions presented in Reactome are thus not data, but hypotheses useful to direct the design of confirmatory experiments.

We infer events from Arabidopsis reactions for five other plants for which high-quality whole-genome sequence data are available, and hence for which a comprehensive and high-quality set of protein predictions exists. These species are rice, poplar, two grape varieties and a moss. The estimated success rate of our orthology inference strategy can be stated as the percentage of eligible reactions (as defined in step 2) in the current Arabidopsis data set for which an event can be inferred in the other fully-sequenced plant.
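
As a small illustration of this percentage, the following Python sketch computes it from two counts. The function name and the example counts are hypothetical and are not taken from any Reactome release.

```python
def success_rate(n_inferred, n_eligible):
    """Percentage of eligible Arabidopsis reactions with an inferred event."""
    if n_eligible <= 0:
        raise ValueError("there must be at least one eligible reaction")
    if not 0 <= n_inferred <= n_eligible:
        raise ValueError("inferred count must lie between 0 and the eligible count")
    return 100.0 * n_inferred / n_eligible


# Made-up counts for illustration only: 50 of 200 eligible reactions inferred.
example_rate = success_rate(50, 200)
```

The denominator counts only eligible reactions, so reactions excluded in step 2 do not lower the rate.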