Researchers develop data-mining system for biological literature.
Textpresso is a text-mining system for scientific literature that
implements a search engine using 33 categories of terms, with which
a database of articles and individual sentences can be searched.
Textpresso incorporates a search engine that lets the user search for one or a combination of tags and/or keywords within a sentence or document. Because the search criteria allow word meaning to be queried, it is possible to formulate semantic queries.
Text-mining tools have become indispensable for the biomedical sciences as the increasing wealth of literature in biology and medicine makes it difficult for the researcher to keep up to date with ongoing research.
This problem is worsened by the fact that researchers in the biomedical sciences are turning their attention from small-scale projects involving only a few genes or proteins to large-scale projects including genome-wide analyses, making it necessary to capture extended biological networks from literature.
The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype) and classes that relate two objects (e.g., association, regulation) or describe one (e.g., biological process). Together they form a catalogue of types of objects and concepts called an ontology.
After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories.
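The markup step can be pictured with a minimal sketch. The lexicon, category names and `tag_sentence` function below are hypothetical illustrations, not Textpresso's actual data or code; the sketch only assumes that each category holds a list of terms and that every occurrence of a term in a sentence is recorded with its category.

```python
import re

# Hypothetical mini-lexicon (category -> terms); not Textpresso's real lexicon.
LEXICON = {
    "gene": ["lin-3", "let-23"],
    "regulation": ["regulates", "represses", "activates"],
}

def tag_sentence(sentence, lexicon=LEXICON):
    """Return sorted (start, end, category, matched_text) tuples for lexicon terms."""
    tags = []
    for category, terms in lexicon.items():
        for term in terms:
            # Require that the term is not embedded in a longer word or identifier.
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                tags.append((m.start(), m.end(), category, m.group(0)))
    return sorted(tags)

tags = tag_sentence("lin-3 activates let-23 in the vulva.")
for start, end, category, text in tags:
    print(f"{text!r} [{start}:{end}] -> {category}")
```

A real system would need a far larger lexicon and would have to handle overlapping terms and word variants, which this sketch ignores.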
A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document. Because the ontology allows word meaning to be queried, it is possible to formulate semantic queries.
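As a rough illustration of such a query, the sketch below filters sentences that carry all requested category tags and contain all requested keywords. The `TaggedSentence` type, the `search` function and the sample corpus are assumptions made for this example only and do not reflect how Textpresso is implemented.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedSentence:
    text: str
    # Category tags assigned during markup (hypothetical representation).
    categories: set = field(default_factory=set)

def search(sentences, categories=(), keywords=()):
    """Return sentences that carry every requested category and keyword."""
    wanted = set(categories)
    hits = []
    for s in sentences:
        lowered = s.text.lower()
        has_tags = wanted <= s.categories
        has_words = all(k.lower() in lowered for k in keywords)
        if has_tags and has_words:
            hits.append(s)
    return hits

# Made-up example sentences with made-up tags.
corpus = [
    TaggedSentence("lin-3 activates let-23 in the vulva.", {"gene", "regulation"}),
    TaggedSentence("The let-23 mutant shows a vulvaless phenotype.", {"gene", "phenotype"}),
    TaggedSentence("Worms were grown at 20 degrees.", set()),
]

results = search(corpus, categories=["gene", "regulation"], keywords=["let-23"])
for s in results:
    print(s.text)
```

Combining the category "regulation" with the keyword "let-23" returns only the first sentence, which is the sense in which a query becomes semantic: it asks for a kind of statement, not just a string.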
Most information about biological discoveries is stored in descriptive full text. Distilling this information from scientific papers manually is expensive and slow, if the full text is available to the researcher at all.
Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies. For searches involving two uniquely named genes and an interaction term, the ontology confers a three-fold increase in search efficiency.
Textpresso is currently focusing on Caenorhabditis elegans literature, with 3,800 full-text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database.
Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org
The researchers Hans-Michael Muller, Eimear Kenny and Paul Sternberg said that the future development of Textpresso could be undertaken at many different levels.
They added: "We believe that Textpresso can be extended to achieve information extraction. The wealth of information buried in semantic tag sequences of 1 million sentences asks to be massively exploited by pattern-matching, statistical and machine learning algorithms."
"We have already started to run simple pattern-matching scripts to populate gene-allele associations from the literature for WormBase, as many of them are written in the form 'gene name(allele name)', such as 'lin3(n1058)'."