interlinkinc.net

Interlinkinc.net

Predication-based Semantic Indexing:
Permutations as a Means to Encode Predications in Semantic Space
Trevor Cohen, MBChB, PhDa, Roger W. Schvaneveldt, PhDb, Thomas C. Rindflesch, PhDc
aCenter for Decision Making and Cognition, Department of Biomedical Informatics, Arizona
State University, Phoenix Arizona
bDepartment of Applied Psychology, Arizona State University
cNational Library of Medicine, Bethesda, Maryland
Abstract
Background
Corpus-derived distributional models of semantic Many existing distributional models draw estimates of distance between terms have proved useful in a number semantic relatedness from co-occurrence statistics within of applications. For both theoretical and practical a defined context such as a sliding window or an entire reasons, it is desirable to extend these models to encode document (1). Recent models (reviewed in (6)) instead discrete concepts and the ways in which they are related define as a context a grammatical relationship produced to one another. In this paper, we present a novel vector by a parser, but do not encode the nature of this space model that encodes semantic predications derived relationship in a retrievable manner. Distributional from MEDLINE by the SemRep system into a compact models that encode word order using either convolution spatial representation. The associations captured by this products (7) or permutation of sparse random vectors method are of a different and complementary nature to (8) transform vectors representing terms into new those derived by traditional vector space models, and the representations close-to-orthogonal to the original encoding of predication types presents new possibilities vectors. Consequently there is minimal overlap in the for knowledge discovery and information retrieval. information they carry, and additional information related to term position can be encoded. These transformations Introduction
are reversible, to facilitate retrieval of this information. The biomedical literature contains vast amounts of knowledge that could inform our understanding of PSI is based on Sahlgren et al's model which uses human health and disease. Much of this literature is permutations as a means to encode word order available as electronic text, presenting an opportunity for information (8), which in turn is a variant of the Random the development of automated methods to extract and Indexing (RI) model (9). Sahlgren et al's approach encode knowledge in computer-interpretable form. provides a simple and elegant solution to the problem of Distributional models of language are able to extract reversibly transforming term vectors using permutations meaningful estimates of the semantic relatedness of the sparse random vectors which form the basis of RI. between terms from unannotated free text. These models The approach is derived from sliding-window (or term- have proved useful in a variety of biomedical term) RI, derives vector representations for terms from applications (for a review see (1)), and include recent their co-occurrence with other terms in a sliding window variants that scale comfortably to large biomedical moved through the text. While the sliding window corpora such as the MEDLINE corpus of abstracts (2). approach is well-established in distributional semantics, established methods either use the full term-term space or However, the semantic relatedness estimated by most reduce its dimensionality with the computationally distributional models is of a general nature. These models demanding Singular Value Decomposition (SVD). RI is do not encode the type of relationship that exists between able to achieve this dimension reduction step at a fraction terms, which limits their ability to support logical of the cost of SVD by constructing semantic vectors for inference. Furthermore, while distributional models such each term on-the-fly, without the need for a term-by-term as Latent Semantic Analysis (LSA) simulate human matrix. Each term in the text corpus is assigned an performance in many cognitive tasks (3), they do not elemental vector of dimensionality d (usually in the order represent the object-relation-object triplets (or of 1000), the dimensionality of a reduced-dimensional propositions) that are considered to be the atomic unit of semantic space within which the relatedness of terms will thought in cognitive theories of comprehension (4). In be measured. Elemental vectors are sparse: they contain this paper we address these issues by defining mostly zeros, with in the order of 10 non-zero values of Predication-based Semantic Indexing (PSI), a novel either +1 or -1. As there are many possible permutations distributional model of language that encodes semantic of these few non-zero values, elemental vectors tend to predications derived from MEDLINE by the SemRep be close-to-orthogonal to one another: their relatedness system (5) into a compact vector space representation. as measured with the commonly used cosine metric tends Associations captured by PSI complement those captured towards zero. This approximates a full term-by-term by existing models, and present new possibilities for matrix, but rather than assigning an orthogonal knowledge discovery and information retrieval. dimension to each term, RI assigns a near-orthogonal reduced-dimensional elemental vector. To encode We present in this paper a description of the theoretical additional information to do with word order, the and methodological basis of PSI, and include examples elemental vector for a given term is permuted to produce of the sorts of information the model encodes and a new vector, almost orthogonal to the vector from which retrieves discussed in context of possible applications. V1: [ 1 0 0 0 0 1 0 0 0 0 0 -1 0 0 0] V2: [ 0 1 0 0 0 0 1 0 0 0 0 0 -1 0 0] We derived a PSI space from a database of semantic predications extracted by SemRep from MEDLINE These vectors are orthogonal to one another: as there is citations dated between 2003 and September 9th 2008. no common non-zero dimension between them, their 13,562,350 predications were extracted from 2,634,406 cosine (or normalized dot-product) will be zero. V2 was citations by SemRep. Of these, predications involving derived from V1 by moving every value one position to negation (such as “DOES NOT TREAT”) are excluded, the right, and conversely this transformation can be leaving 13,380,712 predications which are encoded into reversed by moving every value in V2 one position to the the PSI space. We encode this predication information
left. This simple procedure is used by Sahlgren et al to using permutation-based RI. Rather than assigning encode word-order information into a term-term based elemental vectors to each term, we assign sparse semantic space. The semantic vector for each term elemental vectors (d=500) to each UMLS concept consists of the normalized linear sum of the permuted contained in the predications database. We then assign a elemental vector for every term with which it co-occurs, unique number to each of the included predication types with permutation encoding the relative position of each (such as “TREATS”). We create semantic vectors term in the sliding window. The reversible nature of this (d=500) for each UMLS concept in the database. Each transformation facilitates order-based retrieval. For time a given UMLS concept occurs in a predication, we example, a rotation one position to the right of all add to its semantic vector the elemental vector of the elements of the elemental vector for a term can be used other concept in the predication, permuted according to to generate a vector with high similarity to terms the predication type. For example, in the predication occurring one space to the left of it. Table I provides “Isoniazid TREATS Tuberculosis” we would add the some examples of order-based retrieval in a permutation- elemental vector for Tuberculosis (TB) to the semantic based space derived from the MEDLINE corpus of vector for Isoniazid (INH) but rotate every element in abstracts using the Semantic Vectors package (10). this elemental vector 39 (the number assigned to the predicate “TREATS”) steps to the left. Conversely, we would add to the semantic vector for TB the elemental vector for INH rotated 39 steps to the right. In this way we can encode the predication connecting these concepts.
We also construct a general distributional model of the UMLS concepts in the database of predications using the Reflective Random Indexing (RRI) model (15), by Table I: Order-based retrieval from MEDLINE. The “?”
creating document vectors for each unique PubMed ID in denotes the relative position of the target term. the database. Document vectors are created based on the terms contained in these citations: elemental vectors are In this paper, we adapt Sahlgren et al's method of assigned to each term, and document vectors are encoding word order information into a vector space to constructed as the normalized linear sum of the elemental encode semantic predications produced by the SemRep vector for each term they contain. Rather than using raw system (5), (11). SemRep combines general linguistic term frequency, we employ the log-entropy weighting processing, a shallow categorical parser and scheme, shown to enhance document representations in underspecified dependency grammar, with domain- several applications (3). A vector for each concept is specific knowledge resources: mappings from free text to constructed as the frequency-weighted normalized linear the UMLS accomplished by the MetaMap software (12), sum of the vector for each document it occurs in. the UMLS metathesaurus and semantic network (13) and the Specialist lexicon and lexical tools (14). SemRep PSI requires a modification of the conventional nearest uses these techniques to extract semantic predications, neighbor approach, as we are interested in the strongest from titles and abstracts in the MEDLINE database, as association between concepts across all predications. In shown in this example drawn from (5). Given the excerpt the modified semantic network used by SemRep (16), “… anti-inflammatory drugs that have clinical efficacy in there are 40 permitted predications between concepts the management of asthma,.”, SemRep extracts the when negations (e.g. exercise DOES NOT TREAT hiv) following semantic predication between UMLS concepts: are excluded. Semantic distance in PSI is measured by extracting all permutations of a concept, and comparing “Anti-Inflammatory Agents TREATS Asthma” the second concept to these to find the predication with

the strongest association. For elemental vectors, we Predication-based Nearest Neighbor Search
employ a sparse representation used in our previous work (2) which represents the dimension and sign of each of the 20 non-zero values. This allows for rapid generation of all possible permutations by augmenting the value that represents the index of each non-zero value. To further speed up this process in the EpiphaNet example (Figure 1), we extract the 500 nearest neighbors to a cue concept from the general distributional space (this should subsume the predication-based space: every concept in a predication must co-occur in a citation with the other concept concerned). We then perform predication-based nearest-neighbor search on these neighbors only. As it is possible to search either using elemental vectors as cues to retrieve semantic vectors or vice-versa, for the quantitative evaluations we assess associations in both directions to ensure accessing the strongest association.
Results and Discussion
Predication-based retrieval
In a manner analogous to the order-based retrieval
illustrated previously, it is possible to perform
Figure I: EpiphaNet for “staphylococcus aureus”
predication-based retrieval using permutations to It is possible to rapidly characterize a particular concept determine which UMLS concept the model has encoded for exploratory purposes by first finding the k-nearest with strong association to another concept in a particular neighbors in a general associative space, and searching predication relationship. Table II illustrates predication- amongst these for the best predications using PSI. Figure based retrieval. For example, the query “? TREATS I illustrates the nearest predication-based neighbors of Asthma” retrieves concepts for asthma treatments the concept “staphylococcus_aureus” which we have (sb-240563, also known as Mepolizumab, has recently extracted and visualized with the EpiphaNet software we been shown to reduce exacerbations in asthma (17)) . have developed for this purpose. EpiphaNet is based on the Prefuse visualization library (18) and as in our previous work (2) uses Pathfinder network scaling (19) to reveal the most significant associative links within a network of near neighbors. By reversing the encoding process used in PSI, we are able to retrieve both the type and direction of the predication relationship linking these concepts. This measure of semantic distance is different in nature to those used in prior distributional models. Rather than conflating many types of association into a single metric, this estimate is based on the strongest 1: salmeterol+fluticasone 0.33: vaginalis typed association between these concepts across all predications. Similar to the way in which existing distributional models extract compact vector-based term representations from large corpora, the PSI model produces a compact representation for all UMLS Table II: Predication-based retrieval with cosine
concepts in the 8.8GB database of semantic predications. associations between query and target concepts. The set of semantic vectors used for the PSI space used Interestingly, the top ranked results are not necessarily to generate Figure I occupies 300MB only, and stored the concepts that occur most frequently in this elemental vectors occupy a fraction of this space due to predication relationship. Rather, these results reflect the the sparse representation employed. To further assess the extent to which this relationship defines a particular extent to which predications are accurately encoded and concept, as the model represents concepts in terms of the retrieved, we extract at random 1000 concepts, and predications in which they occur in an extensional retrieve their 20 nearest predication-based neighbors. We manner. Concepts occurring exclusively in a particular consider neighbors with a cosine association above a predication with another concept are likely to rank highly threshold of the mean cue-to-neighbor association for in predication-based retrieval. As this is not ideal for these 1000 terms as “retrieved”. Using the database of many purposes, our future work will explore variants of predications extracted by SemRep as a gold standard, we PSI that select for frequency rather than exclusivity. o Precision = retrieved and accurate / all retrieved assigned UMLS class and predication-based = predications retrieved / minimum(20, up) distributional similarity may be a useful way to reveal inconsistencies in the assignment of semantic class where up denotes the number of unique predications for and/or the assignment of predications by SemRep.
cue term in the database. Results are shown in Table III.
Modeling Analogy
We find it is possible to model analogy within the PSI space by finding the predication that most strongly associates two terms and applying the rotation that corresponds to this predication to a third. While this work is presently at an early stage of development, it has produced some interesting results so far (Table V). Table III: Results for 1000 randomly selected concepts.
The model performs better for cue concepts with fewer unique predications: recall when only concepts with 20 or less unique predications are considered is 0.74, 0.8 and 0.8 at 500, 1000 and 1500 dimensions respectively, with precision at 0.95 and above. This suggests that vectors for concepts involved in many predication relationships acquire a spurious similarity to other vectors due to Table V: Analogical reasoning in PSI-space.
partial overlap between permuted elemental vectors. We anticipate this overlap would reduce as dimensionality Application to Information Retrieval
increases. In practice we find that concepts such as Similarly to the way in which distributional models “patient” that are involved in many unique predications extract compact vector-based term representations from tend to be uninformative. It is also possible to eliminate large corpora, the PSI model produces a compact spurious neighbors by only considering terms that occur representation of the predication relations captured by in a document with the cue term as retrieval candidates. SemRep. The knowledge encoded in the PSI model could be used for information retrieval in several ways. One Implicit Encoding of Semantic Type
possibility would be to represent documents in terms of As illustrated by the results of the cosine-based nearest the predications contained therein, and allow users to neighbor search in Table IV, the PSI space to some extent search for documents containing concepts in a specific captures the semantic class of UMLS concepts.
predication relationship with a search concept. We anticipate that once customized for this purpose, PSI will retrieve documents providing answers to clinical questions such as “what treats Tuberculosis” or “what causes Bullous Impetigo”. Another possibility would be the use of the approach taken in Figure I to categorize documents according to the way in which they are related to a particular search concept. In our future work we will evaluate these approaches on standard test collections. Table IV: Nearest-neighbor searches in PSI-space.

Application to Literature-based Knowledge Discovery
The semantic vector for the disease “asthma” is similar to In our recent work (2),(20),(15) we have used general that for other diseases (and in this case, symptoms), just distributional models to identify potential discoveries by as “amitryptiline” retrieves other antidepressants through identifying pairs of concepts that are relatively close in nearest neighbor search. This finding generalizes to a the space but do not co-occur in any of the documents in degree: amongst the ten-nearest neighbors of 1000 the database used to generate the models. Although this randomly selected terms, an average of 37% share a method has proven to be effective in identifying UMLS semantic type with the cue term. This is interesting indirect connections, the interesting ones tend considerably higher than the result of approximately 5% to occur along with others of little interest. In general, obtained when the same evaluation is performed using additional constraints are needed to narrow the either RI (9)or RRI (15) (all spaces at d=500), and varies possibilities. The predications resulting from the methods across semantic types, with several semantic classes such presented here offer a promising means to limit the as “plant” exhibiting in excess of 80% agreement indirect connections by selecting those with appropriate between cue and neighbor. This is to be expected, as the predication relationships. For example, when looking for extraction of predications by SemRep is constrained by new treatments for a disorder, concepts that serve as the UMLS semantic type of the subject and object. treatments should be given priority over concepts in other However, further analysis of the interplay between predications. With these methods, general word space similarity can be elaborated into the greater specificity found in semantic network models (21).
Jones MN, Mewhort DJK. Representing word Limitations and Future Work
meaning and order information in a composite This paper presents the theoretical and methodological holographic lexicon. Psych. Review. 2007 ;1141-37.
basis for PSI, a novel distributional model that encodes Sahlgren M, Holst A, Kanerva P. Permutations as a predications produced by SemRep, and provides some Means to Encode Order in Word Space. Proc. 30th illustrative examples and possible applications. Further Annual Meeting of the Cognitive Science Society analysis is needed to determine the model parameters that (CogSci'08), July 23-26, Washington D.C.; 2008 ; optimize performance in each of these tasks. We do not Kanerva P, Kristofersson J, Holst A. Random evaluate the performance of SemRep, as this has been indexing of text samples for latent semantic evaluated elsewhere (5,16). In our future work we will analysis. Proc. of 22nd Annual Conference of the explore applications of PSI to informatics problems, including information retrieval, knowledge discovery and 10. Widdows D, Ferraro K. Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application. Sixth Conclusion
International Conference on Language Resources PSI is a novel distributional model that encodes predications produced by the SemRep system, providing 11. Rindflesch TC, Fiszman M, Libbus B. Semantic a more specific measure of semantic similarity between interpretation for the biomedical research literature. concepts than is provided by existing distributional Medical informatics: Knowledge management and models, as well as the ability to retrieve the type of predication that most strongly associates two concepts. 12. Aronson AR. Effective mapping of biomedical text From a theoretical perspective, this is desirable as the to the UMLS Metathesaurus: the MetaMap program. unit of analysis in cognitive models is considered to be an object-relation-object triplet, not an individual term. 13. Bodenreider O. The unified medical language From a practical point of view, the additional information encoded by PSI is likely to be of benefit for information terminology. Nucleic Acids Research. 2004; 32 retrieval and knowledge discovery purposes. In our future work we will evaluate the application of PSI to 14. Browne AC, Divita G, Aronson AR, McCray AT. these and other informatics problems.
UMLS language and vocabulary tools. In: AMIA Annu Symp Proc. 2003. p. 798.
Acknowledgments
15. Cohen T, Schvaneveldt R, Widdows D. Reflective We would like to acknowledge Dominic Widdows, chief Random Indexing and Indirect Inference: A Scalable instigator of Semantic Vectors (10), some of which was Method for the Discovery of Implicit Connections. adapted to this work, and Sahlgren, Holst and Kanerva for their remarkable contribution to the field.
16. Ahlers CB, Fiszman M, Demner-Fushman D, Lang References
Cohen T, Widdows D. Empirical distributional semantics: Methods and biomedical applications. 17. Haldar P, Brightling CE, Hargadon B, Gupta S, Monteiro W, Sousa A, et al. Mepolizumab and Random Indexing and Pathfinder Networks. AMIA exacerbations of refractory eosinophilic asthma. N Engl J Med. 2009 Mar 5;360(10):973-84.
Landauer TK, Dumais ST. A solution to Plato's 18. Heer J, Card SK, Landay JA. prefuse: a toolkit for problem: The latent semantic analysis theory of interactive information visualization. Human acquisition, induction, and representation of Factors in Computing Systems. 2005 ;421-430.
knowledge. Psych. Review. 1997 ;104211-240.
19. Schvaneveldt RW. Pathfinder associative networks: Kintsch W. Comprehension : a paradigm for studies in knowledge organization. Ablex cognition. Cambridge, ; New York, NY: Cambridge Publishing Corp. Norwood, NJ, USA; 1990. 20. Schvaneveldt, RW, Cohen, TA. Abductive Rindflesch TC, Fiszman M. The interaction of Reasoning and Similarity. In: In: Ifenthaler D, Seel domain knowledge and linguistic structure in natural NM, editor(s). Computer based diagnostics and language processing: interpreting hypernymic systematic analysis of knowledge. Springer, NY; propositions in biomedical text. JBI 2003;36462-477 21. Quillian MR. Semantic memory. Minsky, M., Ed. Pado S, Lapata M. Dependency-Based Construction Semantic Information Processing. 1968 ;216-270.

Source: http://interlinkinc.net/Roger/Papers/CohenSchvaneveldtRindfleschAMIA2009_R2.pdf

110429 himj 41-2.indd

Social media in health – what are the safety concerns for health consumers? Annie Y.S. Lau, Elia Gabarron, Luis Fernandez-Luque and Manuel Armayones Abstract Recent literature has discussed the unintended consequences of clinical information technologies (IT) on patient safety, yet there has been little discussion about the safety concerns in the area of consumer health IT . This

saocamilo-sp.br

artigo originaL / research report / artícuLo Detection of antimicrobial-resistant gram-negative bacteria in hospital effluents and in the sewage treatment station of Goiânia, Brazil Detección de bacterias gram negativas resistentes a antimicrobianos en efluentes hospitalarios y en la estación de tratamiento de aguas residuales de Goiânia, Brasil Detecção de bactérias