Encoding the Parsing Tree Path Survey or Literature Review

Discovering microbe-disease associations from the literature using a hierarchical long short-term memory network and an ensemble parser model

Yesol Park

¹Department of Computer Science and Engineering, Hanyang University, Seoul, Korea

Joohong Lee

¹Department of Computer Science and Engineering, Hanyang University, Seoul, Korea

Heesang Moon

¹Department of Computer Science and Engineering, Hanyang University, Seoul, Korea

Yong Suk Choi

¹Department of Computer Science and Engineering, Hanyang University, Seoul, Korea

Mina Rho

¹Department of Computer Science and Engineering, Hanyang University, Seoul, Korea

²Department of Biomedical Informatics, Hanyang University, Seoul, Korea

Received 2020 Dec 9; Accepted 2021 Feb 8.

Abstract

With recent advances in biotechnology and sequencing technology, the microbial community has been intensively studied and found to be associated with many chronic as well as acute diseases. Even though a tremendous number of studies describing the association between microbes and diseases have been published, text mining methods that focus on such associations have rarely been studied. We propose a framework that combines machine learning and natural language processing methods to analyze the association between microbes and diseases. A hierarchical long short-term memory network was used to detect sentences that describe the association. For the sentences determined, two different parse tree-based search methods were combined to find the relation-describing word. The ensemble model of constituency parsing for structural pattern matching and dependency-based relation extraction improved the prediction accuracy. By combining deep learning and parse tree-based extractions, our proposed framework could extract the microbe-disease association with higher accuracy. The evaluation results showed that our system achieved F-scores of 0.8764 and 0.8524 in binary decisions and extracting relation words, respectively. As a case study, we performed a large-scale analysis of the association between microbes and diseases. Additionally, a set of common microbes shared by multiple diseases was also identified in this study. This study could provide valuable information on the major microbes that were studied for a specific disease. The code and data are available at https://github.com/DMnBI/mdi_predictor.

Subject terms: Data mining, Literature mining, Machine learning, Computational biology and bioinformatics, Microbiology

Introduction

With recent advances in biotechnology and sequencing technology, beneficial and deleterious effects of bacterial composition in humans and animals have been rigorously investigated. In particular, large-scale studies have extensively investigated the microbial composition associated with a specific disease^¹–³. Several studies have reported the effects of diverse microbes on diverse diseases^⁴–⁶, including cancer^⁷, vascular disease^⁸, and autoinflammatory disease^⁹. Critical bacterial infections cause serious problems and even death^¹⁰. Determining the role or the correlation of microbes in the development of a disease is very important for understanding disease pathology and diagnosis markers.

Several studies have provided databases of curated taxonomic information or sequencing resources related to microbes and diseases. For instance, the Human Microbe-Disease Association Database (HMDAD) has 483 microbe-disease associations manually curated from 61 previously published articles^¹¹. The Human Pan-Microbe Communities Database (HPMCD) provides over 1800 curated human gastrointestinal metagenome resources^¹². gutMDisorder provides microbe-related disorder and intervention information extracted from scientific articles^¹³. Even though these databases are valuable resources for analyzing disease-related microbial information, such information was extracted from a limited number of publications. In order to use the wide resources publicly available as scientific articles more systematically and comprehensively, efficient text mining methods need to be developed. Recently developed computational methods predict the association between microbes and diseases^{¹⁴–¹⁹}. Such predictions are made from pre-defined microbe-disease association networks by using various graph algorithms and kernel functions. For instance, KATZHMDA applied the KATZ measure to calculate the potential similarity between microbes and diseases using a microbe-disease association network^¹⁴. Several variations were also introduced by using a depth-first search, neighbor-based collaborative filtering, Laplacian regularized least squares, bidirectional label propagation, and bi-random walk^{¹⁵–¹⁹}.

The number of published biomedical articles increases at an exponential rate, and extracting information from such a large-scale collection of literature is costly. Efficient text mining methods have emerged to address this problem. Methods have been developed using named entity recognition (NER), normalization of the entities, relation extraction, and relation classification^{²⁰–²⁴}. The NER and normalization of entities are important preprocessing steps for extracting relational information. In recent years, machine learning approaches such as conditional random fields and neural networks have been dominant^{²⁰–²⁵}. For instance, BANNER^²⁰ is a trainable biomedical named entity recognition system based on conditional random fields^²⁶. Recurrent neural networks (RNN) have shown good performance in natural language processing, and long short-term memory (LSTM) was developed to add cell states to the RNN, which mitigated the vanishing gradient problem^²⁷. DNorm^²² is a system used for normalizing disease names in biomedical texts by learning the similarities between mentions and concept names based on pairwise learning to rank^²⁸. Collections of biomedical terms such as Gene Ontology (GO)^²⁹, BioThesaurus^³⁰, the Unified Medical Language System^³¹, Medical Subject Headings (MeSH) terms^³², and the Comparative Toxicogenomics Database^³³ have been used to solve this problem. Resources such as the NCBI disease corpus^³⁴ and the BioCreative V CDR corpus (BCVCDR)^³⁵ have been used as the gold standard in training and testing data for NER and normalization.

To extract or classify relationships between biomedical entities, rule-based decision, pattern matching, and machine learning have been explored^{²⁵,³⁶–⁴⁶}. RelEx is a method for predicting interactions by applying rules to dependency parse trees, focusing on the relationship between genes and proteins^³⁶. @MInter predicts interactions between microorganisms using support vector machines and builds a database^³⁷. Protein–protein interaction and drug–drug interaction (DDI) have been explored in the biomedical literature to identify positive and negative influences between proteins and between drugs, respectively^{²⁵,³⁸–⁴²}. For extraction of such relations, two different types of results are expected. First, the relation between entities is directly detected, and second, such a relation is classified into predefined classes, such as ADVICE, EFFECT, INT, and MECHANISM in DDI. The relationship between diseases and genes has also been explored^{²⁵,⁴³,⁴⁴}. The microbial phenotypic traits and other associations that were obtained from the literature have been investigated by network analysis^⁴⁶.

Although a tremendous amount of literature related to microbes and diseases is available, text mining methods that focus on the relation between microbes and diseases have rarely been studied to date. In this study, we have developed a method that extracts the microbe-disease relationship from the biomedical literature by combining natural language processing (NLP) and machine learning methods. Our NER and normalization methods for microbes and diseases were applied in the pre-processing. A variant of RNN was constructed to obtain sentences that contain a microbe-disease relationship. Afterwards, the relation words were predicted from the retained sentences by combining the results from two different parsing methods. As a case study, a large-scale microbe-disease relation network analysis was performed to provide valuable information on whether a set of specific microbes is common or exclusive to a given disease. Since the proposed method provides a systematic way of extracting microbe-disease relations with high accuracy from scientific literature, it can be a useful resource for studying the microbial involvement in disease development and pathophysiology in a comprehensive manner. Considering the massive size of the scientific literature, the current databases of microbe-disease relations contain only a limited number of publications. Therefore, our large-scale text analysis approach could provide more detailed information on the microbe-disease relation.

Materials and methods

The proposed system consists of three steps: (1) NER that annotates the terminology for microbe and disease using a dictionary-based method and a semi-Markov model; (2) binary classification for relation detection using a hierarchical LSTM model; and (3) an ensemble method for relation extraction, which uses constituency parsing-based structural pattern matching and dependency-based relation extraction (Fig.1). In the ensemble method, confidence scores were calculated to extract relations more accurately by letting the two approaches complement each other.

Figure 1

Workflow of extracting the associations between microbes and diseases. The NER process uses two approaches: a semi-Markov model for disease and a dictionary-based method for bacteria, bacterial strain, and virus. The relation is determined using a hierarchical LSTM, and the relation word is extracted by an ensemble model of constituency parsing-based and dependency-based methods.

Collection of biomedical corpus and named entity recognition

To evaluate the performance of relation detection, two different data sets were used in this study. The first data set is a gold standard for drug–drug interaction, DDIE2013^{⁴⁷,⁴⁸}. The second data set of microbe-disease association was generated in-house. A total of 1100 random sentences that contain the names of both a disease and a microbe were obtained from PubMed abstracts. The relation between the two entities of microbe and disease was manually annotated by domain experts. If one or more relations were found between entities in the sentence, each pair was annotated. Among the words that describe the relation, the more specific word was regarded as the relation word. For instance, in the sentence "BAC00Pseudomonas_aeruginosa is a pathogen that often causes DIS00acute_lung_injury.", the word causes was regarded as the relation word. Among the 1100 sentences, 1000 were used for training, and 100 sentences were used for testing.
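The pair-wise annotation described above can be illustrated with a minimal sketch. It assumes that entities are marked with the BAC00 and DIS00 prefixes seen in the example sentence; the regular expression and the function name `candidate_pairs` are hypothetical and are not taken from the released code.

```python
import re
from itertools import product

# The prefixes follow the tagged example in the text; the regex is an assumption.
ENTITY = re.compile(r"\b(BAC00|DIS00)\w+")

def candidate_pairs(sentence):
    """Return every (microbe, disease) pair of tagged entities in a sentence."""
    microbes, diseases = [], []
    for match in ENTITY.finditer(sentence):
        target = microbes if match.group(1) == "BAC00" else diseases
        target.append(match.group(0))
    return list(product(microbes, diseases))

sentence = ("BAC00Pseudomonas_aeruginosa is a pathogen that often causes "
            "DIS00acute_lung_injury.")
pairs = candidate_pairs(sentence)
```

A sentence with several microbes or diseases yields one candidate per combination, which matches the statement that each pair is annotated separately.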

The entities of microbe and disease were recognized independently from the sentences. For disease, we performed NER and normalization using TaggerOne^²⁴, a machine learning tool that recognizes and normalizes multiple concept entities using a semi-Markov model. TaggerOne divides the sentences into segments consisting of one or more tokens. It then performs NER and normalization simultaneously by estimating the score for each segment as the sum of the NER score and the normalization score. We used two TaggerOne models, which were trained using the NCBI and BCVCDR corpora, respectively. In order to avoid mis-annotation of disease names with bacteria and virus names, dictionary-based NER was also applied based on the NCBI taxonomy data. The bacterial names were extended to include specific strain information. The lists of bacteria, bacterial strains, and viruses were downloaded from the NCBI website.

Relation detection with hierarchical long short-term memory

System overview for relation detection between entities

We used a hierarchical LSTM to determine the existence of a relationship between bacteria and disease. The LSTM was constructed hierarchically around the entities; the design was adapted and improved from a previous study^⁴⁹. The hierarchical LSTM model consists of six layers: an input layer, embedding layer, attention layer, bottom LSTM layer, top LSTM layer, and output layer (Fig.2). The input of the hierarchical LSTM includes a sentence and its shortest dependency path. A sentence contains two entities, which divide the sentence into three phrases: the words before the first entity, the words between the two entities, and the words after the second entity. The shortest dependency path was obtained from the sentence by the Stanford dependency parser to further consider contextual meaning.

Figure 2

The overview of the hierarchical long short-term memory model in this study. Words, part of speech (POS) for words, dependency tags for words, and relative positions to entities for words are used as features in the model. The model consists of an embedding layer, attention layer, bottom LSTM, top LSTM, and softmax classifier. As input, the phrase before the first entity ( ${Seq}_{1}$ ), the first entity ( $E_{1}$ ), the phrase between the two entities ( ${Seq}_{2}$ ), the second entity ( $E_{2}$ ), the phrase after the second entity ( ${Seq}_{3}$ ), and the shortest dependency path (SDP) are entered into the model.

Embedding layer

In the embedding layer, each sentence was divided into words that were vectorized. For word embedding, we used word2vec models^⁵⁰, which were retrained with a corpus from DDI^{⁴⁷,⁴⁸} and the bacteria-disease relation corpus generated in this study, using biomedical scientific literature in PubMed and PMC^⁵¹. The additional features for each word were obtained from part of speech (POS), dependency tag, and positions. POS and dependency tags were obtained by the Stanford dependency parser, and they could better express the word because the POS and dependency tags of the same word differ depending on the sentence. For POS and dependency tag embedding, a word2vec model was applied. The position of each word was the relative distance from the word to the entities. Positions were represented by one-hot encoding depending on the distance. The vector size for word embedding, POS, dependency tag, and positions was 200, 10, 10, and 20, respectively.
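The one-hot position feature can be sketched as follows. The text gives only the vector size (20); how distances are mapped to the 20 slots is not stated, so the symmetric clipping used here (distances from -10 to 9) is an assumption, and `position_one_hot` is a hypothetical helper name.

```python
def position_one_hot(word_idx, entity_idx, size=20):
    """One-hot vector for the clipped relative distance between a word and an entity."""
    half = size // 2
    distance = word_idx - entity_idx
    # Clip to [-half, half - 1], then shift into the index range 0..size-1.
    bucket = max(-half, min(half - 1, distance)) + half
    vector = [0] * size
    vector[bucket] = 1
    return vector

at_entity = position_one_hot(3, 3)    # distance 0
far_left = position_one_hot(0, 50)    # clipped to the leftmost slot
far_right = position_one_hot(50, 0)   # clipped to the rightmost slot
```

Clipping keeps the vector length fixed for sentences of any length, at the cost of merging all distant words into the two outermost slots.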

Attention layer

In the attention layer, entity-based attention^⁵² was used. In entity-based attention, the weight of the word $w_{i}$ based on the entities $e_{1}$ and $e_{2}$ is defined as follows:

$\theta_{w_{i}}^{k} = \frac{\exp (dot (w_{i}^{word}, e_{k}^{word}))}{\sum_{j = 1}^{m} \exp (dot (w_{j}^{word}, e_{k}^{word}))} \quad (k \in \{1, 2\})$

If a specific word is closer to an entity than another word in the embedding space, it is given more weight through the dot product. Since we classify sentences with respect to the entities, we control the weights according to the entities.
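The attention weight is a softmax over dot products between each word vector and one entity vector. The sketch below computes it in plain Python; the two-dimensional toy embeddings are hypothetical values chosen only to show the ordering of the weights, and subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math

def entity_attention(word_vectors, entity_vector):
    """Softmax over dot products between each word vector and one entity vector."""
    scores = [sum(w * e for w, e in zip(vec, entity_vector)) for vec in word_vectors]
    peak = max(scores)  # subtracting the maximum leaves the softmax unchanged
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

words = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy embeddings (hypothetical)
entity = [1.0, 0.0]
weights = entity_attention(words, entity)
```

The word whose embedding has the largest dot product with the entity receives the largest weight. The function would be called once per entity, corresponding to k = 1 and k = 2 in the formula.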

LSTM layers

The bottom LSTM layer consists of four LSTMs: three LSTMs for the three phrases in a sentence that are divided by the two entities, and an LSTM for the shortest dependency path. The LSTMs for the three phrases have a fixed length of 60 time steps, and the LSTM for the shortest dependency path has a fixed length of 12 time steps. Each LSTM has a hidden size of 100. In an example sentence, "additionally, in otherwise healthy people, vulnificus causes wound infection that can require amputation or lead to sepsis", two pairs of relations, between vulnificus and infection and between vulnificus and sepsis, were extracted. For the pair of vulnificus and infection, three phrases, "additionally, in otherwise healthy people", "causes wound", and "that can require amputation or lead to sepsis", were obtained. The shortest dependency path in the example sentence is "causes wound". Each phrase is padded or truncated to the fixed number of time steps before being fed to the LSTMs. Each LSTM is a many-to-one bidirectional LSTM (bi-LSTM), and the final result of the bottom LSTM layer is a 4 × 200 matrix.
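The splitting and padding steps for the example sentence can be sketched as follows. Whitespace tokenization, the `<PAD>` placeholder, and padding at the end of the phrase are assumptions made for illustration; the text states only that phrases are padded or cut to the fixed number of time steps.

```python
def split_by_entities(tokens, first_idx, second_idx):
    """Split a token list into the phrases before, between, and after two entities."""
    if not 0 <= first_idx < second_idx < len(tokens):
        raise ValueError("entity indices must be ordered and inside the sentence")
    before = tokens[:first_idx]
    between = tokens[first_idx + 1:second_idx]
    after = tokens[second_idx + 1:]
    return before, between, after

def pad_or_cut(tokens, steps, pad="<PAD>"):
    """Pad with a placeholder or truncate so the result has exactly `steps` items."""
    return (tokens + [pad] * steps)[:steps]

tokens = ("additionally , in otherwise healthy people , vulnificus causes wound "
          "infection that can require amputation or lead to sepsis").split()
first, second = tokens.index("vulnificus"), tokens.index("infection")
seq1, seq2, seq3 = split_by_entities(tokens, first, second)
padded = [pad_or_cut(seq, 60) for seq in (seq1, seq2, seq3)]
```

Here `seq2` holds the two tokens "causes" and "wound", the same phrase the text reports between the two entities.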

The top LSTM layer is a bi-LSTM, which consists of six time steps and has a hidden size of 100. Each entity was embedded as a 1 × 200 matrix, which was combined with the results of the bottom LSTM layer to form a 6 × 200 matrix as the input of the top LSTM. The top LSTM outputs a vector of length 200, and this output passes through a feed-forward neural network. The feed-forward neural network finally outputs the binary class results using the softmax function.

Constituency parsing-based structural pattern matching

In order to extract relation words, a parse tree-based structural pattern-matching method, TPEMatcher^⁵³, was adapted to our problem. TPEMatcher uses a tree pattern expression (TPE) as a search query to express the structural pattern of parse trees. It allows the use of regular expressions to reveal string patterns and can express grammatical patterns of parse trees. Furthermore, TPE extracts information from a large text corpus with very low computational complexity.

TPE patterns are matched to each parse tree of a sentence in order to produce the matched parts of the parse tree as a search result. We constructed a set of 59 TPE patterns comprising microbes, diseases, and relation words from the corpus. For example, the TPE pattern "{.+ * {NP * <N.+ 1#BAC00.+> *} <, ,> {NP * <N.+ 3#.+> {PP <IN *> * {NP * <N.+ 2#DIS00.+> *} *} *} <, ,> *}" is one of these patterns; it extracts the relation triplets from an appositive phrase set off by commas (Fig.3).

Figure 3

An example sentence processed by parse tree-based structural pattern matching. A given sentence is parsed to find the structural dependency. From the parsed sentence, two different tree pattern expressions, TPE patterns 1 and 2, were extracted from the 59 predefined TPE patterns. TPE pattern 1 (in blue) extracts the microbe-disease-relation triplet of (DIS00sepsis, BAC00Klebsiella_pneumoniae, cause). TPE pattern 2 (in red) extracts the other triplet of (DIS00pneumonia, BAC00Klebsiella_pneumoniae, associated).

To parse sentences, the Stanford CoreNLP analyzer was applied. Each node of the TPE pattern was matched to a node or a subtree of the parse tree. Among the matched nodes of the parse tree, TPEMatcher extracted the words corresponding to microbes, diseases, and relations from nodes matched by "1#BAC.+", "2#DIS.+", and "3#.+", respectively. In the case of TPE pattern 1, BAC00Klebsiella_pneumoniae, DIS00sepsis, and cause were extracted, and then these words were stemmed and arranged into the triplet (BAC00Klebsiella_pneumoniae, DIS00sepsis, cause) as the final output of TPEMatcher (Fig.3). In addition, another triplet relation was found: (BAC00Klebsiella_pneumoniae, DIS00pneumonia, associate). To extract such a triplet, we crafted another TPE pattern "{S * {NP * <N.+ BAC00.+> *} * <VP <VBN .+> * <PP <IN *> * {NP * <N.+ DIS00.+> *} *> *> *}" for passive sentences with a past participle (Fig.3). As a final result, TPEMatcher extracted two triplets, (BAC00Klebsiella_pneumoniae, DIS00sepsis, cause) and (BAC00Klebsiella_pneumoniae, DIS00pneumonia, associate), from the given sentence.

Dependency parsing-based relation extraction

In order to extract relations between microbes and diseases, dependency trees were built from the sentences using the Stanford CoreNLP library^⁵⁴. Since dependency parsing captures long-range syntactic relations, it can be complementary to constituency parsing in relation extraction. Before the tree is traversed, three preprocessing steps were performed to simplify the prediction procedure: (1) chunking a group of words with the pattern of word (of|with) entity and a compound relation between the entity and its parent node; (2) excluding a pair of entities with a distance of more than 4 in the dependency tree (edges of conj, conj:and, conj:or, compound, or appos were not counted in the distance); and (3) extracting simple effector-effected relations that are connected by prepositions and relation words such as by, in, from, on, with, of, due to, induced, and between.
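Preprocessing step (2) amounts to a path length in which some edge labels cost nothing. The sketch below treats the dependency tree as an undirected graph and runs a 0-1 breadth-first search; the edge list is a hand-written, hypothetical parse and not actual CoreNLP output, and `tree_distance` is an illustrative name.

```python
from collections import defaultdict, deque

# Edge labels that the text says are not counted in the distance.
FREE_LABELS = {"conj", "conj:and", "conj:or", "compound", "appos"}

def tree_distance(edges, source, target):
    """Path length between two nodes, ignoring edges whose label is in FREE_LABELS."""
    graph = defaultdict(list)
    for head, dependent, label in edges:
        cost = 0 if label in FREE_LABELS else 1
        graph[head].append((dependent, cost))
        graph[dependent].append((head, cost))
    best = {source: 0}
    queue = deque([source])
    while queue:  # 0-1 breadth-first search
        node = queue.popleft()
        for neighbour, cost in graph[node]:
            candidate = best[node] + cost
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                if cost == 0:
                    queue.appendleft(neighbour)
                else:
                    queue.append(neighbour)
    return best.get(target)

edges = [  # (head, dependent, label); hypothetical parse
    ("common_cause", "BAC00Streptococcus_pneumoniae", "nsubj"),
    ("common_cause", "DIS00sepsis", "nmod:of"),
    ("DIS00sepsis", "DIS00meningitis", "conj:and"),
]
distance = tree_distance(edges, "BAC00Streptococcus_pneumoniae", "DIS00meningitis")
keep_pair = distance is not None and distance <= 4
```

The conj:and edge contributes nothing, so the coordinated disease is as close to the microbe as the first disease is.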

In the dependency tree, the subtree rooted at the lowest common ancestor (LCA) node of the two entities carries the essential information about the relation between the entities. In addition to the LCA node, a more descriptive relation word can be found in a child node of the LCA. If the LCA node has a child node that is connected by an edge such as acl, acl:relcl, amod, xcomp, ccomp, appos, nmod:as, conj:and, conj:or, advcl, or dep, the child is assigned as the relation word. For example, the relation word implicated was obtained by our algorithm from the sentence "BAC00Stenotrophomonas_maltophilia is an emerging pathogen implicated in an increasing number of DIS00severe_pulmonary_infections." (Fig.4). When one of the two entity nodes is the LCA, the relation between the entities is extracted from the edge. The candidate pair might not have an LCA node. When the edge is not a preposition, the word combined with the entity as a chunk is the relation.

Figure 4

An example of a microbe-disease relation extracted from a dependency tree. Two entity nodes, 'BAC00Stenotrophomonas_maltophilia' and 'DIS00severe_pulmonary_infections', have a lowest common ancestor of 'pathogen', which has a more descriptive child node, 'implicated', that itself has no more descriptive child node. Therefore, 'implicated' is extracted as the relation word between the two entities of microbe and disease.
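The LCA rule can be sketched on the example sentence as follows. The parent table is a hand-written, hypothetical parse. The sketch also assumes that only LCA children lying on a path towards one of the entities are considered, which the text does not state explicitly; without that restriction the amod child 'emerging' would also qualify. The case in which an entity is itself the LCA is not covered.

```python
DESCRIPTIVE = {"acl", "acl:relcl", "amod", "xcomp", "ccomp", "appos",
               "nmod:as", "conj:and", "conj:or", "advcl", "dep"}

def ancestors(parents, node):
    """Return the path from a node up to the root, the node itself included."""
    path = [node]
    while node in parents:
        node = parents[node][0]
        path.append(node)
    return path

def relation_word(parents, entity_a, entity_b):
    """Pick the LCA, or a descriptive child of it on the way to an entity."""
    path_a = ancestors(parents, entity_a)
    path_b = ancestors(parents, entity_b)
    on_path_b = set(path_b)
    lca = next(node for node in path_a if node in on_path_b)
    for path in (path_a, path_b):
        position = path.index(lca)
        if position >= 2:  # a non-entity child of the LCA lies on this path
            child = path[position - 1]
            if parents[child][1] in DESCRIPTIVE:
                return child
    return lca

parents = {  # dependent -> (head, edge label); hypothetical parse
    "BAC00Stenotrophomonas_maltophilia": ("pathogen", "nsubj"),
    "emerging": ("pathogen", "amod"),
    "implicated": ("pathogen", "acl"),
    "number": ("implicated", "nmod:in"),
    "DIS00severe_pulmonary_infections": ("number", "nmod:of"),
}
word = relation_word(parents, "BAC00Stenotrophomonas_maltophilia",
                     "DIS00severe_pulmonary_infections")
```

With this parse the LCA is 'pathogen' and the returned relation word is 'implicated', in agreement with the example in the text.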

When more than two words of the same entity type are connected, this phrase is represented as ancestor–descendant nodes in the tree. For instance, consider the annotated sentence "BAC00Streptococcus_pneumoniae, the pneumococcus, is the most common_cause of DIS00sepsis and DIS00meningitis." The entity DIS00meningitis has a parent node of the same type, which is DIS00sepsis. When DIS00sepsis has a common_cause relationship with BAC00Streptococcus_pneumoniae, the relation between BAC00Streptococcus_pneumoniae and DIS00meningitis is also common_cause because it inherits the relationship from the parent node.

Ensemble model to combine relations

To combine the results from the two complementary approaches of relation extraction, an ensemble model was applied. We assumed that the correctness of an extracted relation triplet is highly dependent on its relation word and extraction patterns in each module. Thus, we define the confidence scores based on Bayes' theorem to determine the reliability of an extracted relation. The confidence of a relation triplet is determined by the maximum likelihood of patterns that extract the triplet as follows:

$Conf (r_{j}) = \max_{i} Pr (p_{i} | r_{j})$

where $r_{j}$ is an extracted relation triplet that contains a relation word j, and $p_{i}$ is the i-th pattern that extracts the triplet. $Pr (p_{i} | r_{j})$ is the probability that the pattern $p_{i}$ correctly extracts the relation word $r_{j}$ .

The conditional probability $Pr (p_{i} | r_{j})$ is calculated as given in Eq. (4),

$Pr (p_{i} | r_{j}) = \frac{Pr (r_{j} | p_{i}) Pr (p_{i})}{Pr (r_{j})} = \frac{Pr (r_{j} | p_{i}) Pr (p_{i})}{Pr (r_{j} | p_{i}) Pr (p_{i}) + Pr (r_{j} | \neg p_{i}) Pr (\neg p_{i})} \quad (4)$

where $Pr (p_{i})$ is the prior probability that $p_{i}$ is correct, which is equivalent to the precision of the pattern, and $Pr (r_{j} | p_{i})$ is the probability that the pattern $p_{i}$ extracts $r_{j}$ when $p_{i}$ is correct.
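The confidence computation can be sketched as follows: the posterior of each pattern is obtained from Bayes' theorem, and the confidence of a triplet is the maximum over the patterns that extracted it. All numeric values in the example are hypothetical; how the likelihood terms are estimated from the training data is not described in this section and is not modelled here.

```python
def pattern_posterior(precision, p_rel_given_correct, p_rel_given_incorrect):
    """Pr(p_i | r_j) from Bayes' theorem, with Pr(p_i) set to the pattern precision."""
    numerator = p_rel_given_correct * precision
    evidence = numerator + p_rel_given_incorrect * (1.0 - precision)
    return numerator / evidence if evidence > 0 else 0.0

def triplet_confidence(pattern_stats):
    """Confidence of a triplet: the largest posterior among its extracting patterns."""
    return max(pattern_posterior(*stats) for stats in pattern_stats)

# (precision, Pr(r_j | p_i), Pr(r_j | not p_i)) per pattern; hypothetical values
stats = [(0.9, 0.2, 0.05), (0.6, 0.1, 0.3)]
confidence = triplet_confidence(stats)
accepted = confidence >= 0.5  # a threshold of 0.5 is reported later in the Results
```

Taking the maximum means that a single reliable pattern is enough to accept a triplet, even when a weaker pattern also matched it.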

Results

Performance evaluation of relation detection and extraction

We first evaluated our model using the DDIE2013 data obtained from a previous study^{⁴⁷,⁴⁸}. Because the label in this dataset is the absence or presence of a relationship between entities, we only evaluated the first part of our method, which is relation detection. The training set consisted of 4018 positive DDIs and 23,756 negative DDIs, and the test set consisted of 979 positive DDIs and 4737 negative DDIs. We used the softmax function and the Adam optimizer for binary classification and measured precision, recall, and F-score. In the training, the learning rate was 0.001, the number of training epochs was 30, the input layer dropout rate was 0.7, and the output layer dropout rate was 0.5. For the test set, we achieved a precision of 0.822, a recall of 0.778, and an F-score of 0.800 for the binary classification (Table 1). In comparison with the existing methods, our method showed better performance than most of the current machine learning-based methods except one. Compared to Tree-LSTM's two-stage model, the F-score was lower, but precision was 1.6% higher.

Table 1

Performance evaluation for relation detection using the DDIE2013 dataset.

Model	Precision	Recall	F-score
SCNN's two-stage model^³⁸	77.5	76.9	77.2
Tree-LSTM's two-stage model^⁵⁶	80.6	84.2	81.8
pubmedBERT^²⁵	89.2	90.1	89.6
Our two-stage model	82.2	77.8	80.0
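As a consistency check on the reported figures, the F-score can be recomputed from precision and recall. The sketch assumes that the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the text does not state explicitly.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ours = f_score(0.822, 0.778)           # values reported for the test set
precision_gap = round(82.2 - 80.6, 1)  # gap to Tree-LSTM, in percentage points
```

The recomputed value is about 0.799, which agrees with the reported 0.800 up to the rounding of the published precision and recall.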

We also evaluated the entire model of both relation detection and extraction using an in-house evaluation dataset for microbe-disease interaction. Since a gold standard dataset is not available for microbe-disease interaction, we randomly selected 1000 sentences with 1269 positive relations and 572 negative relations from the abstracts downloaded from the PubMed repository (see 'Materials and methods'). We performed tenfold cross-validation to improve the reliability of the evaluation. The sentences were split into 10 subsets of 100 sentences. Nine subsets were used as training data, and the remaining subset was used as validation data. The validation process was performed ten times, and each subset was used as validation data once. Finally, the results were averaged to calculate a single estimate, which resulted in a precision of 0.832, a recall of 0.848, and an F-score of 0.839 for all pairs of microbes and disease, on average (Fig.5). When the accuracy was evaluated by sentence, it resulted in a precision of 0.898, a recall of 0.905, and an average F-score of 0.901, which is slightly higher than that of the entity pair.

Figure 5

Performance of our method on bacteria-disease relation extraction. In tenfold cross-validation, the mean precision was 0.832, the mean recall was 0.848, and the mean F-score was 0.839 for the pairs of bacteria and disease. For the sentences, the mean precision was 0.898, the mean recall was 0.905, and the mean F-score was 0.901.
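The tenfold split described above can be sketched as follows. The shuffling seed and the use of integer identifiers in place of sentences are assumptions for illustration; when the number of items is not divisible by the number of folds, this simple version drops the remainder.

```python
import random

def ten_fold_splits(items, folds=10, seed=0):
    """Yield (training, validation) lists; each fold is the validation set once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // folds
    subsets = [shuffled[i * size:(i + 1) * size] for i in range(folds)]
    for held_out in range(folds):
        training = [item for i, subset in enumerate(subsets)
                    if i != held_out for item in subset]
        yield training, subsets[held_out]

def mean(values):
    """Average the per-fold scores into a single estimate."""
    return sum(values) / len(values)

splits = list(ten_fold_splits(range(1000)))
```

With 1000 sentences, each round trains on 900 and validates on 100, and the per-fold precision, recall, and F-score would then be averaged with `mean`.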

Combined with relation detection, the relation extraction model was evaluated using macro-averaged precision, recall, and F-score as performance measures. To identify a good confidence threshold for the ensemble model, tenfold cross-validation was performed to evaluate the extraction accuracy for each of the 10 confidence thresholds (from 0.0 to 0.9 at intervals of 0.1). As shown in Supplementary Table S1, the F-score is the best when the confidence threshold is 0.5, which was used for further analysis. When comparing the performance of the three approaches (structural pattern matching, dependency-based extraction, and the ensemble of these two methods), the ensemble method shows higher F-scores than either single method, which implies that the methods successfully complement each other in the ensemble model. Table 2 shows the performance for relation extraction with the three approaches using two different confidence thresholds of 0.0 and 0.5.

Table 2

Performance evaluation for extracting relation words between microbe and disease entities.

Method	Confidence threshold = 0.0			Confidence threshold = 0.5
Method	Precision	Recall	F-score	Precision	Recall	F-score
TPE	81.04	67.09	73.28	92.53	64.60	75.89
DBE	77.78	72.09	74.78	90.46	70.87	79.35
Ensemble	77.47	82.63	79.87	89.33	81.76	85.24

TPE: structural pattern matching only; DBE: dependency-based extraction only.
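The threshold search can be illustrated with a small sketch that filters triplets by confidence and keeps the threshold with the best F-score. For brevity it scores a single pooled set of triplets rather than the macro-averaged, cross-validated measure used in the study, and the toy predictions and gold triplets are hypothetical.

```python
def evaluate(predictions, gold, threshold):
    """Precision, recall, and F-score of triplets whose confidence reaches the threshold."""
    kept = {triplet for triplet, score in predictions.items() if score >= threshold}
    true_positive = len(kept & gold)
    precision = true_positive / len(kept) if kept else 0.0
    recall = true_positive / len(gold) if gold else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)

def best_threshold(predictions, gold):
    """Try 0.0, 0.1, ..., 0.9 and return the first threshold with the highest F-score."""
    thresholds = [step / 10 for step in range(10)]
    return max(thresholds, key=lambda t: evaluate(predictions, gold, t)[2])

predictions = {  # triplet -> ensemble confidence (hypothetical)
    ("BAC00a", "DIS00x", "cause"): 0.9,
    ("BAC00b", "DIS00y", "associate"): 0.6,
    ("BAC00c", "DIS00z", "induce"): 0.3,
    ("BAC00d", "DIS00w", "cause"): 0.2,
}
gold = {("BAC00a", "DIS00x", "cause"), ("BAC00b", "DIS00y", "associate"),
        ("BAC00e", "DIS00v", "cause")}
chosen = best_threshold(predictions, gold)
```

Raising the threshold trades recall for precision, which is the pattern visible in Table 2 when moving from 0.0 to 0.5.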

Discovery of frequent associations between microbe and disease

Our system was applied to analyze the microbe-disease associations found in the literature. Abstracts with the keyword 'bacteria' were collected from the Medline literature collection. After applying NER for microbe and disease names, 71,899 sentences were found to contain words related to disease and microbes, from which 52,251 sentences were predicted by our hierarchical LSTM classifier as sentences that describe a microbe-disease association. Using the ensemble model, a total of 60,467 microbe-disease relations were extracted. To better analyze the association with specific diseases, the 14,306 association pairs related to the named entity 'infection' were excluded. For reliability, when the number of pairs for a specific microbe-disease association was below the average frequency (< 4), such an association was not included for further analysis. Finally, a total of 30,085 associations were retained, which were categorized based on the MeSH disease categories (Fig.6).

Figure 6

Distribution over the top-level MeSH disease categories and the bacterial families. The distribution of 30,085 relationships between 432 diseases and 319 bacteria is represented. The bacteria are shown as the top 20 bacterial families and others.
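The two filtering steps (dropping the generic 'infection' entity and dropping rare pairs) can be sketched as follows. The tag `DIS00infection`, the function name, and the toy relations are hypothetical; only the cut-off of 4 comes from the text.

```python
from collections import Counter

def frequent_associations(relations, min_count=4,
                          excluded_diseases=("DIS00infection",)):
    """Count microbe-disease pairs and keep those seen at least min_count times."""
    counts = Counter((microbe, disease)
                     for microbe, disease, _word in relations
                     if disease not in excluded_diseases)
    return {pair: n for pair, n in counts.items() if n >= min_count}

relations = (  # (microbe, disease, relation word); hypothetical extractions
    [("BAC00Helicobacter_pylori", "DIS00gastritis", "cause")] * 5
    + [("BAC00Escherichia_coli", "DIS00sepsis", "associate")] * 3
    + [("BAC00Escherichia_coli", "DIS00infection", "cause")] * 6
)
retained = frequent_associations(relations)
```

Only the pair observed five times survives: the pair seen three times falls below the cut-off, and the generic infection entity is excluded regardless of its frequency.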

Among the 24 MeSH disease categories, 'Infections [C01]' is the category in which the most abundant disease-microbe associations were found in the biomedical literature, followed by 'Pathological Conditions, Signs and Symptoms [C23]', 'Digestive System Diseases [C06]', 'Respiratory Tract Diseases [C08]', and 'Neoplasms [C04]' (Table 3). The five most frequent bacterial families were Enterobacteriaceae, Helicobacteraceae, Streptococcaceae, Mycobacteriaceae, and Staphylococcaceae, which constituted 16.52%, 13.11%, 8.13%, 6.51%, and 5.96% of bacteria-disease associations, respectively (Table 4).

Table 3

Number of microbe-disease relations clustered by MeSH disease categories.

Disease category	# of relations	# of microbes	# of diseases
Infections	12,250 (40.72%)	210	171
Pathological conditions, signs and symptoms	7540 (25.06%)	148	69
Digestive system diseases	5137 (17.07%)	77	50
Respiratory tract diseases	4209 (13.99%)	81	43
Neoplasms	2632 (8.75%)	62	26
Female urogenital diseases and pregnancy complications	1831 (6.09%)	63	34
Nervous system diseases	1632 (5.42%)	55	40
Cardiovascular diseases	1183 (3.93%)	60	24
Male urogenital diseases	1094 (3.64%)	35	23
Chemically-induced disorders	1046 (3.48%)	49	11

Table 4

Number of microbe-disease relations categorized by bacterial families.

Bacteria family	# of relations	# of bacteria	# of diseases
Enterobacteriaceae	4969 (16.52%)	17	139
Helicobacteraceae	3944 (13.11%)	7	90
Streptococcaceae	2446 (8.13%)	21	72
Mycobacteriaceae	1958 (6.51%)	15	48
Staphylococcaceae	1794 (5.96%)	8	77
Pseudomonadaceae	1669 (5.55%)	5	66
Pasteurellaceae	1194 (3.97%)	15	66
Chlamydiaceae	933 (3.1%)	6	48
Peptostreptococcaceae	682 (2.27%)	3	19
Streptomycetaceae	658 (2.19%)	2	50
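The percentages in Tables 3 and 4 appear to be shares of the 30,085 retained associations. The short sketch below recomputes a few of them under that assumption; `share` is a hypothetical helper name.

```python
TOTAL_ASSOCIATIONS = 30085  # retained associations reported in the text

def share(count, total=TOTAL_ASSOCIATIONS):
    """Percentage of all retained associations, rounded to two decimals."""
    return round(100 * count / total, 2)

infections = share(12250)          # Table 3, Infections
enterobacteriaceae = share(4969)   # Table 4, Enterobacteriaceae
streptomycetaceae = share(658)     # Table 4, Streptomycetaceae
```

The recomputed shares match the tabulated 40.72%, 16.52%, and 2.19%. The category shares in Table 3 add up to more than 100% because a disease can belong to several MeSH categories, as noted for pneumonia and cystic fibrosis.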

In the Infections category [C01], a total of 12,250 relations were extracted, in which 210 bacteria and 171 diseases were involved. The most frequent disease was pneumonia, which also belongs to another MeSH category, respiratory tract diseases. The species frequently associated with pneumonia were Streptococcus pneumoniae in 363 relations, Pseudomonas aeruginosa in 148, Staphylococcus aureus in 138, and Mycoplasma pneumoniae in 107. The other frequent diseases were tuberculosis, sepsis, and bacteremia, found in 1050, 911, and 637 relations with 15, 47, and 38 bacteria, respectively.

In Digestive System Diseases [C06], 5137 relations involving 77 bacteria and 50 diseases were extracted from the literature (Fig. 7). The disease with the most abundant microbes was cystic fibrosis, which is also associated with the respiratory tract diseases and genetic diseases categories in MeSH. Since its physiology is related to the pancreas and intestine in addition to lung infection⁵⁵, diverse roles and effects of bacteria have been studied. Cystic fibrosis had 735 relationships with 16 bacteria. The other frequent diseases were stomach neoplasms, gastritis, and gastroenteritis with 650, 558, and 470 relations, respectively. The most frequent bacteria were Helicobacter pylori in 2081 relations, Escherichia coli in 463, Pseudomonas aeruginosa in 451, and Clostridium difficile in 244.

Figure 7

Network for digestive system diseases. A disease node (circle) and a microbe node (square) are connected by an edge when more than four relations are extracted. The size of a node is proportional to the number of extracted relations involving the disease or microbe, and the color of a node represents the top-level MeSH disease categories to which the disease belongs. The width of an edge is proportional to the frequency at which the relation is extracted.
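The construction described in the caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `build_network`, the input format (one `(disease, microbe)` tuple per extracted relation), and the toy labels are assumptions. It also assumes that node sizes are summed over the retained edges only, which the caption does not state explicitly.

```python
from collections import Counter

def build_network(relations, min_count=4):
    """Build a disease-microbe network from extracted relations.

    relations: iterable of (disease, microbe) tuples, one per extracted
               relation (hypothetical input format).
    min_count: an edge is kept only when MORE than this many relations
               were extracted for the pair (four, following the caption).
    """
    # Edge weight = how often the (disease, microbe) relation was extracted.
    edge_counts = Counter(relations)
    edges = {pair: n for pair, n in edge_counts.items() if n > min_count}

    # Node size = number of relations on retained edges (an assumption).
    node_sizes = Counter()
    for (disease, microbe), n in edges.items():
        node_sizes[("disease", disease)] += n
        node_sizes[("microbe", microbe)] += n
    return edges, node_sizes

# Toy input with placeholder labels and made-up counts.
toy_relations = (
    [("disease_x", "microbe_a")] * 5    # kept: more than four relations
    + [("disease_x", "microbe_b")] * 4  # dropped: exactly four relations
)
edges, node_sizes = build_network(toy_relations)
```

With this toy input, only the `disease_x`/`microbe_a` edge survives the threshold, with weight 5. The edge weights and node sizes could then be passed to any graph drawing tool as edge widths and node areas.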

In Respiratory Tract Diseases [C08], there were a total of 4209 relations that involved 43 diseases and 81 bacteria from 30 bacterial families. The two diseases most highly associated with microbes were pneumonia and cystic fibrosis, which were the most abundant in the infection and digestive disease categories, respectively. The other frequent diseases were lung diseases in 307 relationships and respiratory insufficiency in 246 relationships. The most frequent bacteria were Streptococcus pneumoniae in 502 relationships, Pseudomonas aeruginosa in 346, and Haemophilus influenzae in 243.

Disease-disease relationship based on shared bacteria

To investigate the similarity between diseases with respect to shared bacteria, the Jaccard index was applied. The higher the Jaccard index, the stronger the relation between the two diseases in terms of the bacteria involved. For the similarity calculation, diseases associated with only one bacterium were excluded, which can provide more reliable pairs of diseases with common bacteria. As a result, the similarity of 8958 pairs of diseases was calculated from the 230 diseases retained, ranging from 1 to 100%. Figure 8 shows a disease–disease network of diseases with a Jaccard index of 60% or more. The network consisted of 71 diseases, and 89 pairs of diseases shared more than 60% of microbes.
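As a minimal sketch of this calculation, the Jaccard index of two diseases is the number of bacteria they share divided by the number of bacteria associated with either one. The function names, the dictionary input format, and the toy labels below are assumptions made for illustration; the paper does not show the authors' own code here. The sketch also assumes that the exclusion rule drops any disease linked to fewer than two bacteria.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_disease_pairs(disease_to_bacteria, min_bacteria=2, threshold=0.6):
    """Return {(disease1, disease2): jaccard} for pairs at or above threshold.

    Diseases with fewer than `min_bacteria` associated bacteria are excluded
    before pairing (assumed reading of the exclusion rule in the text).
    """
    kept = {
        disease: set(bacteria)
        for disease, bacteria in disease_to_bacteria.items()
        if len(set(bacteria)) >= min_bacteria
    }
    pairs = {}
    for d1, d2 in combinations(sorted(kept), 2):
        score = jaccard(kept[d1], kept[d2])
        if score >= threshold:
            pairs[(d1, d2)] = score
    return pairs

# Toy input with placeholder labels, not data from the study.
toy = {
    "disease_a": {"b1", "b2", "b3", "b4"},
    "disease_b": {"b1", "b2", "b3", "b4", "b5"},
    "disease_c": {"b1"},        # excluded: only one bacterium
    "disease_d": {"b5", "b6"},
}
similar = similar_disease_pairs(toy)
# similar == {("disease_a", "disease_b"): 0.8}
```

In the toy input, `disease_a` and `disease_b` share four of five bacteria in total, giving a Jaccard index of 4/5 = 80%, so they would be joined by an edge in a network thresholded at 60%.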

Figure 8

Disease similarity network. The network shows Jaccard similarities among diseases calculated from the relevant microbes. The size of a node is proportional to the number of microbes involved. The color of a node represents the top-level MeSH disease categories to which the disease belongs. Disease nodes are connected by an edge if the Jaccard similarity is 60% or more. The width of an edge is proportional to the similarity between the nodes.

In the similarity network with a Jaccard index of 60% or more, the largest node is periodontitis, related to 10 bacteria, followed by sinusitis, otitis, and tumour invasiveness. Most diseases show a high Jaccard index with diseases in the same MeSH categories. In Fig. 8a, for example, all diseases except chlamydia infections belonged to cardiovascular diseases. All diseases of the sub-network were related to Chlamydophila pneumoniae, and the cardiovascular diseases among them were also related to Helicobacter pylori. Figure 8b shows the similarities between respiratory tract diseases and otorhinolaryngologic diseases. The diseases in the sub-network belonged to different categories, but all of them were associated with Streptococcus pneumoniae and Haemophilus influenzae. In particular, respiratory tract infections and otitis media showed a high Jaccard similarity of 80%, despite belonging to different categories. They shared relationships with four bacteria: Streptococcus pneumoniae, Haemophilus influenzae, Moraxella catarrhalis, and Pseudomonas aeruginosa.

Conclusion

In this article, we introduced a process that combines natural language processing and machine learning methods to analyze relations between diseases and microbes. A hierarchical LSTM model with six layers was proposed to determine whether a sentence describes a relationship between a microbe and a disease. For sentences that were determined to have relations, two different parsing methods extracted relation words. Both results were combined using an ensemble model based on Bayes' theorem. Our model not only detected the relationship between the diseases and microbes but also predicted the relation word between them. Evaluation of the results showed that our process achieved F-scores of 0.8764 and 0.8524 in binary decisions and extracting relation words, respectively. As a case study, we performed a large-scale analysis of the relationship between microbes and diseases. Additionally, a set of common microbes shared by multiple diseases was identified in this study. This investigation could provide information on the major microbes that are found or studied for a specific disease. Several databases of microbe-disease association are currently available, but they are based on the analysis of only a limited number of publications. Our method represents the first systematic approach to finding microbe-disease relations in scientific articles by using an entire process from named entity recognition to relation word extraction. This approach allows a large-scale analysis of microbe-disease associations with detailed information described in the literature.

Supplementary Information

Acknowledgements

This work was supported by the Bio & Medical Technology Development Program of the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (2017M3A9F3041232), an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (MSIT) [No. 2020-0-01373, Artificial Intelligence Graduate School Program (Hanyang University)], and the Collaborative Genome Program of the Korea Institute of Marine Science and Technology Promotion (KIMST), funded by the Ministry of Oceans and Fisheries (MOF) [No. 20180430].

Author contributions

Y.P., J.L., and H.M. implemented the system and performed the evaluation. Y.P. applied the system to analyze the results. Y.C. and M.R. designed the study, performed the analysis, and supervised the study. All authors wrote the manuscript.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Yong Suk Choi, Email: cys@hanyang.ac.kr.

Mina Rho, Email: minarho@hanyang.ac.kr.

Supplementary Data

The online version contains supplementary material available at 10.1038/s41598-021-83966-8.

References

1. Shoemark DK, Allen SJ. The microbiome and disease: reviewing the links between the oral microbiome, aging, and Alzheimer's disease. J. Alzheimer's Dis. 2015;43(3):725–738. doi: 10.3233/JAD-141170.

2. Jie Z, et al. The gut microbiome in atherosclerotic cardiovascular disease. Nat. Commun. 2017;8(1):1–12. doi: 10.1038/s41467-017-00900-1.

3. Vatanen T, et al. The human gut microbiome in early-onset type 1 diabetes from the TEDDY study. Nature. 2018;562(7728):589–594. doi: 10.1038/s41586-018-0620-2.

4. Laureano AC, Schwartz RA, Cohen PJ. Facial bacterial infections: folliculitis. Clin. Dermatol. 2014;32(6):711–714. doi: 10.1016/j.clindermatol.2014.02.009.

5. Jorth P, et al. Metatranscriptomics of the human oral microbiome during health and disease. mBio. 2014;5(2):e01012–e1014. doi: 10.1128/mBio.01012-14.

6. Zhao, Y., Wang, C.-C., & Chen, X. Microbes and complex diseases: from experimental results to computational models. Brief. Bioinform. (2020).

8. Desvarieux M, et al. Periodontal microbiota and carotid intima-media thickness: the oral infections and vascular disease epidemiology study (INVEST). Circulation. 2005;111(5):576–582. doi: 10.1161/01.CIR.0000154582.37101.15.

9. Lukens JR, et al. Dietary modulation of the microbiome affects autoinflammatory disease. Nature. 2014;516(7530):246–249. doi: 10.1038/nature13788.

10. Ishigaki K, et al. A case of Streptococcus suis endocarditis, probably bovine-transmitted, complicated by pulmonary embolism and spondylitis. Kansenshogaku Zasshi. 2009;83(5):544–548. doi: 10.11150/kansenshogakuzasshi.83.544.

11. Ma W, et al. An analysis of human microbe-disease associations. Brief. Bioinform. 2017;18(1):85–97. doi: 10.1093/bib/bbw005.

12. Forster SC, et al. HPMCD: the database of human microbial communities from metagenomic datasets and microbial reference genomes. Nucleic Acids Res. 2016;44(D1):D604–D609. doi: 10.1093/nar/gkv1216.

13. Cheng L, et al. gutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Res. 2020;48(D1):D554–D560. doi: 10.1093/nar/gkz843.

14. Chen X, et al. A novel approach based on KATZ measure to predict associations of human microbiota with non-infectious diseases. Bioinformatics. 2017;33(5):733–739.

15. Huang ZA, et al. PBHMDA: path-based human microbe-disease association prediction. Front. Microbiol. 2017;8:233.

16. Huang YA, et al. Prediction of microbe–disease association from the integration of neighbor and graph with collaborative recommendation model. J. Transl. Med. 2017;15(1):209. doi: 10.1186/s12967-017-1304-7.

17. Wang F, et al. LRLSHMDA: laplacian regularized least squares for human microbe-disease association prediction. Sci. Rep. 2017;7(1):7601. doi: 10.1038/s41598-017-08127-2.

18. Wang L, et al. A bidirectional label propagation based computational model for potential microbe-disease association prediction. Front. Microbiol. 2019;10:684. doi: 10.3389/fmicb.2019.00684.

19. Yan C, et al. BRWMDA: predicting microbe-disease associations based on similarities and bi-random walk on disease and microbe networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020;17(5):1595–1604.

20. Leaman, R., & Gonzalez, G. BANNER: an executable survey of advances in biomedical named entity recognition. In Pacific Symposium on Biocomputing. 652–63 (2008).

21. Chiu JP, Nichols E. Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 2016;4:357–370. doi: 10.1162/tacl_a_00104.

22. Leaman R, Islamaj Dogan R, Lu Z. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics. 2013;29(22):2909–2917. doi: 10.1093/bioinformatics/btt474.

23. Lee, H.C., Y.Y. Hsu, and H.Y. Kao, AuDis: an automated CRF-enhanced disease normalization in biomedical text. Database (Oxford) (2016).

24. Leaman R, Lu Z. TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics. 2016;32(18):2839–2846. doi: 10.1093/bioinformatics/btw343.

25. Gu, Y., et al. Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint https://arxiv.org/abs/2007.15779 (2020).

26. Sutton C, McCallum A. An introduction to conditional random fields. Found. Trends Mach. Learn. 2012;4(4):267–373. doi: 10.1561/2200000013.

27. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–1780. doi: 10.1162/neco.1997.9.8.1735.

28. Bai B, et al. Learning to rank with (a lot of) word features. Inform. Retr. 2010;13(3):291–314. doi: 10.1007/s10791-009-9117-9.

29. Ashburner M, et al. Gene ontology: tool for the unification of biology. Gene Ontology Consortium. Nat. Genet. 2000;25(1):25–29.

30. Liu H, et al. BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics. 2006;22(1):103–105. doi: 10.1093/bioinformatics/bti749.

31. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(1):D267–D270. doi: 10.1093/nar/gkh061.

33. Davis AP, et al. Comparative toxicogenomics database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res. 2009;37(Database issue):D786–D792. doi: 10.1093/nar/gkn580.

34. Doğan RI, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inform. 2014;47:1–10. doi: 10.1016/j.jbi.2013.12.006.

35. Li, J., et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (2016).

36. Fundel K, Kuffner R, Zimmer R. RelEx–relation extraction using dependency parse trees. Bioinformatics. 2007;23(3):365–371. doi: 10.1093/bioinformatics/btl616.

37. Lim KMK, et al. @MInter: automated text-mining of microbial interactions. Bioinformatics. 2016;32(19):2981–2987. doi: 10.1093/bioinformatics/btw357.

38. Zhao Z, et al. Drug drug interaction extraction from biomedical literature using syntax convolutional neural network. Bioinformatics. 2016;32(22):3444–3453.

39. Zhao Z, et al. A protein-protein interaction extraction approach based on deep neural network. Int. J. Data Min. Bioinform. 2016;15(2):145–164. doi: 10.1504/IJDMB.2016.076534.

40. Zhang Y, et al. Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics. 2018;34(5):828–835. doi: 10.1093/bioinformatics/btx659.

41. Weinzierl MA, Maldonado R, Harabagiu SM. The impact of learning unified medical language system knowledge embeddings in relation extraction from biomedical texts. J. Am. Med. Inform. Assoc. 2020;27(10):1556–1567. doi: 10.1093/jamia/ocaa205.

42. Suarez-Paniagua V, et al. A two-stage deep learning approach for extracting entities and relationships from medical texts. J. Biomed. Inform. 2019;99:103285. doi: 10.1016/j.jbi.2019.103285.

43. Xu D, et al. DTMiner: identification of potential disease targets through biomedical literature mining. Bioinformatics. 2016;32(23):3619–3626.

44. Kim J, Kim JJ, Lee H. An analysis of disease-gene relationship from Medline abstracts by DigSee. Sci. Rep. 2017;7:40154. doi: 10.1038/srep40154.

45. Warikoo, N., Chang, Y. C., & Hsu, W. L. LBERT: Lexically-aware transformers based bidirectional encoder representation model for learning universal bio-entity relations. Bioinformatics (2020).

46. Brbic M, et al. The landscape of microbial phenotypic traits and associated genes. Nucleic Acids Res. 2016;44(21):10074–10090.

47. Herrero-Zazo M, et al. The DDI corpus: an annotated corpus with pharmacological substances and drug–drug interactions. J. Biomed. Inform. 2013;46(5):914–920. doi: 10.1016/j.jbi.2013.07.011.

48. Segura-Bedmar I, Martinez P, Herrero-Zazo M. Lessons learnt from the DDIExtraction-2013 shared task. J. Biomed. Inform. 2014;51:152–164. doi: 10.1016/j.jbi.2014.05.007.

49. Xiao, M., & Liu, C. Semantic relation classification via hierarchical recurrent neural network with attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (2016).

50. Mikolov, T., et al. Efficient estimation of word representations in vector space. arXiv preprint https://arxiv.org/abs/1301.3781 (2013).

51. Moen, S., & Ananiadou, T. S. S. Distributional semantics resources for biomedical text processing. In Proceedings of LBM. 39–44 (2013).

52. Wang, L., et al. Relation classification via multi-level attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016.

53. Choi YS. TPEMatcher: a tool for searching in parsed text corpora. Knowl. Based Syst. 2011;24(8):1139–1150. doi: 10.1016/j.knosys.2011.04.009.

54. Manning, C., et al. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2014).

55. Davis PB. Cystic fibrosis since 1938. Am. J. Respir. Crit. Care Med. 2006;173(5):475–482. doi: 10.1164/rccm.200505-840OE.

56. Lim S, Lee K, Kang J. Drug drug interaction extraction from the literature using a recursive neural network. PLoS ONE. 2018;13(1):e0190926. doi: 10.1371/journal.pone.0190926.

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7904816/