Recall DNA methylation levels at low coverage sites using a CNN model in WGBS

Ximei Luo; Yansu Wang; Quan Zou; Lei Xu

doi:10.1371/journal.pcbi.1011205

Abstract

DNA methylation is an important regulator of gene transcription. WGBS is the gold-standard approach for quantifying DNA methylation at base-pair resolution, but it requires high sequencing depth. Many CpG sites have insufficient coverage in WGBS data, resulting in inaccurate DNA methylation levels at individual sites. Many state-of-the-art computational methods have been proposed to predict the missing values. However, many of them require either other omics datasets or cross-sample data, and most of them predict only the state of DNA methylation. In this study, we propose RcWGBS, which can impute missing (or low-coverage) values from the DNA methylation levels of flanking sites. Deep learning techniques were employed for accurate prediction. The WGBS datasets of H1-hESC and GM12878 were down-sampled. The average difference between the DNA methylation level at 12× depth predicted by RcWGBS and that at >50× depth is less than 0.03 in H1-hESC cells and less than 0.01 in GM12878 cells. RcWGBS performed better than METHimpute even though the sequencing depth was as low as 12×. Our work will help in processing methylation data of low sequencing depth. It helps researchers save sequencing costs and improve data utilization through computational methods.

Author summary

DNA methylation has a major impact on gene regulation. WGBS is the gold standard for investigating DNA methylation. However, the DNA methylation levels of sites with low coverage are often inaccurate in WGBS datasets. Therefore, we propose a CNN-based method, named RcWGBS, to interpolate the DNA methylation level at specific sites. RcWGBS does not rely on other omics data or cross-sample data. It uses only the sites with sufficient coverage in the target WGBS dataset to train the model. The trained model can then be used to predict the DNA methylation level of sites with low coverage. Our analyses showed that RcWGBS could recalibrate the methylation level of some CpGs with insufficient coverage, which suggests that it could benefit WGBS datasets with insufficient sequencing coverage. RcWGBS is implemented as an R package. It is efficient and convenient and does not need other WGBS or omics data.

Citation: Luo X, Wang Y, Zou Q, Xu L (2023) Recall DNA methylation levels at low coverage sites using a CNN model in WGBS. PLoS Comput Biol 19(6): e1011205. https://doi.org/10.1371/journal.pcbi.1011205

Editor: Ilya Ioshikhes, CANADA

Received: September 30, 2022; Accepted: May 22, 2023; Published: June 14, 2023

Copyright: © 2023 Luo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All of the computer programs and scripts can be downloaded from https://github.com/TracyHIT/RcWGBS/.

Funding: The work was supported in part by the National Natural Science Foundation of China (62250028, 62131004, to Q.Z.; 62202315 to X.L.), the Sichuan Provincial Science Fund for Distinguished Young Scholars (2021JDJQ0025 to Q.Z.), the Municipal Government of Quzhou (2022D040 to Q.Z.), the China Postdoctoral Science Foundation (2022M720662 to X.L.), the Foundation Project of Shenzhen Polytechnic (6022330002K to X.L.) and the Special Project in Key Field of Department of Education of Guangdong Province (2022ZDZX2082 to L.X.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

This is a PLOS Computational Biology Methods paper.

Introduction

Cytosine methylation is a widely conserved epigenetic mark with important roles in many biological regulatory processes, such as cell differentiation and development, and in many diseases [1–7]. It is a covalent chemical modification that can alter and downregulate gene expression by stably affecting transcription factor binding [8–11]. Most DNA methylation occurs at CpG dinucleotides [12–18]. Whole-genome bisulfite sequencing (WGBS) is a next-generation sequencing method that can detect and quantify DNA methylation genome-wide at single-base resolution [19–21]. The application of this technology has been instrumental in dissecting the molecular pathways by which DNA methylation controls gene expression dynamics by steering transcription factors. However, deep inter- and intraspecific WGBS measurements remain cost prohibitive, particularly for species with large genomes [22–24]. The NIH Roadmap Epigenomics Project currently recommends that WGBS have at least 30× coverage with two replicates (http://www.roadmapepigenomics.org/protocols). Many published methylomes have therefore been sequenced far below saturation (i.e., a large number of cytosines in the genome are not covered, or their coverage is less than 3). Even if the overall coverage is sufficient, there are still many sites with coverage of less than 3. For example, the combined coverages of the WGBS data of GM12878 and H1-hESC in the ENCODE [25] database are 59.58× and 54.08×, respectively. However, approximately 4% of the CpG sites have coverage ≤ 3. This effect accumulates and becomes more serious when multiple groups of WGBS data are combined for further analysis [26, 27].

There is currently much interest in calling DNA methylation in single cells and in nanotechnology [28, 29]. In single-cell methylomes, some sites are not covered by any reads in some single cells. The DNA methylation status of other cells and the DNA sequence can be used as features of a deep learning model to predict the methylation status [30–32]. However, calling DNA methylation in traditional bulk WGBS data also suffers from insufficient coverage. The lower the coverage of WGBS, the lower the accuracy of the DNA methylation level. Interpolation, smoothing, and missing-value filling methods have been proposed to solve this problem, including METHimpute [33]. METHimpute uses an HMM to interpolate the DNA methylation level, taking the total number of reads and the number of methylated reads at CpG sites across the entire genome as inputs. This method has been applied to plant genomes and has proved to be effective, but it requires the whole DNA methylation chain as input for model training and prediction. In addition, DNA methylation has sequence characteristics; for example, CpG-rich regions with a C+G content greater than 50% are mainly unmethylated [34–37]. A number of studies have predicted the methylation status of CpGs based on flanking sequences and TF binding motifs [37–39]. Wang et al. proposed DeepMethyl, which uses sequence and Hi-C data to predict the methylation state [40]. Wu et al. and Zhou et al. also proposed SVM-based methods to predict DNA methylation status from DNA sequences on their own benchmark datasets [41–47]. Using only DNA sequences to predict DNA methylation status can give good results, but this approach can only be applied to specific datasets. In practical applications, although DNA sequences are consistent across cells, DNA methylation levels differ, so other dynamic characteristics are required to predict DNA methylation levels dynamically [48–50].
Related methods are limited to predicting methylation states or rely on other omics data [42, 51–54]. Only METHimpute can be used to dynamically impute missing DNA methylation levels independently of other omics data, and it uses DNA methylation level chains. Here, we found that DNA sequence characteristics and methylation levels in flanking regions can both be used for imputation. Although the WGBS sequencing coverage of the sites to be predicted is low, the flanking sites with available coverage can be used to predict the methylation of low-coverage sites.

In this study, we downsampled the original data and compared the DNA methylation levels after sampling with the original DNA methylation levels. We found that the lower the coverage, the greater the difference in DNA methylation level (as shown in Fig 1B). To maximize the information contained in WGBS data and to facilitate cost-effective sequencing decisions for future studies, we developed RcWGBS, a convolutional neural network (CNN)-based imputation algorithm for the construction of base-pair resolution methylomes from WGBS data. The unique feature of this algorithm is its ability to impute the methylation level of cytosines with missing or uninformative coverage, thus yielding complete methylomes even from low-coverage WGBS datasets. We downsampled the WGBS data of two cell lines and then used RcWGBS to impute the DNA methylation levels of the sites with low coverage after sampling. The imputed DNA methylation levels were then compared with the raw unsampled values. This method can effectively improve the accuracy of the DNA methylation level at low-coverage sites.

Fig 1. The structure of the RcWGBS model and results by using the RcWGBS in H1-hESC and GM12878 datasets.

(A) The structure of the RcWGBS model. The DNA sequence and the DNA methylation levels were used as the input features. The 2-mer coding method was used to encode the flanking DNA sequence spanning 50 bp upstream and downstream of the site. The final input feature of RcWGBS was a data matrix with a length of 100, a width of 5, and a height of 1. (B) The lower the coverage, the greater the difference between the DNA methylation level in the down-sampled data and that in the original data. MAE denotes the mean absolute error. (C) Difference between the predicted and original DNA methylation levels under different features. The y-axes represent the mean absolute error. (D) The mean absolute error of the imputed methylation calls in down-sampled H1-hESC and GM12878 data could be reduced. The blue dots represent the difference between the DNA methylation level of the down-sampled dataset and that of the unsampled original dataset, while the yellow dots represent the difference between the DNA methylation level after RcWGBS interpolation and that of the unsampled original data. A total of 22 groups of data were compared here.

https://doi.org/10.1371/journal.pcbi.1011205.g001

Results

Conceptual overview

WGBS is an NGS-based method in which DNA is treated with sodium bisulfite before sequencing to convert unmethylated cytosines into uracils, which are ultimately read as thymines after PCR amplification. Hence, a cytosine in a bisulfite-treated read that maps to a cytosine in the reference genome provides evidence for methylation, while a thymine that maps to a cytosine does not. At a specific CpG site, the DNA methylation level is defined as the number of reads with methylated cytosines divided by the total number of reads covering that site. In actual experiments, some sites are not sufficiently covered, so their DNA methylation level cannot be calculated reliably. To overcome this limitation, we developed RcWGBS, a CNN-based approach to impute missing values from WGBS. The binding sites of transcription factors are affected by DNA methylation. Therefore, we can assume that, in a given cell state, the level of DNA methylation is related to the DNA sequence pattern. In addition, DNA methylation has spatial distribution characteristics [55, 56]. RcWGBS takes methylation level chains from Bismark or other tools as input and integrates them with DNA sequence information. The outputs are recalibrated methylation levels between 0 and 1 for every cytosine in the genome.

For the DNA sequence, we selected the sequence spanning 50 bp upstream and downstream of each site. We tested two coding methods, one-hot and 2-mer [57]. The 2-mer method carries more sequence information than the one-hot method and performed better, so it was used to encode DNA sequences in this study. Four bases give sixteen possible 2-bp subsequences, which can be represented by the numbers 0–15. Each number is then converted to binary, giving a vector of length 4 that contains only 0s and 1s (as shown in Fig 1A). Since DNA methylation levels are consistent across adjacent regions, the DNA methylation levels of the region adjacent to the predicted site can also be used as a feature. Here, the DNA methylation levels of 50 sites upstream and downstream of the site to be predicted were used. Finally, the input feature of the model was a data matrix with a length of 100, a width of 5, and a height of 1. RcWGBS is based on a convolutional neural network. The model performs the first feature extraction through a 5×5 two-dimensional convolution kernel. After pooling, two one-dimensional convolutions are performed to enhance feature extraction [58, 59]. Then, after the fully connected layers, a final output value between 0 and 1 is used to infer the DNA methylation level. The overall structure of the RcWGBS model is shown in Fig 1A.
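The 2-mer encoding and the resulting input matrix can be illustrated with a short Python sketch. This is not the RcWGBS implementation (which is an R package); the function names `encode_2mer` and `build_input_matrix` are hypothetical, and the 101-bp window (the site plus 50 bp on each side, giving 100 overlapping 2-mers) is an assumption made so that the sequence rows and the methylation row have the same length.

```python
# Sketch of the 2-mer encoding and 5 x 100 input matrix described above.
# Function names and the 101-bp window are assumptions, not taken from RcWGBS.

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2-bit code per nucleotide


def encode_2mer(kmer):
    """Return the 4-bit vector of a 2-mer, e.g. AA -> 0 -> [0, 0, 0, 0]."""
    value = BASE_CODE[kmer[0]] * 4 + BASE_CODE[kmer[1]]  # integer in 0..15
    return [(value >> shift) & 1 for shift in (3, 2, 1, 0)]


def build_input_matrix(sequence, flank_levels):
    """Stack 4 sequence-bit rows and 1 methylation row into a 5 x N matrix."""
    # Overlapping 2-mers with a step size of 1.
    kmers = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    if len(kmers) != len(flank_levels):
        raise ValueError("need one flanking methylation level per 2-mer")
    bits = [encode_2mer(kmer.upper()) for kmer in kmers]
    # Rows 0-3 hold the four bits of every 2-mer; row 4 holds methylation.
    rows = [[column[r] for column in bits] for r in range(4)]
    rows.append([float(level) for level in flank_levels])
    return rows


if __name__ == "__main__":
    window = "ACGT" * 25 + "A"   # toy 101-bp window -> 100 overlapping 2-mers
    levels = [0.5] * 100         # toy flanking methylation levels
    matrix = build_input_matrix(window, levels)
    print(len(matrix), len(matrix[0]))
```

The sketch raises a `KeyError` on ambiguous bases such as N; how such positions are handled in RcWGBS is not stated in the text.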

Feature combination and selection of DNA sequence representation

For the data used to build the model, we first downsampled the reads of the GM12878 and H1-hESC datasets after WGBS alignment, sampling 90%, 70%, 50%, 30%, and 10% of the reads, respectively [60]. The coverage after sampling is shown in Table 1, and the minimum coverage was 2.54× (per cytosine, double-stranded). We used three feature sets as model input: the DNA methylation chain alone, the DNA methylation chain combined with the one-hot-encoded DNA sequence, and the DNA methylation chain combined with the 2-mer-encoded DNA sequence. In addition, in practice, the average methylation level of the adjacent sites on both sides is often used as the methylation level of the site to be estimated. We compared the predictions of the CNN model under these three feature sets with this average of the adjacent sites. The mean absolute error between the predicted results and the results from unsampled data was used as the evaluation index. The mean absolute error (MAE) is defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| m_i - m'_i \right|$$

where m_i and m′_i represent the true and predicted values of DNA methylation at the ith CpG site, respectively, and n is the number of CpG sites evaluated.
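The neighbor-averaging baseline and the MAE described above can be sketched in a few lines of Python. The function names are hypothetical, and using only the single nearest site on each side is an assumption, since the text does not state how many adjacent sites enter the average.

```python
# Sketch of the neighbour-average baseline and the MAE evaluation index.
# Assumption: one nearest neighbour on each side; names are illustrative.


def neighbour_mean(levels, index):
    """Estimate the level at `index` as the mean of its adjacent sites."""
    neighbours = []
    if index > 0:
        neighbours.append(levels[index - 1])
    if index < len(levels) - 1:
        neighbours.append(levels[index + 1])
    if not neighbours:
        raise ValueError("site has no neighbours")
    return sum(neighbours) / len(neighbours)


def mean_absolute_error(true_levels, predicted_levels):
    """MAE = (1/n) * sum(|m_i - m'_i|) over paired CpG sites."""
    if len(true_levels) != len(predicted_levels) or not true_levels:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_levels, predicted_levels))
    return total / len(true_levels)


if __name__ == "__main__":
    chain = [0.9, 0.0, 0.7]              # toy methylation chain
    estimate = neighbour_mean(chain, 1)  # baseline estimate for the middle site
    print(estimate, mean_absolute_error([0.8], [estimate]))
```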

Table 1. Coverage of down-sampled data.

https://doi.org/10.1371/journal.pcbi.1011205.t001

We selected the training data based on the coverage statistics. DNA methylation levels at CpG sites with coverage between the median and the third quartile were considered relatively accurate, and these loci were selected for model training. In the process of selecting the feature representation, 100,000 sites were used as the training set. The independent test dataset consisted of 100,000 randomly selected sites that were not included in the training set. The results are shown in Fig 1C. We found that adding DNA sequence features improved prediction significantly compared with using only the neighboring DNA methylation levels. The 2-mer-encoded DNA sequence combined with the DNA methylation chain gave the best results as an input feature. Therefore, the final input features of the model were the 2-mer-encoded DNA sequence and the DNA methylation chain. All of the computer programs and scripts can be downloaded from https://github.com/TracyHIT/RcWGBS/.
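The coverage-based selection of training sites can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `select_training_sites`, the inclusive bounds, and the quartile method (the default of Python's `statistics.quantiles`) are all assumptions.

```python
# Sketch: pick training sites whose coverage lies between the median and
# the third quartile. Names, inclusive bounds, and quartile method are
# assumptions made for illustration.
import random
import statistics


def select_training_sites(coverage_by_site, n_train, seed=0):
    """Randomly sample up to n_train sites with median <= coverage <= Q3."""
    coverages = list(coverage_by_site.values())
    _q1, median, q3 = statistics.quantiles(coverages, n=4)
    eligible = sorted(
        site for site, cov in coverage_by_site.items() if median <= cov <= q3
    )
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(eligible, min(n_train, len(eligible)))


if __name__ == "__main__":
    # Toy data: 20 sites on chr1 with coverage equal to their position.
    toy = {("chr1", pos): pos for pos in range(1, 21)}
    print(select_training_sites(toy, n_train=3))
```

In the paper, 100,000 sites are drawn this way; the held-out test sites would then be sampled from the sites not returned by this function.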

Imputation of the downsampled H1-hESC and GM12878 methylomes

To demonstrate the performance of RcWGBS, we analyzed WGBS data with different coverages. Using the coverage statistics, 100,000 CpGs with coverage between the median and the third quartile were randomly selected as the training set. Then, the other sites with insufficient coverage (coverage of less than three) were interpolated. We found that RcWGBS could produce high-quality interpolation and correction for methylation calls at different coverages. We measured the changes in the methylation levels of CpG sites with coverage less than three in the downsampled data but greater than ten in the unsampled data. On these CpG sites with insufficient coverage, the MAE of the imputed methylation calls in downsampled H1-hESC and GM12878 data was reduced. As shown in Fig 1D, each point represents the MAE of one chromosome. Only the sites with insufficient coverage (coverage of less than three) were counted. In the H1-hESC dataset, the mean absolute error in the DNA methylation level between the downsampled data and the original data was greater than 0.158, while in the GM12878 dataset, it was greater than 0.226. This difference was significantly reduced after using RcWGBS in the two datasets. Among the sites compared (sites with coverage lower than three in the down-sampled data), as the sequencing depth increased, the difference between the coverage in the down-sampled data and that in the original data became larger. Therefore, the error between the DNA methylation level obtained from the sampled data and that from the original data gradually increased (as shown by the blue dots in Fig 1D). With RcWGBS prediction, the MAE shrank by 0.037 and 0.065 on average in GM12878 and H1-hESC, respectively, compared with the downsampled datasets. Other results were better than these, collectively demonstrating the effectiveness of RcWGBS.
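The per-chromosome evaluation described above (sites with coverage below three after downsampling but above ten in the original data) can be sketched as below. The record layout and the function name `per_chromosome_mae` are hypothetical and chosen only for illustration.

```python
# Sketch: per-chromosome MAE restricted to sites that are poorly covered
# after downsampling but well covered in the original data.
# The dictionary keys used for each site are illustrative assumptions.
from collections import defaultdict


def per_chromosome_mae(sites, low_max=3, high_min=10):
    """Return {chromosome: MAE} over sites with down_cov < 3 and orig_cov > 10."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for site in sites:
        if site["down_cov"] < low_max and site["orig_cov"] > high_min:
            sums[site["chrom"]] += abs(site["orig_level"] - site["est_level"])
            counts[site["chrom"]] += 1
    return {chrom: sums[chrom] / counts[chrom] for chrom in counts}


if __name__ == "__main__":
    toy_sites = [
        {"chrom": "chr1", "down_cov": 2, "orig_cov": 40,
         "orig_level": 0.8, "est_level": 0.5},
        {"chrom": "chr1", "down_cov": 5, "orig_cov": 50,
         "orig_level": 1.0, "est_level": 0.0},  # skipped: coverage still >= 3
    ]
    print(per_chromosome_mae(toy_sites))
```

Here `est_level` can hold either the downsampled level or an imputed level, so the same function gives both sets of points that are compared in Fig 1D.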

Comparison with METHimpute and BSmooth

For WGBS data, METHimpute and BSmooth have been proposed. METHimpute is an HMM-based method for inferring DNA methylation from insufficient sequencing. BSmooth is a popular smoothing-based method. We used METHimpute and BSmooth to interpolate the methylation levels of the downsampled data at base-pair resolution. The number of methylated reads and the total number of reads at every site were input into METHimpute and BSmooth. METHimpute assumes that there are two distributions of DNA methylation levels and re-estimates each site’s methylation level. In the downsampled data, the DNA methylation level at sites with high coverage was more accurate. Therefore, the sites with sufficient coverage were used to train the CNN model, and only the sites with low coverage were re-estimated by RcWGBS. As shown in Fig 2A and listed in S1 Table, in the H1-hESC dataset, the mean absolute error of the DNA methylation level between the original data and the downsampled data estimated by METHimpute and BSmooth was greater than 0.05, while in GM12878, the mean absolute error was greater than 0.14 (as listed in S2 Table). This difference was significantly reduced with RcWGBS compared with METHimpute. With RcWGBS, the mean absolute errors were reduced to less than 0.01 and 0.05 in the H1-hESC and GM12878 datasets, respectively. To evaluate prediction accuracy, we also calculated the Pearson’s correlation coefficient between the raw unsampled data and the values predicted by RcWGBS and METHimpute. We found that when the coverage was too low, the correlation coefficient between the predicted and true values of RcWGBS decreased and was lower than that of METHimpute (as shown in Fig 2B). In the H1-hESC and GM12878 datasets, when the coverages were approximately 12.05× and 12.31×, respectively, the correlation coefficients between the values predicted by RcWGBS and the unsampled values were higher than those of METHimpute and BSmooth (as listed in S3 Table and S4 Table).
With increasing coverage, the accuracy of RcWGBS increased. The correlation coefficient of METHimpute was relatively stable; however, when the coverage was high, it was lower than that of RcWGBS. These results indicate that RcWGBS performs better than METHimpute.

Fig 2. Comparison with METHimpute and BSmooth.

(A) The mean absolute error of the DNA methylation level between raw unsampled data and predicted values from RcWGBS, METHimpute, and BSmooth, respectively. (B) The Pearson’s correlation coefficient between the raw unsampled data and predicted values from RcWGBS, METHimpute, and BSmooth, respectively.

https://doi.org/10.1371/journal.pcbi.1011205.g002

Discussion

WGBS is considered the "gold standard" for single-base resolution measurement of DNA methylation levels. However, WGBS often requires high sequencing depth, and some sites with insufficient coverage are observed in WGBS data. The DNA methylation levels of these sites are often inaccurate [61]. These sites affect downstream analyses, such as calling differentially methylated sites and identifying DNA methylation biomarkers for disease. Therefore, it is very important to obtain accurate DNA methylation levels at sites with insufficient coverage. A large number of studies have shown that the DNA methylation level has spatial distribution characteristics and DNA sequence characteristics, and that it is consistent with the DNA methylation level of flanking sites [34–37]. Therefore, a large number of methods have been proposed to predict the DNA methylation state or level, but most of them need other omics data [42, 51–54]. In addition, some methods use only DNA sequences to predict DNA methylation status on benchmark datasets [37–39]. As DNA methylation is dynamic, prediction methods that use no dynamic data seem unreasonable. Therefore, many methods cannot effectively predict DNA methylation levels in low-coverage WGBS datasets.

In 2018, the METHimpute method was proposed [33]. It uses an HMM based only on DNA methylation characteristics. In this work, RcWGBS combines DNA sequence and DNA methylation information and takes advantage of the CNN model for information extraction. It uses the DNA sequence and the DNA methylation levels on both sides of the site as features. RcWGBS does not require the entire DNA methylation data chain. When predicting a small number of sites, it only needs the DNA methylation levels and the DNA sequence on both sides of each site to be predicted. Through application to the H1-hESC and GM12878 datasets, we showed that RcWGBS performed better than METHimpute.

In addition, the METHimpute model considers only two states of DNA methylation. Therefore, the interpolated DNA methylation levels are mainly distributed in the two regions close to 0 and 1, resulting in some DNA methylation levels near 0.5 being overestimated or underestimated. Although the correlation coefficient of METHimpute is higher than that of RcWGBS in an extremely low-coverage profile, the MAE of RcWGBS is lower than that of METHimpute. In the RcWGBS model, when a large number of sites (>1 million) need to be predicted, the pre-processing time is long because the upstream and downstream DNA sequences of the sites to be predicted need to be extracted. To solve this problem, the reference genome can be used to extract the sequences on both sides of all CpG sites in advance. The R package download link provided in this article includes the data matrix of the 50-bp upstream and downstream sequences of all CpG sites of GRCh38.

Notably, the applicability of RcWGBS to single-cell WGBS data has been further investigated. By modifying the loss function and the optimization method used during model training, RcWGBS can be adapted into a classification model, enabling accurate prediction of DNA methylation in single-cell WGBS.

Materials and methods

Downsampling data preparation

The WGBS sequencing data used in this study were downloaded from ENCODE [25]. For the H1-hESC dataset, files ENCFF003FWN and ENCFF546TLK were downloaded. For the GM12878 dataset, files ENCFF857QML and ENCFF681ASN were downloaded. The four files were then randomly down-sampled with samtools to different degrees (90%, 70%, 50%, 30%, and 10%). The reads of the raw data were sampled randomly and directly. Downsampling paired-end sequencing files in this way cannot guarantee that reads are sampled as pairs, so the change in coverage did not exactly match the sampling ratio. The independent test dataset was constructed from sites other than those selected as training data. The training sites were randomly sampled from the whole genome, which ensured that the training samples were uniformly distributed across the genome. For the files produced after sampling, Bismark [62] was used to extract the DNA methylation levels. Then, the two replicates in the experiment were merged by combining the numbers of methylated and unmethylated reads in the two replicates. Since CpG is symmetric, the DNA methylation levels on the negative and positive strands were combined. The methylation value of each CpG site was quantified by m, the fraction of methylated reads over the total reads:

$$m = \frac{\mathrm{Meth\_C}}{\mathrm{Meth\_C} + \mathrm{Unmeth\_C}}$$

where Meth_C and Unmeth_C represent the numbers of methylated and unmethylated reads called by Bismark.
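The merging of replicates and strands and the resulting methylation level can be sketched as follows. The function names and the tuple layout are hypothetical; the sketch only assumes that each replicate and strand contributes a pair of methylated and unmethylated read counts for the same CpG.

```python
# Sketch: merge read counts across replicates and strands for one CpG,
# then compute m = Meth_C / (Meth_C + Unmeth_C).
# Names and the (meth, unmeth) tuple layout are illustrative assumptions.


def merge_counts(records):
    """Sum (methylated, unmethylated) read counts over replicates/strands."""
    meth = sum(record[0] for record in records)
    unmeth = sum(record[1] for record in records)
    return meth, unmeth


def methylation_level(meth, unmeth):
    """Fraction of methylated reads; None when the site has no reads."""
    total = meth + unmeth
    if total == 0:
        return None  # uncovered site: the level is undefined
    return meth / total


if __name__ == "__main__":
    # Toy counts: replicate 1 (+/- strand), replicate 2 (+/- strand).
    counts = [(3, 1), (2, 2), (0, 1), (1, 0)]
    print(methylation_level(*merge_counts(counts)))
```

Returning `None` for uncovered sites keeps them distinguishable from sites that are covered but fully unmethylated (level 0).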

CNN for methylation calling

A CNN model with multiple convolutional and pooling layers and two fully connected hidden layers was used to extract features from high-dimensional inputs. The model applied three convolutions, each followed by pooling. The kernel of the first convolution was 5×5 with a step size of 1. The kernels of the second and third convolutions were 3×1 with a step size of 1. The input was a 100-bp DNA sequence and a DNA methylation chain centered on the target CpG site. The DNA methylation chain consisted of the methylation levels of 100 CpGs upstream and downstream. From the DNA sequence, 100 2-mers were generated with a step size of 1, so that two adjacent 2-mers overlapped by one base. One-hot and 2-mer coding methods were used for DNA sequence representation. In the one-hot coding process, [[0,0,0,1], [0,0,1,0], [0,1,0,0], [1,0,0,0]] were used to encode the four nucleotides A, C, G, and T. In the 2-mer coding process, the binary representations of 0–3 were used for the four nucleotides, and each 2-mer was encoded by directly concatenating the binary codes of its two bases. As Fig 1A shows, this is equivalent to encoding AA, AC, …, TT as the binary representations of 0, 1, …, 15. Finally, for a specific target site, the input was a matrix s with 5 rows, 100 columns, and 1 channel. s was first transformed by a 2D convolutional layer, which computed the activation a_fi of a convolutional filter f at every position i in the matrix s:

$$a_{fi} = g\left(\sum_{l=1}^{L}\sum_{d=1}^{D} w_{f,l,d}\, s_{d,\,i+l}\right)$$

where g denotes the activation function of the layer.

Here, w_f is the weight matrix of convolutional filter f, with length L and width D. The input had one channel and 5 rows, so D was set to 5. The first convolution kernel was 5×5, so L was also set to 5. A pooling layer was used to summarize the activations of p adjacent neurons by their maximum value P:

$$P_{fi} = \max_{0 \le k < p} a_{f,\,i+k}$$

Here, p was set to 2 or 3. After the first convolution and pooling, the 2D convolutional layers reduced to 1D convolutional layers. Two fully connected layers were used in the model. The first fully connected layer converted a matrix of size 21×1×128 into a matrix of size 10×1×128, and the second fully connected layer mapped the matrix to the final predicted value of DNA methylation. The model was implemented in R and built using the “keras” package. The loss function of the CNN model was the mean squared error (MSE):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left( m_i - m'_i \right)^2$$

where m_i and m′_i are the experimental and predicted DNA methylation levels of the ith CpG site, respectively, and n is the number of CpG sites. The model parameters were fitted with the Adam algorithm. Because the DNA methylation levels of sites with sufficient coverage are more accurate, sites with appropriate coverage were used to train the model. We randomly selected 100,000 CpGs with coverage between the median and the third quartile as the training set. The number of epochs is essential in model training to prevent overfitting. Therefore, a validation dataset was split from the training set during training. In each epoch, the losses of the training set and the validation set were calculated. In addition, the mean absolute errors (MAE) of the training and validation sets were calculated. The smaller the value, the better the fit. After each epoch, a figure of the MSE and MAE was output, which made it convenient for users to choose the optimal parameters intuitively. The optimal number of epochs was then set according to the figure.
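As a worked check of the layer sizes, the following Python sketch traces the length of the feature map through three convolution and pooling stages. It assumes no padding, a stride of 1 for convolutions, and non-overlapping pooling with p = 2 at every stage; none of these settings is confirmed by the text, which only states that p was 2 or 3. The helper names are hypothetical.

```python
# Sketch: trace the feature-map length through conv -> pool stages.
# Assumptions (not confirmed by the paper): no padding, convolution
# stride 1, non-overlapping max pooling with p = 2 at every stage.


def conv_length(length, kernel, stride=1):
    """Output length of an unpadded convolution."""
    return (length - kernel) // stride + 1


def pool_length(length, size):
    """Output length of non-overlapping pooling (incomplete windows dropped)."""
    return length // size


def trace_lengths(input_length=100, kernels=(5, 3, 3), pool_size=2):
    """Return the length after the input and after every conv and pool step."""
    trace = [input_length]
    length = input_length
    for kernel in kernels:
        length = conv_length(length, kernel)
        trace.append(length)
        length = pool_length(length, pool_size)
        trace.append(length)
    return trace


if __name__ == "__main__":
    print(trace_lengths())
```

Under these assumed settings, the last two values of the trace are 21 and 10, which match the 21×1×128 and 10×1×128 sizes quoted above. The match may be coincidental, so the sketch should be read as a plausibility check rather than a description of the actual layers.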

Comparison with METHimpute and BSmooth

METHimpute used the number of methylated reads and the total number of reads covering each site as inputs. According to METHimpute’s user guide, the prediction results of sites with a posteriorMax greater than 0.98 were selected. BSmooth was run with default parameters. The predicted methylation level of each CpG site was compared with the DNA methylation level of the original WGBS data. We calculated the mean absolute error (MAE) and Pearson’s correlation coefficient (r) to evaluate the prediction accuracy:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| m_i - m'_i \right|$$

$$r = \frac{\frac{1}{n}\sum_{i=1}^{n}\left( m_i - \bar{m} \right)\left( m'_i - \bar{m}' \right)}{\sigma_m \, \sigma_{m'}}$$

where m_i and m′_i are the experimental and predicted DNA methylation levels of the ith CpG site, respectively, and n is the number of CpG sites. m̄ and m̄′ are the means of the experimental and predicted methylation levels. σ_m and σ_m′ are the standard deviations of m_i and m′_i.
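The confidence filter and the correlation computation can be sketched in plain Python. The function names and the tuple layout of the calls are hypothetical, and the sketch does not call METHimpute or BSmooth; it only shows a 0.98 threshold applied to a per-site posterior value and Pearson's r computed from paired levels.

```python
# Sketch: keep confident calls (posterior > 0.98) and compute Pearson's r.
# Names and the (site, level, posterior) layout are illustrative assumptions.
import math


def filter_confident(calls, threshold=0.98):
    """Keep (site, level) for calls whose maximum posterior exceeds threshold."""
    return [(site, level) for site, level, posterior in calls
            if posterior > threshold]


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of levels."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with >= 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    # The 1/n factors of covariance and standard deviations cancel out.
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    calls = [("cg1", 0.9, 0.99), ("cg2", 0.4, 0.70), ("cg3", 0.1, 0.995)]
    print(filter_confident(calls))
    print(pearson_r([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]))
```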

Supporting information

S1 Table. The mean absolute error of the DNA methylation level between raw unsampled data and predicted values from RcWGBS, METHimpute and BSmooth in GM12878.

https://doi.org/10.1371/journal.pcbi.1011205.s001

(XLSX)

S2 Table. The mean absolute error of the DNA methylation level between raw unsampled data and predicted values from RcWGBS, METHimpute and BSmooth in H1-hESC.

https://doi.org/10.1371/journal.pcbi.1011205.s002

(XLSX)

S3 Table. The Pearson correlation coefficient of the DNA methylation level between raw unsampled data and predicted values from RcWGBS, METHimpute and BSmooth in GM12878.

https://doi.org/10.1371/journal.pcbi.1011205.s003

(XLSX)

S4 Table. The Pearson correlation coefficient of the DNA methylation level between raw unsampled data and predicted values from RcWGBS, METHimpute and BSmooth in H1-hESC.

https://doi.org/10.1371/journal.pcbi.1011205.s004

(XLSX)

References

1. Morris BJ, Willcox BJ, Donlon TA. Genetic and epigenetic regulation of human aging and longevity. Biochim Biophys Acta Mol Basis Dis. 2019;1865(7):1718–44. pmid:31109447
2. Ahmed AA, Essa MEA. Potential of epigenetic events in human thyroid cancer. Cancer Genet. 2019;239:13–21. pmid:31472323
3. Baylin SB. Tying it all together: epigenetics, genetics, cell cycle, and cancer. Science. 1997;277(5334):1948–9. pmid:9333948
4. Tang W, Wan S, Yang Z, Teschendorff AE, Zou Q. Tumor origin detection with tissue-specific miRNA and DNA methylation markers. Bioinformatics. 2018;34(3):398–406. pmid:29028927
5. Zhang S, Zhang J, Zhang Q, Liang Y, Du Y, Wang G. Identification of Prognostic Biomarkers for Bladder Cancer Based on DNA Methylation Profile. Frontiers in cell and developmental biology. 2021;9:817086. pmid:35174173
6. Zhang S, Wang Y, Gu Y, Zhu J, Ci C, Guo Z, et al. Specific breast cancer prognosis-subtype distinctions based on DNA methylation patterns. Mol Oncol. 2018;12(7):1047–60. pmid:29675884
7. Yu L, Wang M, Yang Y, Xu F, Zhang X, Xie F, et al. Predicting therapeutic drugs for hepatocellular carcinoma based on tissue-specific pathways. PLoS Comput Biol. 2021;17(2):e1008696. pmid:33561121
8. Scarano MI, Strazzullo M, Matarazzo MR, D’Esposito M. DNA methylation 40 years later: Its role in human health and disease. J Cell Physiol. 2005;204(1):21–35. pmid:15648089
- View Article
- PubMed/NCBI
- Google Scholar
9. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Scholer A, et al. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature. 2011;480(7378):490–5. pmid:22170606
- View Article
- PubMed/NCBI
- Google Scholar
10. Zeng X, Tu X, Liu Y, Fu X, Su Y. Toward better drug discovery with knowledge graph. Current opinion in structural biology. 2022;72:114–26. pmid:34649044
- View Article
- PubMed/NCBI
- Google Scholar
11. Song B, Luo X, Luo X, Liu Y, Niu Z, Zeng X. Learning spatial structures of proteins improves protein–protein interaction prediction. Briefings in Bioinformatics. 2022;23(2):bbab558. pmid:35018418
- View Article
- PubMed/NCBI
- Google Scholar
12. Ramsahoye BH, Biniszkiewicz D, Lyko F, Clark V, Bird AP, Jaenisch R. Non-CpG methylation is prevalent in embryonic stem cells and may be mediated by DNA methyltransferase 3a. Proc Natl Acad Sci U S A. 2000;97(10):5237–42. pmid:10805783
- View Article
- PubMed/NCBI
- Google Scholar
13. Rivenbark AG, Stolzenburg S, Beltran AS, Yuan X, Rots MG, Strahl BD, et al. Epigenetic reprogramming of cancer cells via targeted DNA methylation. Epigenetics. 2012;7(4):350–60. pmid:22419067
- View Article
- PubMed/NCBI
- Google Scholar
14. Maunakea AK, Nagarajan RP, Bilenky M, Ballinger TJ, D’Souza C, Fouse SD, et al. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature. 2010;466(7303):253–7. pmid:20613842
- View Article
- PubMed/NCBI
- Google Scholar
15. Bock C, Paulsen M, Tierling S, Mikeska T, Lengauer T, Walter J. CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure. PLoS Genet. 2006;2(3):e26. pmid:16520826
- View Article
- PubMed/NCBI
- Google Scholar
16. Yalcin D, Otu HH. An Unbiased Predictive Model to Detect DNA Methylation Propensity of CpG Islands in the Human Genome. Current Bioinformatics. 2021;16(2):179–96.
- View Article
- Google Scholar
17. Teng Z, Zhao Z, Li Y, Tian Z, Guo M, Lu Q, et al. i6mA-Vote: Cross-Species Identification of DNA N6-Methyladenine Sites in Plant Genomes Based on Ensemble Learning With Voting. Frontiers in plant science. 2022;13:845835. pmid:35237293
- View Article
- PubMed/NCBI
- Google Scholar
18. Luo X, Wang F, Wang G, Zhao Y. Identification of methylation states of DNA regions for Illumina methylation BeadChip. BMC genomics. 2020;21(Suppl 1):672. pmid:32138668
- View Article
- PubMed/NCBI
- Google Scholar
19. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009;462(7271):315–22. pmid:19829295
- View Article
- PubMed/NCBI
- Google Scholar
20. Cao C, Wang J, Kwok D, Cui F, Zhang Z, Zhao D, et al. webTWAS: a resource for disease candidate susceptibility genes identified by transcriptome-wide association study. Nucleic acids research. 2021;50(D1):D1123–D30.
- View Article
- Google Scholar
21. Ao C, Zou Q, Yu L. RFhy-m2G: Identification of RNA N2-methylguanosine modification sites based on random forest and hybrid features. Methods (San Diego, Calif). 2021. pmid:34033879
- View Article
- PubMed/NCBI
- Google Scholar
22. Niederhuth CE, Bewick AJ, Ji L, Alabady MS, Kim KD, Li Q, et al. Widespread natural variation of DNA methylation within angiosperms. Genome Biol. 2016;17(1):194. pmid:27671052
- View Article
- PubMed/NCBI
- Google Scholar
23. Zhang SY, Zhang SW, Fan XN, Zhang T, Meng J, Huang Y. FunDMDeep-m6A: identification and prioritization of functional differential m6A methylation genes. Bioinformatics. 2019;35(14):i90–i8. pmid:31510685
- View Article
- PubMed/NCBI
- Google Scholar
24. Laurent L, Wong E, Li G, Huynh T, Tsirigos A, Ong CT, et al. Dynamic changes in the human methylome during differentiation. Genome Res. 2010;20(3):320–31. pmid:20133333
- View Article
- PubMed/NCBI
- Google Scholar
25. Consortium EP. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306(5696):636–40. pmid:15499007
- View Article
- PubMed/NCBI
- Google Scholar
26. Horvath S, Raj K. DNA methylation-based biomarkers and the epigenetic clock theory of ageing. Nat Rev Genet. 2018;19(6):371–84. pmid:29643443
- View Article
- PubMed/NCBI
- Google Scholar
27. Yang Q, Li B, Tang J, Cui X, Wang Y, Li X, et al. Consistent gene signature of schizophrenia identified by a novel feature selection strategy from comprehensive sets of transcriptomic data. Brief Bioinform. 2020;21(3):1058–68. pmid:31157371
- View Article
- PubMed/NCBI
- Google Scholar
28. Zuo Y, Song M, Li H, Chen X, Cao P, Zheng L, et al. Analysis of the Epigenetic Signature of Cell Reprogramming by Computational DNA Methylation Profiles. Current Bioinformatics. 2020;15(6):589–99.
- View Article
- Google Scholar
29. Li H, Gong Y, Liu Y, Lin H, Wang G. Detection of transcription factors binding to methylated DNA by deep recurrent neural network. Briefings in bioinformatics. 2022;23(1). pmid:34962264
- View Article
- PubMed/NCBI
- Google Scholar
30. Angermueller C, Lee HJ, Reik W, Stegle O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 2017;18(1):67. pmid:28395661
- View Article
- PubMed/NCBI
- Google Scholar
31. De Waele G, Clauwaert J, Menschaert G, Waegeman W. CpG Transformer for imputation of single-cell methylomes. Bioinformatics. 2022;38(3):597–603. pmid:34718418
- View Article
- PubMed/NCBI
- Google Scholar
32. Dodlapati S, Jiang Z, Sun J. Completing Single-Cell DNA Methylome Profiles via Transfer Learning Together With KL-Divergence. Front Genet. 2022;13:910439. pmid:35938031
- View Article
- PubMed/NCBI
- Google Scholar
33. Taudt A, Roquis D, Vidalis A, Wardenaar R, Johannes F, Colome-Tatche M. METHimpute: imputation-guided construction of complete methylomes from WGBS data. BMC Genomics. 2018;19(1):444. pmid:29879918
- View Article
- PubMed/NCBI
- Google Scholar
34. Tost J. DNA methylation: an introduction to the biology and the disease-associated changes of a promising biomarker. Mol Biotechnol. 2010;44(1):71–81. pmid:19842073
- View Article
- PubMed/NCBI
- Google Scholar
35. Lienert F, Wirbelauer C, Som I, Dean A, Mohn F, Schubeler D. Identification of genetic elements that autonomously determine DNA methylation states. Nat Genet. 2011;43(11):1091–7. pmid:21964573
- View Article
- PubMed/NCBI
- Google Scholar
36. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet. 2012;13(7):484–92. pmid:22641018
- View Article
- PubMed/NCBI
- Google Scholar
37. Santoni D. The impact of flanking sequence features on DNA CpG methylation. Comput Biol Chem. 2021;92:107480. pmid:33826970
- View Article
- PubMed/NCBI
- Google Scholar
38. Liu Z, Xiao X, Qiu WR, Chou KC. iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal Biochem. 2015;474:69–77. pmid:25596338
- View Article
- PubMed/NCBI
- Google Scholar
39. Whitaker JW, Chen Z, Wang W. Predicting the human epigenome from DNA motifs. Nat Methods. 2015;12(3):265–72, 7 p following 72. pmid:25240437
- View Article
- PubMed/NCBI
- Google Scholar
40. Wang Y, Liu T, Xu D, Shi H, Zhang C, Mo YY, et al. Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks. Sci Rep. 2016;6:19598. pmid:26797014
- View Article
- PubMed/NCBI
- Google Scholar
41. Zhou X, Li Z, Dai Z, Zou X. Prediction of methylation CpGs and their methylation degrees in human DNA sequences. Comput Biol Med. 2012;42(4):408–13. pmid:22209047
- View Article
- PubMed/NCBI
- Google Scholar
42. Wu C, Yao S, Li X, Chen C, Hu X. Genome-Wide Prediction of DNA Methylation Using DNA Composition and Sequence Complexity in Human. Int J Mol Sci. 2017;18(2). pmid:28212312
- View Article
- PubMed/NCBI
- Google Scholar
43. Zhang W, Spector TD, Deloukas P, Bell JT, Engelhardt BE. Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements. Genome Biol. 2015;16:14. pmid:25616342
- View Article
- PubMed/NCBI
- Google Scholar
44. Bhasin M, Zhang H, Reinherz EL, Reche PA. Prediction of methylated CpGs in DNA sequences using a support vector machine. FEBS Lett. 2005;579(20):4302–8. pmid:16051225
- View Article
- PubMed/NCBI
- Google Scholar
45. Zheng H, Wu H, Li J, Jiang SW. CpGIMethPred: computational model for predicting methylation status of CpG islands in human genome. BMC Med Genomics. 2013;6 Suppl 1:S13. pmid:23369266
- View Article
- PubMed/NCBI
- Google Scholar
46. Song B, Li F, Liu Y, Zeng XJBiB. Deep learning methods for biomedical named entity recognition: a survey and qualitative comparison. Briefings in Bioinformatics. 2021;22(6):bbab282. pmid:34308472
- View Article
- PubMed/NCBI
- Google Scholar
47. Cheng Y, Gong Y, Liu Y, Song B, Zou Q. Molecular design in drug discovery: a comprehensive review of deep generative models. Briefings in Bioinformatics. 2021;22(6). pmid:34415297
- View Article
- PubMed/NCBI
- Google Scholar
48. Yizhar-Barnea O, Valensisi C, Jayavelu ND, Kishore K, Andrus C, Koffler-Brill T, et al. DNA methylation dynamics during embryonic development and postnatal maturation of the mouse auditory sensory epithelium. Sci Rep. 2018;8(1):17348. pmid:30478432
- View Article
- PubMed/NCBI
- Google Scholar
49. Zhou Y, Zhang Y, Lian X, Li F, Wang C, Zhu F, et al. Therapeutic target database update 2022: facilitating drug discovery with enriched comparative data of targeted agents. Nucleic Acids Res. 2022;50(D1):D1398–D407. pmid:34718717
- View Article
- PubMed/NCBI
- Google Scholar
50. Yu L, Xia M, An Q. A network embedding framework based on integrating multiplex network for drug combination prediction. Briefings in bioinformatics. 2021.
- View Article
- Google Scholar
51. Kim S, Li M, Paik H, Nephew K, Shi H, Kramer R, et al. Predicting DNA methylation susceptibility using CpG flanking sequences. Pac Symp Biocomput. 2008:315–26. pmid:18229696
- View Article
- PubMed/NCBI
- Google Scholar
52. Fang F, Fan S, Zhang X, Zhang MQ. Predicting methylation status of CpG islands in the human brain. Bioinformatics. 2006;22(18):2204–9. pmid:16837523
- View Article
- PubMed/NCBI
- Google Scholar
53. Pan X, Lin X, Cao D, Zeng X, Yu PS, He L, et al. Deep learning for drug repurposing: Methods, databases, and applications. Wiley Interdisciplinary Reviews: Computational Molecular Science. 2022:e1597.
- View Article
- Google Scholar
54. Liu Y, Zhang X, Zou Q, Zeng X. Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers. Bioinformatics. 2021;37(11):1604–6. pmid:33112385
- View Article
- PubMed/NCBI
- Google Scholar
55. Fu T, Li F, Zhang Y, Yin J, Qiu W, Li X, et al. VARIDT 2.0: structural variability of drug transporter. Nucleic Acids Res. 2022;50(D1):D1417–D31. pmid:34747471
- View Article
- PubMed/NCBI
- Google Scholar
56. Ao C, Yu L, Zou Q. Prediction of bio-sequence modifications and the associations with diseases. Briefings in Functional Genomics. 2021;20(1):1–18. pmid:33313647
- View Article
- PubMed/NCBI
- Google Scholar
57. Hong J, Luo Y, Zhang Y, Ying J, Xue W, Xie T, et al. Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning. Brief Bioinform. 2020;21(4):1437–47. pmid:31504150
- View Article
- PubMed/NCBI
- Google Scholar
58. Hong J, Luo Y, Mou M, Fu J, Zhang Y, Xue W, et al. Convolutional neural network-based annotation of bacterial type IV secretion system effectors with enhanced accuracy and reduced false discovery. Brief Bioinform. 2020;21(5):1825–36. pmid:31860715
- View Article
- PubMed/NCBI
- Google Scholar
59. Wu X, Yu L. EPSOL: sequence-based protein solubility prediction using multidimensional embedding. Bioinformatics (Oxford, England). 2021. pmid:34145885
- View Article
- PubMed/NCBI
- Google Scholar
60. Li F, Zhou Y, Zhang Y, Yin J, Qiu Y, Gao J, et al. POSREG: proteomic signature discovered by simultaneously optimizing its reproducibility and generalizability. Brief Bioinform. 2022;23(2):bbac040. pmid:35183059
- View Article
- PubMed/NCBI
- Google Scholar
61. Shen Z, Zou Q. Basic polar and hydrophobic properties are the main characteristics that affect the binding of transcription factors to methylation sites. Bioinformatics. 2020;36(15):4263–8 pmid:32399547
- View Article
- PubMed/NCBI
- Google Scholar
62. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011;27(11):1571–2. pmid:21493656
- View Article
- PubMed/NCBI
- Google Scholar

