Data for : Poly(A) Dataset for PAS sequences and pseudo-PAS sequences Classification (fasta format)
Jankovic, Boris R.
Van Neste, Christophe
Bajic, Vladimir B.
KAUST Department: Applied Mathematics and Computational Science Program
Computational Bioscience Research Center (CBRC)
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Permanent link to this record: http://hdl.handle.net/10754/656663
Description: This dataset contains DNA sequences from the human genome assembly hg38 (GRCh38), obtained from the GENCODE folder on the EBI FTP server (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz).
A- Positive set (PAS sequences)
Using the GENCODE annotation for poly(A) features (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.polyAs.gff3.gz), we selected the poly(A) signal annotations. With the bedtools slop option, we extended each region by 300 bp upstream and 300 bp downstream of the poly(A) hexamer. With the bedtools getfasta option, we extracted the corresponding 606 bp FASTA sequences (300 bp + 6 bp hexamer + 300 bp). After eliminating duplicates, we obtained 37’516 presumed true functional poly(A) signal (PAS) sequences. Sequences from this set are denoted as positive.
B- Negative set (pseudo-PAS sequences)
For the negative set, we used bedtools complement to obtain the regions lying outside the windows covering 1’000 bp upstream and downstream of each positive poly(A) hexamer signal. The Homer tool was used to find matches for the 12 most frequent human poly(A) variants in these regions. Since the number of matches was very large, sampling was used to select 37’516 pseudo-PAS sequences. Sampling was done from each chromosome in proportion to chromosome length and to the expected frequency of each poly(A) variant. From these matches, for each PAS hexamer, we selected the same number of pseudo-PAS sequences as in the positive set.
Training and testing sets
From each of the positive and negative datasets, we randomly selected 20% of the sequences as independent test data. The testing set thus consisted of 15’020 sequences. The remaining data formed the training set of 60’012 sequences. Both sets are balanced with respect to true PAS and pseudo-PAS sequences.
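The window arithmetic and the per-class random split described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `slop_interval` and `split_class`, the toy coordinates, the toy record names, and the random seeds are all hypothetical. Intervals are assumed to be 0-based and half-open, as in BED files. The sketch does not attempt to reproduce the exact set sizes reported in the description.

```python
import random

FLANK = 300        # bp added on each side, as stated in the description
HEXAMER_LEN = 6    # length of the poly(A) signal hexamer


def slop_interval(start, end, chrom_len, flank=FLANK):
    """Extend a 0-based, half-open interval by `flank` bp on both sides.

    The result is clipped to [0, chrom_len], so windows near a chromosome
    end can be shorter than 2 * flank + (end - start).
    """
    return max(0, start - flank), min(chrom_len, end + flank)


def split_class(records, test_fraction=0.2, seed=0):
    """Randomly split one class into (train, test) lists.

    Applying this separately to positives and negatives keeps both
    resulting sets balanced when the two classes have equal size.
    """
    rng = random.Random(seed)      # fixed seed (an assumption) for repeatability
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical hexamer occupying positions [1000, 1006) on a 10 kb chromosome.
start, end = slop_interval(1000, 1000 + HEXAMER_LEN, chrom_len=10_000)
window_len = end - start           # 300 + 6 + 300

# Toy stand-ins for the positive and negative sequence identifiers.
positives = [f"pos_{i}" for i in range(100)]
negatives = [f"neg_{i}" for i in range(100)]

pos_train, pos_test = split_class(positives, seed=1)
neg_train, neg_test = split_class(negatives, seed=2)

train = pos_train + neg_train
test = pos_test + neg_test
```

Splitting each class on its own, rather than splitting the pooled data, is what guarantees that the test and training sets each contain equal numbers of PAS and pseudo-PAS sequences.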
Citation: Albalawi, F., Chahid, A., Guo, X., Albaradei, S., Magana-Mora, A., Jankovic, B. R., Uludag, M., Van Neste, C., Essack, M., Laleg-Kirati, T.-M., & Bajic, V. B. (2018). Data for : Poly(A) Dataset for PAS sequences and pseudo-PAS sequences Classification (fasta format) [Data set]. KAUST Research Repository. https://doi.org/10.25781/KAUST-JHSGI
Sponsors: This work has been supported by the King Abdullah University of Science and Technology (KAUST) Base Research Fund (BAS/1/1606-01-01) to VBB and (BAS/1/1627-01-01) to TMLK, and by the KAUST Office of Sponsored Research (OSR) under Award No. CARF – FCC/1/1976-17-01.
Publisher: KAUST Research Repository
Relations: Is Supplement To:
Albalawi F, Chahid A, Guo X, Albaradei S, Magana-Mora A, et al. (2019) Hybrid model for efficient prediction of poly(A) signals in human genomic DNA. Methods. DOI: 10.1016/j.ymeth.2019.04.001 HANDLE: 10754/631950
Albalawi, Fahad; Chahid, Abderrazak; Guo, Xingang; Albaradei, Somayah; Magaña Mora, Arturo; Jankovic, Boris; Uludag, Mahmut; Van Neste, Christophe; Essack, Magbubah; Laleg-Kirati, Taous Meriem; Bajic, Vladimir B. (2018), “Software of prediction models for 12 poly(A) signal variants in human and feature vectors of regions covering [-300,poly(A) hexamer,+300] for these 12 signal variants.”, Mendeley Data, v1. DOI: 10.17632/c495bkk9vf.1 HANDLE: 10754/656663.1