Data for : Poly(A) Dataset for PAS sequences and pseudo-PAS sequences Classification (fasta format)
Type
Dataset
Authors
Albalawi, Fahad
Chahid, Abderrazak
Guo, Xingang
Albaradei, Somayah
Magana-Mora, Arturo
Jankovic, Boris R.
Uludag, Mahmut
Van Neste, Christophe
Essack, Magbubah
Laleg-Kirati, Taous-Meriem
Bajic, Vladimir B.
KAUST Department
Applied Mathematics and Computational Science Program
Computational Bioscience Research Center (CBRC)
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Date
2018-11-15
Permanent link to this record
http://hdl.handle.net/10754/656663
Description
This dataset contains DNA sequences from the human genome assembly hg38, taken from the GENCODE folder on the EBI FTP server (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz).

A- Positive set (PAS sequences)
From the GENCODE annotation for poly(A) (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.polyAs.gff3.gz), we selected the poly(A) signal annotations. Using bedtools slop, we extended each region 300 bp upstream and 300 bp downstream of the poly(A) hexamer. Using bedtools getfasta, we extracted 606 bp FASTA sequences (300 + 6 + 300 bp) from these regions. After eliminating duplicates, we obtained 37,516 presumed true functional poly(A) signal (PAS) sequences. Sequences from this set are denoted as positive.

B- Negative set (pseudo-PAS sequences)
For the negative set, we used bedtools complement to find regions lying outside the 1,000 bp upstream and downstream of each positive poly(A) hexamer signal. The HOMER tool was used to find matches for the 12 most frequent human poly(A) variants. Because the number of matches was very large, sampling was used to select 37,516 pseudo-PAS sequences. Sampling was done from each chromosome in proportion to chromosome length and to the expected frequency of the poly(A) variants. Out of these matches, for each PAS hexamer, we selected the same number of pseudo-PAS sequences as in the positive set.

Training and testing sets
We randomly selected 20% of the sequences from each of the positive and negative datasets as independent test data. The testing set thus consisted of 15,020 sequences. The remaining 60,012 sequences formed the training set. Both sets are balanced with respect to true PAS and pseudo-PAS sequences.
Citation
Albalawi, F., Chahid, A., Guo, X., Albaradei, S., Magana-Mora, A., Jankovic, B. R., Uludag, M., Van Neste, C., Essack, M., Laleg-Kirati, T.-M., & Bajic, V. B. (2018). Data for : Poly(A) Dataset for PAS sequences and pseudo-PAS sequences Classification (fasta format) [Data set]. KAUST Research Repository. https://doi.org/10.25781/KAUST-JHSGI
Sponsors
This work has been supported by the King Abdullah University of Science and Technology (KAUST) Base Research Fund (BAS/1/1606-01-01) to VBB, (BAS/1/1627-01-01) to TMLK, and KAUST Office of Sponsored Research (OSR) under Awards No CARF – FCC/1/1976-17-01.
Publisher
KAUST Research Repository
Relations
Is Supplement To:
- [Article]
Albalawi F, Chahid A, Guo X, Albaradei S, Magana-Mora A, et al. (2019) Hybrid model for efficient prediction of poly(A) signals in human genomic DNA. Methods. DOI: 10.1016/j.ymeth.2019.04.001 HANDLE: 10754/631950
- [Dataset]
Albalawi, Fahad; Chahid, Abderrazak; Guo, Xingang; Albaradei, Somayah; Magaña Mora, Arturo; Jankovic, Boris; Uludag, Mahmut; Van Neste, Christophe; Essack, Magbubah; Laleg-Kirati, Taous Meriem; Bajic, Vladimir B. (2018), “Software of prediction models for 12 poly(A) signal variants in human and feature vectors of regions covering [-300,poly(A) hexamer,+300] for these 12 signal variants.”, Mendeley Data, v1 DOI: 10.17632/c495bkk9vf.1 HANDLE: 10754/656663.1
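The window layout given in the Description (300 bp upstream, the 6 bp hexamer, 300 bp downstream) and the per-class 20% hold-out can be illustrated with a short Python sketch. This is not the authors' code: the function names, the toy record, the random seed, and the assumption that the hexamer sits at 0-based positions 300 to 305 of each 606 bp sequence are illustrative only, and the split is not meant to reproduce the exact reported set sizes.

```python
import random

FLANK = 300                            # bp on each side of the hexamer (from the Description)
HEXAMER_LEN = 6
WINDOW = FLANK + HEXAMER_LEN + FLANK   # 606 bp in total


def read_fasta(lines):
    """Parse FASTA text (an iterable of lines) into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def central_hexamer(seq):
    """Return the 6 bases at 0-based positions 300..305 of a 606 bp window.

    Assumes the hexamer is centred in the window; this is inferred from the
    Description and not verified against the dataset files.
    """
    if len(seq) != WINDOW:
        raise ValueError(f"expected {WINDOW} bp, got {len(seq)}")
    return seq[FLANK:FLANK + HEXAMER_LEN]


def split_class(records, test_fraction=0.2, seed=0):
    """Shuffle one class and hold out a fraction of it; returns (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy record (hypothetical, not taken from the dataset).
demo = [">toy_record", "A" * 300 + "AATAAA" + "C" * 300]
header, seq = read_fasta(demo)[0]
print(len(seq), central_hexamer(seq))  # 606 AATAAA
```

Applying `split_class` separately to the positive and the negative records keeps both the training and the testing set balanced, which is the property the Description states for the published split.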
DOI
10.25781/KAUST-JHSGI
The following license files are associated with this item: