LabNotes and Random Thoughts : Plotting the read length frequency

Wednesday, August 12, 2015

Plotting the read length frequency

The source of fasta file is after merging R1s and R2s suing pandaseq

This one liner gives the length distribution and which can be plotted using gnuplot

infoseq -only -length -auto -heading N testV.fasta | sort | uniq -c | sort -k2nr| awk 'BEGIN{print"len\tfreq"}{printf("%s\t%s\n",$2,$1)}' >ldist

so I try doing it on R , even though the initial steps are comparatively slow compared to the Biopython

Total number of reads : 1160974 here is the code in R and the Plot.
Untitled

Ploting the length distribution of merged fasta files from pandaseq output¶

In [1]:

# source("http://bioconductor.org/biocLite.R")
# biocLite(Biostrings)

suppressMessages(library(Biostrings))
filename = 'testV.fasta'
sr = readDNAStringSet(filename, format="fasta")
freq=table(width(sr))
len=as.numeric(names(freq))

In [2]:

plot(range(width(sr)), range(as.numeric( freq )), type='n',
     xlab=" read length", 
     ylab="freq",
     main="Distribution of length of fasta")
lines(len, freq, type='b', lwd=.3, col='black', pch=18)

No comments:

Subscribe to: Post Comments (Atom)