Booom...........

LabNotes and Random Thoughts...............

Wednesday, August 12, 2015

Plotting the read length frequency

The source of fasta file is after merging R1s and R2s suing pandaseq

This one liner gives the length distribution and which can be plotted using gnuplot

infoseq -only -length -auto -heading N testV.fasta | sort | uniq -c | sort -k2nr| awk 'BEGIN{print"len\tfreq"}{printf("%s\t%s\n",$2,$1)}' >ldist

so I try doing it on R , even though the initial steps are comparatively slow  compared to the Biopython 

Total number of reads : 1160974 here is the code in R and the Plot.
Untitled

Ploting the length distribution of merged fasta files from pandaseq output

In [1]:
# source("http://bioconductor.org/biocLite.R")
# biocLite(Biostrings)

suppressMessages(library(Biostrings))
filename = 'testV.fasta'
sr = readDNAStringSet(filename, format="fasta")
freq=table(width(sr))
len=as.numeric(names(freq))
In [2]:
plot(range(width(sr)), range(as.numeric( freq )), type='n',
     xlab=" read length", 
     ylab="freq",
     main="Distribution of length of fasta")
lines(len, freq, type='b', lwd=.3, col='black', pch=18)

No comments: