Genomic Scan, Frequency & Distribution of SEED

Viewing 6 reply threads
  • Author
    • #8931

      I want to scan a whole genome (fasta format) for its 100% exact matches with a seed sequence which is 7-mer long. Blast (and that goes for any version as far as i know) NEVER, even with the correct parameters which adjust for word size and its e-value – gets ONLY the exact matches because of its extension and similarity methodology which focusses on NEAR-exact matches (similarity).

      It’s seems very elementary, I want to scan a fasta formatted file, determine its sequence length and mark the seed start and end of every exact match which will allow me to map a distribution along the genome and determine it’s frequency. However, I can’t find any protocol, method or website that points me into the right direction eventhough the principle is SO BASIC. And trust me I have been trying. I’m aware that this must be a piece-of-cake for somebody that has the perl/coding skills I, quite obviously, lack

      I would have given up and done it with notepad a long time ago if I didnt have so many different sequences, large genomes, EST’s etc to go through.

      Any help would be greatly appreciated, though i fear that the explanation will be too simple and humiliate me šŸ™‚

    • #80510

      so let me get this straight, you just want it to mark the ends/starts of exact matches?

    • #80523

      I don’t know what visualization system you’re using but something like this might work if I understood you correctly

      code :#!/usr/local/bin/perl

      #fastastream from
      use Bio::FastaStream;
      my $fasta = '/path/to/file.fasta';
      my $seq = Bio::FastaStream->new($fasta);

      # get sequence data
      my $sequence = $seq->getSequence();
      #get length of sequence
      my $length = $seq->getSequenceLength();

      #replace matchpattern with your query sequence
      $sequence =~ s/MATCHPATTERN/...../g;

      #output to file

      So basically this code just finds your pattern and then replaces it with period characters which would show up if you’re doing a pairwise comparison.

    • #80544

      Thank you for your time mith. I apologize for any ambiguity but we are nearly there I think, if only we could replace the "#replace matchpattern with your query sequence" into a print output which would include the start and end of the position of the seed (search sequence) in the whole sequence string.
      e.g. ID(occurrence)-Seed-Start-End-Length(FullSequenceLength)

      To illustrate it an example – we have a string of length 42:
      T G C G T C C T A C A A G A C A A C T A T G G G G C C G C G T T T T C G C C G C C A
      And lets say Iā€™m looking for a seed sequence of 3-mer, CGT.
      T G.C G T.C C T A C A A G A C A A C T A T G G G G C C G.C G T.T T T C G C C G C C A
      What I want to print to a file would be the following tab delimited data:

      ID(occurrence)-Seed-Start-End-Length(FullSequenceLength) like:
      2 -CGT-29-32-42

      I’m sorry to bother any of you with my lack of perl skills, but as a european working/studying biochemistry at a chinese university ; my skills as a researcher are not that well balanced, plus IT personnel has yet to be discovered in my department.

    • #80547

      I don’t have perl with me right now(on vacation) but here’s how I would do it(no testing done so it’s probably buggy).

      code :#!/usr/local/bin/perl

      #fastastream from
      use Bio::FastaStream;
      my $fasta = '/path/to/file.fasta';
      my $seq = Bio::FastaStream->new($fasta);

      #query data here
      $query = "something";
      $querylength = 10;


      # get sequence data
      my $sequence = $seq->getSequence();
      #get length of sequence
      my $length = $seq->getSequenceLength();

      #replace matchpattern with your query sequence, function substitute matches pattern with "." of equal length to query sequence(they're placeholders)
      $sequence =~ s/$query/...../g;

      #cuts sequence into array of individual letters
      chomp $sequence;
      @chars = split(//, $sequence);

      for ($i=0; $i<=$#chars; $i++) {
      if($char[$i] eq "."){
      push(@found, $i);


      print @found;
      print "\n $length";

      I’m not quite sure why you would want the redundancy of printing the seed info and the sequence length multiple times. The end position is also easily found as start+seed length.
      The example I posted does the following things

      1. Scans thru the fasta file and gets the sequence.
      2. Matches the sequence to seed sequence. Each time it matches, it replaces those letters with periods. Make sure you change it so the number of periods added equals the length of the seed sequence. Now your sequence looks something like this


      3. It breaks the sequence down into an array of chars, each array element is a letter. It attempts to find the occurence of ".". The period signifies the start of the seed. Then it skips forward seed length-1 to find the next seed occurence.

      4. Each seed start position is recorded in the found array which gets printed along with the sequence length.

      On an unrelated note, why are you working in China? I thought the Europeans have great bioinformatics programs…the germans certainly do.

    • #80627

      Below I posted the code that I eventually managed to frabricate with some help from tutorials that elucidated perl for me. I read the fasta seed from a fasta file and the genomic db is being read in line by line buffered. The code is not beautiful but does the job. The superfluous printing is because I have many seed sequences and many databases to scan and I don’t want to get lost in my own data. The tabs in the output are for later processing.

      hmm China, where to start. I originally come from Wageningen, The Netherlands. I wanted something more than just studying with mediocre effort and getting a job in the commercial sector (like many students sadly do these days) and came here. Why China? Because China is not ‘Western’.The standards ofcourse are not quite up to par just yet, but they are definitly getting there. Being here is great, it has opened my eyes as a scientist and as a person because the differences are surprisingly immense. Getting things done here takes more effort but that’s has been a learning experience too. The value of having been here the last two years = priceless. I wonder what it will be like though if I decide to go back to Europe, time will tell.

      Mith you’re a legend for helping out people during your vacation.

      With great appreciation,
      my best regards!

      code :#!/usr/local/bin/perl
      use warnings;
      use strict;
      use Carp;

      # Make sure the program is called with two arguments the first the db the other the fasta to scan with

      open(FSTRING,$ARGV[1]) or croak("Database $ARGV[1] not found:$!\n");
      my $fragmenth = <FSTRING>;
      chomp $fragmenth;
      my $fragment = <FSTRING>;
      my $fraglen = length($fragment);
      close (FSTRING);

      # $fragment: the pattern to search for
      # $fraglen: the length of $fragment
      # $buffer: a buffer to hold the DNA from the input file
      # $position: the position of the buffer in the total DNA
      my $position = 0;

      # The first line of a FASTA file is a header and begins with '>'
      open(DNA,"$ARGV[0]") or croak("Cannot open $ARGV[0]:$!\n");
      my $header = <DNA>;
      chomp $header;
      # Get the first line of DNA data, to start the ball rolling
      my $buffer = <DNA>;
      my $count = 0;
      chomp $buffer;
      my $str = time();
      # The remaining lines are DNA data ending with newlines
      while(my $newline = <DNA>) {

      # Add the new line to the buffer
      chomp $newline;
      $buffer .= $newline;
      # Search for the DNA fragment and length
      # (Report the character at string position 0 as being at position 1,
      # as usual in biology)
      while($buffer =~ /$fragment/gi) {
      $count ++;
      print "$count\t$fragment\t$fragmenth\t$ARGV[0]\t",$position + $-[0] + 1,"\t", $position + $-[0] + $fraglen,"\t$str\n";
      open (DBMATCH, ">>output.dat");
      print DBMATCH ("$count\t$fragment\t$fragmenth\t$ARGV[0]\t",$position + $-[0] + 1,"\t", $position + $-[0] + $fraglen,"\t$str \n");
      close (DBMATCH);


      # Reset the position counter (will be true after you reset the buffer, next)
      $position = $position + length($buffer) - $fraglen + 1;

      # Discard the data in the buffer, except for a portion at the end
      # so patterns that appear across line breaks are not missed
      $buffer = substr($buffer, length($buffer) - $fraglen + 1, $fraglen - 1);
      my $totlength = $position + $fraglen - 1;

      open (DBMATCH, ">>output.dat");
      print DBMATCH ("\t\t\t\t\t\t\tTOTAL\t$count\tmatches\t$totlength\tlengh\tath.fa\t$ARGV[0]\t$fragmenth\t$fragment\n\n");
      close (DBMATCH);
      print "\t\t\t\t\t\t\tTOTAL\t$count\tmatches\t$totlength\tlengh\tath.fa\t$ARGV[0]\t$fragmenth\t$fragment\n\n";

      print $str;
      close (DNA);

    • #80642

      Hey, best of luck. Looks better than my code, but if you want to skip all the fasta file parsing, just install the biofastastream module.

Viewing 6 reply threads
  • You must be logged in to reply to this topic.