sed - Conditional substitution of patterns in bash strings depending on the beginning of a string -

I am new to bash, so excuse me if I do not use the right terms.

I need to substitute patterns of 6 characters in a set of files. The order in which the patterns are substituted depends on the beginning of each string of text.

This is an example of the input:

chr1:123-123 5gggttagggttagggttagggttagggtta3
chr1:456-456 5ttagggttagggttagggttagggttaggg3
chr1:789-789 5gggctagggttagggttagggtta3

chr1:123-123 etc. is the name of the string, separated from the string I need to work on by a tab. The string I need to work on is delimited by the characters 5 and 3, but I can change them.
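As a small illustration of the record layout just described, the sketch below splits one record on the tab and strips the 5 and 3 markers. The helper name `parse_record` is hypothetical and is not part of the script in this question.

```bash
#!/bin/bash
# parse_record is a hypothetical helper, shown only to illustrate the layout:
#   <name><TAB>5<sequence>3
parse_record() {
    local name seq
    # Split the record on the tab into the name and the delimited sequence
    IFS=$'\t' read -r name seq <<< "$1"
    seq=${seq#5}   # drop the leading 5 marker
    seq=${seq%3}   # drop the trailing 3 marker
    printf '%s %s\n' "$name" "$seq"
}

parse_record $'chr1:456-456\t5ttagggttaggg3'   # prints: chr1:456-456 ttagggttaggg
```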

I want all patterns containing t, a, g in any of these orders to be substituted with x: ttaggg, tagggt, agggtt, gggtta, ggttag, gttagg.

Similarly, patterns containing ctaggg, as in row 3, in orders similar to the previous ones, must be substituted with a different character. The same game is repeated for specific differences in each of the 6 characters composing each pattern. I started writing this:

#!/bin/bash
normal=$'\033[m'
red=$'\033[31m' # red

# Read the name of the input file, then create a copy of it in an output folder
read -r -p "insert the name of the input file: " input
echo "creating output file ${red}${input}_sub.txt${normal}"
mkdir -p ./"$input"_output
cp "$input".txt ./"$input"_output/"$input"_sub.txt
echo

# Start the first set of instructions
# (perfrep and onemism must be defined above this point in the script)
perfrep
# Start the second set of instructions, to substitute patterns with 1 difference from ttaggg
onemism

The instructions are:

perfrep() {
    sed -i -e 's/ttaggg/x/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/tagggt/x/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/agggtt/x/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/gggtta/x/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/ggttag/x/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/gttagg/x/g' ./"$input"_output/"$input"_sub.txt
}

# Second set of instructions, to substitute patterns with 1 difference from ttaggg
onemism() {
    sed -i -e 's/[gca]taggg/l/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/g[gca]tagg/l/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/gg[gca]tag/l/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/ggg[gca]ta/l/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/aggg[gca]t/l/g' ./"$input"_output/"$input"_sub.txt
    sed -i -e 's/taggg[gca]/l/g' ./"$input"_output/"$input"_sub.txt
}

I need to repeat this for t[gca]aggg, tt[tcg]ggg, tta[act]gg, ttag[act]g, and ttagg[act].

Using this procedure, I get these results for the inputs shown:

5gggxxxxtta3
5xxxxx3
5ggglxxtta3

From my point of view, and for my job, the first and second strings are both made of x repeated 5 times; only the order of the characters is different. On the other hand, the third one should be masked like this:

5lxxx3

How can I tell the script that, if a string starts with 5gggtta instead of 5ttaggg, it must start substituting with

sed -i -e 's/gggtta/x/g' ./"$input"_output/"$input"_sub.txt

instead of

sed -i -e 's/ttaggg/x/g' ./"$input"_output/"$input"_sub.txt

I need to repeat this for all the cases; for instance, if the string starts with gttagg I need to start with

sed -i -e 's/gttagg/x/g' ./"$input"_output/"$input"_sub.txt

and so on, and then add a couple of variations of the pattern.

I need to repeat the substitution with ttaggg and its variations for all the rows of the input file.

Sorry for the long question. Thank you all.

Adding the information asked for by Varun.

The patterns of 6 characters are ttaggg, [gca]taggg, t[gca]aggg, tt[tcg]ggg, tta[act]gg, ttag[act]g, and ttagg[act]. Each one must be checked in a different frame; for instance, ttaggg has 6 frames: ttaggg, gttagg, ggttag, gggtta, agggtt, tagggt. The same frames must be applied to each pattern containing a variable position.

I have a total of 42 patterns to check, divided into 7 groups: 1 containing ttaggg and its derivative frames, and 6 with patterns containing a variable position and their derivatives. ttaggg and its derivatives are the most important and need to be checked first.
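The 7 groups of 6 frames can be generated rather than typed by hand. The following is only a sketch: the function names `frames` and `all_patterns` are hypothetical, and each pattern is passed as six separate positions so that a bracket expression such as [gca] is rotated as a single unit.

```bash
#!/bin/bash
# frames and all_patterns are hypothetical helper names, not part of the
# original script.

# Print the 6 frames (rotations) of a pattern given as separate positions,
# e.g. frames t t a g g g   or   frames '[gca]' t a g g g
frames() {
    local -a pos=("$@")
    local n=${#pos[@]} i j out
    for ((i = 0; i < n; i++)); do
        out=""
        for ((j = 0; j < n; j++)); do
            out+=${pos[(i + j) % n]}
        done
        printf '%s\n' "$out"
    done
}

# Print all 42 patterns, with the ttaggg group first
all_patterns() {
    frames t t a g g g
    frames '[gca]' t a g g g
    frames t '[gca]' a g g g
    frames t t '[tcg]' g g g
    frames t t a '[act]' g g
    frames t t a g '[act]' g
    frames t t a g g '[act]'
}

all_patterns | grep -c ''   # prints 42
```

Each printed line is usable as the pattern part of a sed expression, for example `s/taggg[gca]/l/g`.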

#! /usr/bin/awk -f

# Generate a "frame" by moving the first char to the end
function rotate(base) { return substr(base, 2) substr(base, 1, 1) }

# Unfortunately awk arrays do not store regexps,
# so generate the list of derivative strings to match.
# (The caller stores the frame itself before calling this.)
function generate_derivative(frame, arr,    j, k, head, read, tail) {
    for (j = 1; j <= length(frame); j++) {
        head = substr(frame, 1, j - 1)
        read = substr(frame, j, 1)
        tail = substr(frame, j + 1)
        for (k = 1; k <= 3; k++) {
            # use a global index to simplify
            arr[++z] = head substr(snp[read], k, 1) tail
        }
    }
}

BEGIN {
    FS = "\t"
    # alternatives to a base
    snp["a"] = "tcg"; snp["t"] = "acg"; snp["g"] = "atc"; snp["c"] = "atg"

    # the primary target
    frame = "ttaggg"
    z = 1    # warning: global
    x[z] = frame
    # primary derivatives
    generate_derivative(frame, x)
    xn = z

    # secondary shifted targets and their derivatives
    for (i = 1; i < length(frame); i++) {
        frame = rotate(frame)
        l[++z] = frame
        generate_derivative(frame, l)
    }
}

/^chr[0-9:-]*\t5[actg]*3$/ {
    # because we care about the order of the primary matches
    for (i = 1; i <= xn; i++) { gsub(x[i], "x", $2) }
    # since we don't care about the order of the secondary matches
    for (hit in l) { gsub(l[hit], "l", $2) }
    print
}

END {
    # print the matches in the order they were generated
    #for (i = 1; i <= xn; i++) { print x[i] }
    #print ""
    #for (i = 1 + xn; i <= z; i++) { print l[i] }
}

If and only if you can generate a static matching order that you can live with, the awk script above may work. But it seems the primary patterns should take precedence, and yet a secondary rule would be better applied first in some cases. (No can do.)

If you need a more flexible matching pattern, I would suggest looking at "recursive descent parsing with backtracking" or "parsing expression grammars". But then you are not in the bash shell anymore.
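For the narrower case in which the frame is fixed by the first six characters of each string, a different technique from the static pattern list is possible: read the string in fixed 6-character windows. The sketch below is not the method of the awk script, and it rests on assumptions that may not hold for real data: the 5 and 3 markers have already been stripped, there are no insertions or deletions inside a string, every single-difference window is written as l regardless of which position differs, and the function names `hamming`, `pick_frame` and `mask` are hypothetical.

```bash
#!/bin/bash
# Hypothetical helpers: hamming, pick_frame, mask.

# Number of positions at which two strings differ
hamming() {
    local a=$1 b=$2 i d=0
    for ((i = 0; i < ${#a}; i++)); do
        if [[ ${a:i:1} != "${b:i:1}" ]]; then d=$((d + 1)); fi
    done
    echo "$d"
}

# Pick the rotation of ttaggg that is within 1 difference of the first 6 characters
pick_frame() {
    local first=${1:0:6} f
    for f in ttaggg tagggt agggtt gggtta ggttag gttagg; do
        if (( $(hamming "$first" "$f") <= 1 )); then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Walk the sequence in 6-character windows: x = exact frame, l = 1 difference
mask() {
    local seq=$1 frame out="" i win d
    # No usable frame at the start: leave the string unchanged
    frame=$(pick_frame "$seq") || { printf '%s\n' "$seq"; return 0; }
    for ((i = 0; i < ${#seq}; i += 6)); do
        win=${seq:i:6}
        if (( ${#win} < 6 )); then out+=$win; continue; fi
        d=$(hamming "$win" "$frame")
        if (( d == 0 )); then out+=x
        elif (( d == 1 )); then out+=l
        else out+=$win
        fi
    done
    printf '%s\n' "$out"
}

mask gggttagggttagggttagggttagggtta   # prints xxxxx
mask gggctagggttagggttagggtta         # prints lxxx
```

Because it starts a subshell for every window, this is slow on large files; the same windowing logic could be moved into awk if speed matters.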
