sed - Conditional substitution of patterns in bash strings depending on the beginning of a string
I am new to bash, so excuse me if I do not use the right terms.
I need to substitute patterns of 6 characters in a set of files. The order in which the patterns are substituted depends on the beginning of each string of text.
This is an example of the input:

    chr1:123-123	5gggttagggttagggttagggttagggtta3
    chr1:456-456	5ttagggttagggttagggttagggttaggg3
    chr1:789-789	5gggctagggttagggttagggtta3

chr1:123-123 etc. is the name of the string, and it is separated from the string I need to work on by a tab. The string I need to work on is delimited by the characters 5 and 3, but I can change them.
I want all patterns containing t, a, g in one of these orders to be substituted with x: ttaggg, tagggt, agggtt, gggtta, ggttag, gttagg.
Similarly, patterns containing ctaggg, as in row 3, in orders similar to the previous ones should be substituted with a different character. The same game is repeated for specific differences in each of the 6 characters composing each pattern. I started writing this:
    #!/bin/bash
    # perfrep and onemism (shown further down) must be defined before this point
    normal=$'\033[m'
    red=$'\033[31m'   # red
    # read the name of the input file and create a copy in an output folder
    read -r -p "insert name of input file: " input
    echo "creating output file ${red}${input}_sub.txt${normal}"
    mkdir -p ./"$input"_output
    cp "$input".txt ./"$input"_output/"$input"_sub.txt
    echo
    # start the first set of instructions
    perfrep
    # start the second set of instructions, to substitute patterns with 1 difference from ttaggg
    onemism

The instructions are:
    perfrep() {
        sed -i -e 's/ttaggg/x/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/tagggt/x/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/agggtt/x/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/gggtta/x/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/ggttag/x/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/gttagg/x/g' ./"$input"_output/"$input"_sub.txt
    }

    # start the second set of instructions, to substitute patterns with 1 difference from ttaggg
    onemism() {
        sed -i -e 's/[gca]taggg/l/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/g[gca]tagg/l/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/gg[gca]tag/l/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/ggg[gca]ta/l/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/aggg[gca]t/l/g' ./"$input"_output/"$input"_sub.txt
        sed -i -e 's/taggg[gca]/l/g' ./"$input"_output/"$input"_sub.txt
    }

I need to repeat this for t[gca]aggg, tt[tcg]ggg, tta[act]gg, ttag[act]g and ttagg[act].
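As a side note, each of the two functions above runs six separate sed passes over the file. Since sed applies its -e expressions to every line in the order given, the same substitutions can be done in one invocation per function. The following is only a sketch: the names perfrep_stream and onemism_stream are hypothetical, and it assumes filtering standard input to standard output instead of editing "$input"_sub.txt in place.

```shell
#!/bin/bash
# Hypothetical stream variants of perfrep and onemism (names are not from
# the original script). They read stdin and write stdout, no file is edited.

# Exact frames of ttaggg become x, tried in the same order as perfrep.
perfrep_stream() {
    sed -e 's/ttaggg/x/g' -e 's/tagggt/x/g' -e 's/agggtt/x/g' \
        -e 's/gggtta/x/g' -e 's/ggttag/x/g' -e 's/gttagg/x/g'
}

# Frames of [gca]taggg become l, tried in the same order as onemism.
onemism_stream() {
    sed -e 's/[gca]taggg/l/g' -e 's/g[gca]tagg/l/g' -e 's/gg[gca]tag/l/g' \
        -e 's/ggg[gca]ta/l/g' -e 's/aggg[gca]t/l/g' -e 's/taggg[gca]/l/g'
}

# Example: mask one sequence without touching any file.
printf '5gggctagggttagggttagggtta3\n' | perfrep_stream | onemism_stream
```

To process a whole file this way, something like `perfrep_stream < "$input".txt | onemism_stream > ./"$input"_output/"$input"_sub.txt` should be equivalent to calling perfrep and then onemism on the copy.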
Using the perfrep and onemism functions, the inputs shown give these results:

    5gggxxxxtta3
    5xxxxx3
    5ggglxxtta3

From my point of view, to do the job properly, the first and second strings should both be made of x repeated 5 times; only the order of the characters is different. On the other hand, the third one should be masked like this:

    5lxxx3

How can I tell the script that if the string starts with 5gggtta instead of 5ttaggg, it must start substituting with

    sed -i -e 's/gggtta/x/g' ./"$input"_output/"$input"_sub.txt

instead of

    sed -i -e 's/ttaggg/x/g' ./"$input"_output/"$input"_sub.txt

?

I need to repeat this for all cases; for instance, if the string starts with gttagg I need to start with

    sed -i -e 's/gttagg/x/g' ./"$input"_output/"$input"_sub.txt

and so on, and then add a couple of variations of the pattern.

I need to repeat the substitution of ttaggg and its variations for all rows of the input file.
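One possible way to make the first substitution depend on the beginning of the string is to compare the first six characters with each frame of ttaggg and pick the frame that differs in at most one position. The sketch below is not a tested solution for real data. The function names mask_seq and mask_stream are hypothetical, it falls back to the ttaggg frame when no frame fits (an assumption), it assumes lowercase a/c/g/t input, and it uses bash pattern substitution in place of sed.

```shell
#!/bin/bash
# mask_seq SEQ: print SEQ (without the 5 and 3 delimiters) masked in the
# frame suggested by its first six characters. Hypothetical helper.
mask_seq() {
    local seq=$1 first=${1:0:6} frame=ttaggg cand pat i mism
    # Choose the frame: the rotation of ttaggg that differs from the
    # first six characters in at most one position.
    for cand in ttaggg tagggt agggtt gggtta ggttag gttagg; do
        mism=0
        for ((i = 0; i < 6; i++)); do
            [ "${cand:i:1}" = "${first:i:1}" ] || mism=$((mism + 1))
        done
        if ((mism <= 1)); then
            frame=$cand
            break
        fi
    done
    # Exact copies of the chosen frame become x first.
    seq=${seq//$frame/x}
    # Then copies with one changed position become l. Exact copies are
    # already gone, so [acgt] can only match a mismatch here.
    for ((i = 0; i < 6; i++)); do
        pat="${frame:0:i}[acgt]${frame:i+1}"
        seq=${seq//$pat/l}
    done
    printf '%s\n' "$seq"
}

# mask_stream: read "name<TAB>5SEQ3" rows on stdin, print them masked.
mask_stream() {
    local name body
    while IFS=$'\t' read -r name body; do
        body=${body#5}
        body=${body%3}
        printf '%s\t5%s3\n' "$name" "$(mask_seq "$body")"
    done
}

# Example with the third input row; the sequence comes out as 5lxxx3.
printf 'chr1:789-789\t5gggctagggttagggttagggtta3\n' | mask_stream
```

With this frame choice the first two example rows both come out as 5xxxxx3. Like the sed version, the substitutions are global and not anchored to 6-character windows, so a sequence whose frame shifts partway through would still need extra handling.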
Sorry for the long question. Thank you all.
Adding the information asked for by varun.
The patterns of 6 characters are ttaggg, [gca]taggg, t[gca]aggg, tt[tcg]ggg, tta[act]gg, ttag[act]g and ttagg[act]. Each one must be checked in a different frame; for instance, ttaggg has 6 frames: ttaggg, gttagg, ggttag, gggtta, agggtt and tagggt. The same frames must be applied to each pattern containing a variable position.
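The frames do not have to be typed by hand. A minimal sketch, assuming a hypothetical helper named rotations that treats a bracket expression such as [gca] as one position and moves the first position to the end at each step:

```shell
#!/bin/bash
# rotations PATTERN: print every frame of PATTERN, one per line.
# A literal letter or a whole [..] class counts as one position.
rotations() {
    local -a units=()
    local rest=$1 i n
    # Split the pattern into its positions.
    while [ -n "$rest" ]; do
        if [[ $rest == \[* ]]; then
            units+=("${rest%%\]*}]")   # the class up to its closing bracket
            rest=${rest#*\]}
        else
            units+=("${rest:0:1}")
            rest=${rest:1}
        fi
    done
    n=${#units[@]}
    # Frame i starts at position i and wraps around to the beginning.
    for ((i = 0; i < n; i++)); do
        printf '%s' "${units[@]:i}" "${units[@]:0:i}"
        printf '\n'
    done
}

# The seven base patterns listed above, quoted so the shell does not glob them.
bases=('ttaggg' '[gca]taggg' 't[gca]aggg' 'tt[tcg]ggg'
       'tta[act]gg' 'ttag[act]g' 'ttagg[act]')

# Print all frames, group by group, with ttaggg and its frames first.
for base in "${bases[@]}"; do
    rotations "$base"
done
```

Each printed line can be used directly as a sed pattern, for example `sed -e "s/$frame/l/g"` inside a loop over the output.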
I have a total of 42 patterns to check, divided into 7 groups: 1 containing ttaggg and its derivative frames, and 6 containing the patterns with a variable position and their derivatives. ttaggg and its derivatives are the most important and need to be checked first.
    #!/usr/bin/awk -f

    # generate a "frame" by moving the first char to the end
    function rotate(base) {
        return substr(base, 2) substr(base, 1, 1)
    }

    # unfortunately awk arrays cannot store regexps,
    # so generate a list of derivative strings to match
    function generate_derivative(frame, arr,    j, k, head, read, tail) {
        for (j = 1; j <= length(frame); j++) {
            head = substr(frame, 1, j - 1)
            read = substr(frame, j, 1)
            tail = substr(frame, j + 1)
            for (k = 1; k <= 3; k++) {
                # use the global index z to simplify
                arr[++z] = head substr(snp[read], k, 1) tail
            }
        }
    }

    BEGIN {
        FS = OFS = "\t"   # OFS keeps the output tab-separated once $2 is modified
        # alternatives for each base
        snp["a"] = "tcg"; snp["t"] = "acg"; snp["g"] = "atc"; snp["c"] = "atg"
        # primary target
        frame = "ttaggg"
        z = 1   # warning: global
        x[z] = frame
        # primary derivatives
        generate_derivative(frame, x)
        xn = z
        # secondary shifted targets and derivatives
        for (i = 1; i < length(frame); i++) {
            frame = rotate(frame)
            l[++z] = frame
            generate_derivative(frame, l)
        }
    }

    /^chr[0-9:-]*\t5[actg]*3$/ {
        # because we care about the order of the primary matches
        for (i = 1; i <= xn; i++) { gsub(x[i], "x", $2) }
        # since we don't care about the order of the secondary matches
        for (hit in l) { gsub(l[hit], "l", $2) }
        print
    }

    END {
        # print the matches in the order they were generated
        #for (i = 1; i <= xn; i++) { print x[i] }
        #print ""
        #for (i = 1 + xn; i <= z; i++) { print l[i] }
    }

If you can generate a static matching order that you can live with, the above awk script should work. But you say the primary patterns should take precedence, and then that a secondary rule would be better applied first in some cases. (No can do.)
If you need more flexible pattern matching, I suggest looking at "recursive descent parsing with backtracking" or "parsing expression grammars". But then you are not in a bash shell anymore.