regex - How to count the number of bold words and italic words in a markdown syntax file
I've read that bold and italic words can be represented in Markdown as **bold_text** and *italic_text*, respectively. To have both bold and italic text at once, you can wrap the text in four asterisks for bold and two underscores for italic (or vice versa).
I want to write a bash script that determines the number of bold words and italic words. I guess it comes down to counting the number of double asterisks and single asterisks, and double underscores and single underscores. My question is how to count the number of specific strings such as "**" or "__" in a file, so that I can know how many bold and italic words there are.
#!/bin/bash
if [ -z "$1" ]; then
    echo "no input file specified."
else
    ls $1 > /dev/null 2> /dev/null && echo $(cat $1 | grep -o '\<**>\' | wc -c) || echo "file $1 does not exist."
fi
Example input file:
**This is bold and _italic_** text.
Expected output:
bold words: 5
italic words: 1
bold and italic words: 1
Simple approach
A few assumptions:
- Bold uses __, and italic uses * (even though it might be ** and _)
- There is no "funny stuff" such as (inline) code containing these characters, escaped _ or *, or lists with a leading * to throw our count off
Now, to count bold words, we can use
grep -Po '__.*?__' infile.md | grep -o '[^[:space:]]\+' | wc -l
This looks for everything between two pairs of __. I used the Perl regex engine (-P) to enable non-greedy matching (.*?); otherwise, __bold__ not bold __bold__ would be a single match. -o returns just the matches.
The second grep matches words: any sequence of one or more non-space characters; and wc -l counts the lines of output.
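To see why the non-greedy match matters, here is a minimal sketch that runs the same pipeline with a lazy and a greedy pattern on the example line. The function names count_bold_lazy and count_bold_greedy are made up for this illustration, and GNU grep with -P support is assumed.

```shell
# Count whitespace-separated words inside __...__ spans read from stdin.
count_bold_lazy() {
    grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l
}

# Same pipeline with a greedy .* for comparison.
count_bold_greedy() {
    grep -Po '__.*__' | grep -o '[^[:space:]]\+' | wc -l
}

line='__bold__ not bold __bold__'
printf '%s\n' "$line" | count_bold_lazy     # two one-word matches: 2
printf '%s\n' "$line" | count_bold_greedy   # one match spanning the whole line: 4
```

The greedy version also counts "not" and "bold" in the middle, which is exactly the overcount the lazy quantifier avoids.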
The same works for italics:
grep -Po '\*.*?\*' infile.md | grep -o '[^[:space:]]\+' | wc -l
To combine these (for bold and italic), the two commands have to be chained. Italic inside bold:
grep -Po '__.*?__' infile.md | grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l
And bold inside italic:
grep -Po '\*.*?\*' infile.md | grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l
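As a rough illustration of the chaining, the two nested counts can be wrapped in one helper that takes the outer and inner pattern as arguments. The name count_nested and the sample line are assumptions made for this sketch; the patterns are the ones used above, and GNU grep with -P is assumed.

```shell
# count_nested OUTER INNER
# Reads stdin, keeps only spans matching OUTER, then counts the words in
# spans matching INNER within those.
count_nested() {
    grep -Po "$1" | grep -Po "$2" | grep -o '[^[:space:]]\+' | wc -l
}

bold='__.*?__'
italic='\*.*?\*'

sample='__bold with *two italics* inside__ and *italics with __bold__ inside*'
printf '%s\n' "$sample" | count_nested "$bold" "$italic"   # italic inside bold: 2
printf '%s\n' "$sample" | count_nested "$italic" "$bold"   # bold inside italic: 1
```

Adding the two results gives the number of words that are both bold and italic, here 3.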
Cleaning a more realistic file
Now, a real Markdown file might have a few surprises (see "assumptions"):
* List item with **bold word**

Line with **bold words and \* an escaped asterisk**

Here is an *italicized* word

And *italics with a **bold** word inside*

And **bold words with *italics* inside**

    Code can have tons of *, ** and _ and we want to ignore them all

Also `inline code can have * and ** and _ to be ignored`, right?
which would render as
- List item with bold word
Line with bold words and * an escaped asterisk
Here is an italicized word
And italics with a bold word inside
And bold words with italics inside
Code can have tons of *, ** and _ and we want to ignore them all
Also inline code can have * and ** and _ to be ignored, right?
One approach is to clean this up with a sed script:
/^$/d                         # delete empty lines
/^    /d                      # delete code lines (start with four spaces)
s/`[^`]*`//g                  # remove inline code
/^\* /s/^\* (.*)/\1/          # remove asterisk from list items
s/\\\*//g                     # remove escaped asterisks
s/\\_//g                      # remove escaped underscores
s/`[^`]*`//g                  # remove inline code
s/\*\*/__/g                   # make sure bold uses underscores
s/(^|[^_])_([^_]|$)/\1\*\2/g  # make sure italics use asterisks
with the following result:
$ sed -rf md.sed infile.md
List item with __bold word__
Line with __bold words and  an escaped asterisk__
Here is an *italicized* word
And *italics with a __bold__ word inside*
And __bold words with *italics* inside__
Also , right?
This is ready for consumption by the commands from the first section.
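The two normalizing substitutions are the least obvious part of the cleanup, so here is a small sketch that applies only those two to a single line. The wrapper name normalize_emphasis and the input line are made up for the example; GNU sed with -r is assumed.

```shell
# Rewrite ** to __ (bold) and single _ to * (italic), reading stdin.
normalize_emphasis() {
    sed -r '
        s/\*\*/__/g
        s/(^|[^_])_([^_]|$)/\1*\2/g
    '
}

printf '%s\n' '**bold** and _italic_ and __bold *nested*__' | normalize_emphasis
# prints: __bold__ and *italic* and __bold *nested*__
```

The second substitution only touches an underscore that has no other underscore directly next to it, which is why the __ pairs survive.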
Putting it together
Everything in a script that takes the Markdown file name as its argument:
#!/bin/bash
fname="$1"
tempfile="$(mktemp)"
sed -r '
    /^$/d
    /^    /d
    s/`[^`]*`//g
    /^\* /s/^\* (.*)/\1/
    s/\\\*//g
    s/\\_//g
    s/`[^`]*`//g
    s/\*\*/__/g
    s/(^|[^_])_([^_]|$)/\1\*\2/g
' "$fname" > "$tempfile"
bold=$(grep -Po '__.*?__' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
italic=$(grep -Po '\*.*?\*' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
both=$((
    $(grep -Po '__.*?__' "$tempfile" | grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l) +
    $(grep -Po '\*.*?\*' "$tempfile" | grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l)
))
rm -f "$tempfile"
echo "bold words: $bold"
echo "italic words: $italic"
echo "bold and italic words: $both"
which can be used like this:
$ ./wordcount infile.md
bold words: 14
italic words: 8
bold and italic words: 2
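The script above does not check its argument, unlike the attempt in the question. One possible way to add that is a small helper along these lines; check_input is a hypothetical name and the messages are only placeholders.

```shell
# check_input FILE (hypothetical helper, not part of the original script)
# Returns 0 if FILE is a readable regular file; otherwise prints a message
# to stderr and returns 1.
check_input() {
    if [ -z "$1" ]; then
        echo "no input file specified." >&2
        return 1
    fi
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "file $1 does not exist or is not readable." >&2
        return 1
    fi
    return 0
}
```

Calling check_input "$fname" || exit 1 before the sed step would stop the script early on a missing file, and a trap 'rm -f "$tempfile"' EXIT right after mktemp would remove the temporary file even if a later command fails.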
Shortcomings
- This can be tripped up by words containing underscores. Some Markdown flavours ignore these and assume they're part of the word.
- I'm sure I missed a few edge cases in the cleanup.
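The underscore problem is easy to reproduce. The sketch below first shows the false positive, then one possible guard: deleting underscores that sit between two alphanumeric characters before the italics conversion. All three function names are made up for this example, the guard is my own assumption rather than something a particular Markdown flavour prescribes, and it would also hide emphasis that genuinely starts or ends in the middle of a word.

```shell
# The italics conversion from the cleanup script, reading stdin.
to_asterisks() {
    sed -r 's/(^|[^_])_([^_]|$)/\1*\2/g'
}

# The italic word count from the first section, reading stdin.
count_italic() {
    grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l
}

# Hypothetical guard: repeatedly drop any _ between two alphanumerics.
# The loop is needed because adjacent matches share a character.
strip_intraword_underscores() {
    sed -r -e ':a' -e 's/([[:alnum:]])_([[:alnum:]])/\1\2/' -e 'ta'
}

line='call my_helper_function here'
printf '%s\n' "$line" | to_asterisks | count_italic                                 # 1, a false positive
printf '%s\n' "$line" | strip_intraword_underscores | to_asterisks | count_italic   # 0
```

Real emphasis such as _italic_ surrounded by spaces is left alone by the guard, because its underscores are not between two alphanumeric characters.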