regex - How to count the number of bold words and italic words in a markdown syntax file

July 15, 2012

I've read that bold and italic words can be represented in Markdown as **bold_text** and *italic_text*, respectively. To have text that is both bold and italic at once, you can wrap the text in 4 asterisks for bold and 2 underscores for italic (or vice versa).

I want to write a bash script that determines the number of bold words and italic words. I guess it comes down to counting the number of double asterisks and single asterisks, and double underscores and single underscores. My question is how to count the number of specific strings such as "**" or "__" in a file, so that I can know how many bold and italic words there are.

#!/bin/bash

if [ -z "$1" ]; then
    echo "No input file specified."
else
    ls $1 > /dev/null 2> /dev/null &&
    echo $(cat $1 | grep -o '\<**>\' | wc -c) || echo "File $1 does not exist."
fi

Example input file:

**This is bold and _italic_** text.

Expected output:

bold words: 5
italic words: 1
bold and italic words: 1

Simple approach

A few assumptions:

Bold uses __, italic uses * (even though it might be ** and _).
No "funny stuff" such as (inline) code with these characters, escaped _ or *, or lists with a leading * that would throw our count off.

Now, to count bold words, we can use

grep -Po '__.*?__' infile.md | grep -o '[^[:space:]]\+' | wc -l

This looks for text between two pairs of __. I used the Perl regex engine (-P) to enable non-greedy matching (.*?); otherwise, __bold__ not bold __bold__ would be a single match. -o returns only the matches, one per line.

The second grep matches words, that is, sequences of one or more non-space characters, and wc -l counts the lines of output.

The same works for italics:

grep -Po '\*.*?\*' infile.md | grep -o '[^[:space:]]\+' | wc -l
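To see why the non-greedy quantifier matters, here is a small sketch that runs both variants on the sample line mentioned above. It assumes GNU grep built with PCRE support (-P); the variable names are only for illustration.

```shell
#!/bin/bash
# Sample line with two separate bold spans (illustrative).
line='__bold__ not bold __bold__'

# Non-greedy: each bold span is its own match, one per output line.
lazy=$(printf '%s\n' "$line" | grep -Po '__.*?__' | wc -l)

# Greedy: everything from the first __ to the last __ is a single match.
greedy=$(printf '%s\n' "$line" | grep -Po '__.*__' | wc -l)

echo "non-greedy matches: $lazy"
echo "greedy matches: $greedy"
```

The non-greedy pattern reports 2 matches and the greedy one reports 1, so with the greedy pattern the words "not bold" in the middle would end up counted as bold.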

To combine these (for bold and italic), the commands have to be chained. Italic inside bold:

grep -Po '__.*?__' infile.md | grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l

And bold inside italic:

grep -Po '\*.*?\*' infile.md | grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l
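Since all four pipelines share the same pieces, one way to cut the repetition is to wrap them in small shell functions. This is only a sketch: the function names spans_bold, spans_italic and count_words and the sample line are made up here, and it keeps the assumptions from above (bold is __, italic is *, GNU grep with -P).

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative, not part of the answer.

# Print every __bold__ span found on stdin, one per line.
spans_bold()   { grep -Po '__.*?__'; }

# Print every *italic* span found on stdin, one per line.
spans_italic() { grep -Po '\*.*?\*'; }

# Count whitespace-separated words on stdin.
count_words()  { grep -o '[^[:space:]]\+' | wc -l; }

# Invented sample with italic nested in bold and bold nested in italic.
sample='__bold with *italic* inside__ and *italic with __bold__ inside*'

bold=$(printf '%s\n' "$sample" | spans_bold | count_words)
italic=$(printf '%s\n' "$sample" | spans_italic | count_words)
both=$(( $(printf '%s\n' "$sample" | spans_bold   | spans_italic | count_words)
       + $(printf '%s\n' "$sample" | spans_italic | spans_bold   | count_words) ))

echo "bold: $bold, italic: $italic, both: $both"
```

For this sample the script prints "bold: 5, italic: 5, both: 2". Note that the delimiters stay attached to the words, so a delimiter separated from its word by a space would be counted as a word of its own.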

Cleaning up a more realistic file

Now, a real Markdown file might have a few surprises (see "assumptions"):

* List item with **bold word**

Line with **bold words and an \* escaped asterisk**

Here is an *italicized* word

And *italics with a **bold** word inside*

And **bold words with *italics* inside**

    Code can have tons of *, ** and _, and we want to ignore them

Also `inline code can have * and ** and _ to be ignored`, right?

which renders as

List item with bold word

Line with bold words and an * escaped asterisk

Here is an italicized word

And italics with a bold word inside

And bold words with italics inside

Code can have tons of *, ** and _, and we want to ignore them

Also inline code can have * and ** and _ to be ignored, right?

One approach is to clean this up with a sed script:

/^$/d                           # delete empty lines
/^    /d                        # delete code lines (starting with 4 spaces)
s/`[^`]*`//g                    # remove inline code
/^\* /s/^\* (.*)/\1/            # remove asterisk from list items
s/\\\*//g                       # remove escaped asterisks
s/\\_//g                        # remove escaped underscores
s/`[^`]*`//g                    # remove inline code
s/\*\*/__/g                     # make sure bold uses underscores
s/(^|[^_])_([^_]|$)/\1\*\2/g    # make sure italics use asterisks

With the following result:

$ sed -rf md.sed infile.md
List item with __bold word__
Line with __bold words and an  escaped asterisk__
Here is an *italicized* word
And *italics with a __bold__ word inside*
And __bold words with *italics* inside__
Also , right?

This is now ready for consumption by the commands from the first section.
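As a quick check of the two normalisation rules at the end of the sed script, the sketch below feeds a one-line sample written with the "other" delimiters through just those two substitutions. The sample text is invented, and GNU sed (-r) is assumed.

```shell
#!/bin/bash
# Invented sample: bold written with ** and italic written with _.
sample='Some **bold text** and some _italic text_ here'

normalized=$(printf '%s\n' "$sample" | sed -r '
    s/\*\*/__/g                     # ** becomes __ (bold)
    s/(^|[^_])_([^_]|$)/\1*\2/g     # a lone _ becomes * (italic)
')

echo "$normalized"
```

This prints "Some __bold text__ and some *italic text* here", which matches the delimiters the counting commands expect.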

Putting it all together

Everything in one script that takes the Markdown file name as an argument:

#!/bin/bash

fname="$1"
tempfile="$(mktemp)"

# Clean up the input so that bold is always __ and italic is always *
sed -r '
    /^$/d
    /^    /d
    s/`[^`]*`//g
    /^\* /s/^\* (.*)/\1/
    s/\\\*//g
    s/\\_//g
    s/`[^`]*`//g
    s/\*\*/__/g
    s/(^|[^_])_([^_]|$)/\1\*\2/g
' "$fname" > "$tempfile"

# Count the words inside bold spans, italic spans, and nested spans
bold=$(grep -Po '__.*?__' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
italic=$(grep -Po '\*.*?\*' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
both=$((
    $(grep -Po '__.*?__' "$tempfile" |
        grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l)
    +
    $(grep -Po '\*.*?\*' "$tempfile" |
        grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l)
))

rm -f "$tempfile"

echo "bold words: $bold"
echo "italic words: $italic"
echo "bold and italic words: $both"

which can be used like this:

$ ./wordcount infile.md
bold words: 14
italic words: 8
bold and italic words: 2

Shortcomings

This can be tripped up by words containing underscores. Some Markdown flavours ignore these and assume they're part of the word.
I'm sure I've missed a few edge cases in the cleanup.
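The first shortcoming is easy to reproduce. In this sketch (sample text and variable names invented, GNU sed and grep assumed), an identifier containing underscores goes through the italics rule from the cleanup and then through the italic counter:

```shell
#!/bin/bash
# Invented sample: no emphasis at all, just an identifier with underscores.
sample='Call my_helper_function here'

# The italics normalisation rewrites the intra-word underscores to asterisks.
cleaned=$(printf '%s\n' "$sample" | sed -r 's/(^|[^_])_([^_]|$)/\1*\2/g')

# The italic counter then finds a *...* span inside the identifier.
italic=$(printf '%s\n' "$cleaned" | grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l)

echo "$cleaned"
echo "italic words: $italic"
```

The output is "Call my*helper*function here" followed by "italic words: 1", even though nothing in the input is italic.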
