I am not getting Kannada text when I run the Perl script on a file
I have the following code for extracting text from HTML files and writing it to a text file. The HTML contains Kannada text (UTF-8). When the program runs, the text in the output file is not in the proper format; it is unreadable.

    use utf8;
    use HTML::FormatText;
    $string = HTML::FormatText->format_file( 'a.html', leftmargin => 0, rightmargin => 50 );
    open MM, ">t1.txt";
    print MM "$string";

So please help me. How do I handle file formats while processing a file?
If I understand correctly, you want the output file to be UTF-8 encoded, so that the characters of the Kannada language are encoded in the output correctly. Your code is probably trying (and failing) to encode them as ISO-8859-1 instead.
If so, you can make sure the file is opened with a UTF-8 encoding filter.
    use HTML::FormatText;

    open my $htmlfh, '<:encoding(UTF-8)', 'a.html' or die "cannot open a.html: $!";
    my $content = do { local $/; <$htmlfh> };    # read all content from the file
    close $htmlfh;

    my $string = HTML::FormatText->format_string( $content, leftmargin => 0, rightmargin => 50 );

    open my $mm, '>:encoding(UTF-8)', 't1.txt' or die "cannot open t1.txt: $!";
    print $mm $string;

For further reading, I recommend checking out these docs:
A few other notes:
- The "use utf8" line only tells Perl that the script/library itself may contain UTF-8 formatted text. It does not make any changes to how you read or write files.
- Avoid using the two-argument form of open() as in your example. It may allow a malicious user to compromise your system in some cases. (Though the usage in your example happens to be safe.)
- When opening a file, you need to add "or die" afterwards, or failures to read or write the file will be silently ignored.
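
The last two notes can be sketched together as a small helper. This is only an illustration: the name slurp_utf8 and the missing file path are made up for the example and are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as UTF-8 using three-argument open.
sub slurp_utf8 {
    my ($path) = @_;

    # The mode is a separate argument, so characters such as '>' or '|'
    # in $path are treated as part of the file name, never as a mode.
    open my $fh, '<:encoding(UTF-8)', $path
        or die "cannot open $path: $!";

    my $content = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    return $content;
}

# A missing file now raises an error instead of failing silently.
# The path below is an assumed non-existent file, used only for the demo.
my $result = eval { slurp_utf8('no-such-dir/no-such-file.html') };
my $error  = $@;
print "open failed as expected: $error" if !defined $result;
```

Wrapping the call in eval is only there so the sketch can show the error message; in a short script, letting die end the program is usually fine.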
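Update 3/12: I changed the code to read the file in as UTF-8 and then send it to HTML::FormatText. If your a.html file was saved with a BOM character at the start, the earlier version may have done the right thing anyway, but this change should make it always assume UTF-8 for the incoming file.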
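
One related detail, offered as a sketch rather than as part of the original answer: when a file that starts with a BOM is read through the :encoding(UTF-8) layer, the BOM is decoded as the character U+FEFF and stays at the start of the string, so you may want to strip it before formatting. The scratch file below stands in for a.html, and the Kannada word is written with code point escapes so the example does not depend on the encoding of the script itself.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file standing in for a.html: a BOM (U+FEFF) followed by a
# paragraph containing a Kannada word given as code point escapes.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':encoding(UTF-8)';
print {$tmp_fh} "\x{FEFF}<p>\x{0C95}\x{0CA8}\x{0CCD}\x{0CA8}\x{0CA1}</p>";
close $tmp_fh or die "cannot close $path: $!";

# Read it back through a UTF-8 decoding layer, as in the code above.
open my $in, '<:encoding(UTF-8)', $path or die "cannot open $path: $!";
my $content = do { local $/; <$in> };
close $in;

# The BOM is decoded like any other character, so remove it explicitly.
my $had_bom = ( $content =~ s/\A\x{FEFF}// ) ? 1 : 0;
```

After this, $content holds only the markup and text, and can be passed to format_string as before.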