Python - lxml Cleaner to ignore base64 images
I use lxml.html.clean to remove untrusted input from HTML code. I realised that lxml removes data: URIs from the code. I want to insert an image in base64 format (it comes from a database, I have no file), so I need this kind of src value. For instance, take:
    from lxml.html.clean import Cleaner

    cleaner = Cleaner()
    cleaner.clean_html("""
    <img src="http://test.com/img.png"/>
    <img src="data:image/png;base64,aGVsbG8="/>
    """)

The result is '<span><img src="http://test.com/img.png"><img src=""></span>'. The first image is left untouched, but the src of the second one is stripped.
Any idea how to make it accept base64 image data without letting vulnerabilities through?
I was able to reproduce this behavior after installing lxml 3.1.0. Here is a solution based on "monkey patching": it replaces the lookup regex pattern in the lxml.html.clean module so that links containing data:image/.*;base64 are excluded from removal.
    import re
    import lxml
    from lxml.html.clean import Cleaner

    # Replacement pattern, intended to skip data:image/...;base64 links
    new_pattern = r'\s*(?:javascript:|jscript:|livescript:|vbscript:|data:[^(?:image/.+;base64)]+|about:|mocha:)'
    print(new_pattern)
    # Monkey patch: overwrite the module-level regex that Cleaner consults
    lxml.html.clean._javascript_scheme_re = re.compile(new_pattern, re.I)

    cleaner = Cleaner()
    dochtml = """
    <img src="http://test.com/img.png"/>
    <img src="data:image/png;base64,aGVsbG8="/>
    <img src="data:unsafe/contents;base64,aGVsbG8="/>
    <img src="data:text/html;base64,PGh0bWw+PHNjcmlwdCB0eXBlPSJ0ZXh0L2phdmFzY3JpcHQiPmFsZXJ0KCdoaScpPC9zY3JpcHQ+PC9odG1sPg=="/>
    """
    r = cleaner.clean_html(dochtml)
    print(r)

Result:

    <span><img src="http://test.com/img.png">
    <img src="data:image/png;base64,aGVsbG8=">
    <img src="">
    <img src="">
    </span>

The downside of this approach is that it relies on an internal variable name that is not part of the public interface of Cleaner. The module developers may rename the variable or change the regex in a later version.
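One caveat about the pattern above: `[^(?:image/.+;base64)]` is a negated character class, not a negative lookahead, so it only tests the first character after `data:`. A link such as `data:application/...`, whose first letter happens to be in that class, is therefore not matched. Below is a minimal sketch of a lookahead-based alternative, using only the standard `re` module. The names `ORIGINAL`, `STRICTER` and `flagged`, and the choice to allow only PNG, JPEG and GIF, are assumptions for illustration and not part of lxml.

```python
import re

# Pattern from the answer: the bracket expression is a negated character
# class, so only the first character after "data:" is examined.
ORIGINAL = re.compile(
    r'\s*(?:javascript:|jscript:|livescript:|vbscript:'
    r'|data:[^(?:image/.+;base64)]+|about:|mocha:)', re.I)

# Assumed alternative: a negative lookahead that exempts only base64-encoded
# PNG, JPEG and GIF data URIs; every other "data:" link still matches.
STRICTER = re.compile(
    r'\s*(?:javascript:|jscript:|livescript:|vbscript:'
    r'|data:(?!image/(?:png|jpeg|gif);base64,)|about:|mocha:)', re.I)

samples = [
    "data:image/png;base64,aGVsbG8=",
    "data:unsafe/contents;base64,aGVsbG8=",
    "data:application/xhtml+xml;base64,aGVsbG8=",
    "data:image/svg+xml;base64,aGVsbG8=",
]


def flagged(pattern, link):
    """Return True if the pattern finds a script-like scheme in the link."""
    return pattern.search(link) is not None


for link in samples:
    # Columns: link, flagged by ORIGINAL, flagged by STRICTER
    print(link, flagged(ORIGINAL, link), flagged(STRICTER, link))
```

With these samples, the original pattern leaves the `data:application/...` link unflagged, while the lookahead version flags everything except the PNG. Whether SVG or other image types should be allowed is a policy decision; SVG is excluded here because it can carry script.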
To be on the safe side, you could create a URL handler on your web server that returns the image contents from the database by id. In the HTML document it would look like <img src="http://myserver/showimg?id=123213">. This involves adding a lot of additional moving parts (having a web server, etc.), and it won't work if it is undesirable for the whole world to have access to the images.
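The URL-handler idea can be sketched with the standard library alone. Everything here is assumed for illustration: the `images` table layout, the `/showimg` path, the port, and the helper names `load_image` and `make_server`. A real deployment would more likely use your existing web framework and add access control.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Hypothetical schema and sample row, kept in memory for the sketch.
DB = sqlite3.connect(":memory:", check_same_thread=False)
DB.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, mime TEXT, data BLOB)")
DB.execute("INSERT INTO images VALUES (?, ?, ?)", (123213, "image/png", b"hello"))


def load_image(conn, image_id):
    """Return (mime, data) for the given id, or None if there is no such row."""
    cur = conn.execute("SELECT mime, data FROM images WHERE id = ?", (image_id,))
    return cur.fetchone()


class ShowImg(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        ids = parse_qs(url.query).get("id", [])
        found = None
        if url.path == "/showimg" and ids:
            try:
                found = load_image(DB, int(ids[0]))
            except ValueError:
                found = None  # id was not an integer
        if found is None:
            self.send_error(404)
            return
        mime, data = found
        self.send_response(200)
        self.send_header("Content-Type", mime)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(host="127.0.0.1", port=8000):
    """Build the server; call .serve_forever() on the result to run it."""
    return HTTPServer((host, port), ShowImg)
```

Running `make_server().serve_forever()` would then serve the sample row at `/showimg?id=123213`. Because the cleaned HTML only contains an ordinary http: link, the Cleaner does not need to be patched at all.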
Old answer:
It should be possible to configure the Cleaner to keep these tags, but I cannot reproduce your case; it works for me. I'm using Python 2.7.2 and lxml 2.2.8 on win-32. Could you please clarify which Python and lxml versions you have?
I tried to run your example and found that the contents of the second image tag were not removed.