python - lxml Cleaner to ignore base64 image


I use lxml.html.clean to remove untrusted input from HTML code. I realised that lxml strips data: URLs from the code. I want to insert an image in base64 format (it comes from a database, so I have no file), and for that I need the data: URL to survive. For instance, take:

    from lxml.html.clean import Cleaner

    cleaner = Cleaner()
    cleaner.clean_html("""
        <img src="http://test.com/img.png"/>
        <img src="data:image/png;base64,aGVsbG8="/>
    """)

The result is '<span><img src="http://test.com/img.png"><img src=""></span>'. The first image is left untouched, but the src of the second one is emptied.

Any idea how to make it accept the base64 data without letting vulnerabilities through?

I was able to reproduce the behaviour after installing lxml 3.1.0. Here is a solution based on "monkey patching": it replaces the lookup regex pattern in the lxml.html.clean module so that links starting with data:image/...;base64 are excluded from removal.

    import re

    import lxml.html.clean
    from lxml.html.clean import Cleaner

    # data: is flagged unless it is followed by image/<subtype>;base64,
    # (negative lookahead); every other scheme in the list is always flagged.
    new_pattern = r'\s*(?:javascript:|jscript:|livescript:|vbscript:|data:(?!image/[^;,]+;base64,)|about:|mocha:)'
    print(new_pattern)

    # _javascript_scheme_re is a private name inside lxml.html.clean.
    lxml.html.clean._javascript_scheme_re = re.compile(new_pattern, re.I)

    cleaner = Cleaner()
    dochtml = """
        <img src="http://test.com/img.png"/>
        <img src="data:image/png;base64,aGVsbG8="/>
        <img src="data:unsafe/contents;base64,aGVsbG8="/>
        <img src="data:text/html;base64,PGh0bWw+PHNjcmlwdCB0eXBlPSJ0ZXh0L2phdmFzY3JpcHQiPmFsZXJ0KCdoaScpPC9zY3JpcHQ+PC9odG1sPg=="/>
    """
    r = cleaner.clean_html(dochtml)
    print(r)

Result:

    <span><img src="http://test.com/img.png">
        <img src="data:image/png;base64,aGVsbG8=">
        <img src="">
        <img src="">
    </span>
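The pattern can also be checked on its own, without lxml installed. This is only a sketch: it assumes a negative-lookahead pattern that flags data: unless it is followed by image/<subtype>;base64, and the helper name is_flagged is made up for illustration. Stripping whitespace before searching is my assumption about how the cleaner normalises links.

```python
import re

# Assumed pattern: data: is flagged unless followed by image/<subtype>;base64,
SCHEME_RE = re.compile(
    r'\s*(?:javascript:|jscript:|livescript:|vbscript:'
    r'|data:(?!image/[^;,]+;base64,)|about:|mocha:)',
    re.I,
)


def is_flagged(url):
    """Return True if the URL matches the dangerous-scheme pattern."""
    # Remove all whitespace first so "j a v a s c r i p t:" is caught too
    # (assumed to mirror what the cleaner does before matching).
    compact = re.sub(r'\s+', '', url)
    return SCHEME_RE.search(compact) is not None


samples = [
    "http://test.com/img.png",
    "data:image/png;base64,aGVsbG8=",
    "data:unsafe/contents;base64,aGVsbG8=",
    "data:application/javascript;base64,YQ==",
]
for url in samples:
    print(url, "->", "removed" if is_flagged(url) else "kept")
```

Only the data:image/...;base64 URL and the plain http URL should come out as kept; any other data: media type is flagged.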

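The downside of this approach is that it relies on an internal variable name that is not part of the Cleaner's public interface. The module developers may rename the variable or ship an improved version of the regex, and the patch would then silently stop having any effect.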
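One way to make that risk visible is to refuse to patch when the private name is missing or is no longer a compiled regex. A minimal sketch, assuming Python 3.7+ (for re.Pattern); patch_private_regex is a hypothetical helper, not part of lxml, and the stand-in object below only imitates the module so the snippet runs without lxml.

```python
import re
import types


def patch_private_regex(module, name, pattern, flags=0):
    """Replace module.<name> with a compiled regex, failing loudly if the
    attribute is missing or is not a compiled pattern any more."""
    current = getattr(module, name, None)
    if not isinstance(current, re.Pattern):
        raise RuntimeError(
            "%r has no compiled regex named %r; the library internals "
            "probably changed and the patch must be reviewed" % (module, name)
        )
    setattr(module, name, re.compile(pattern, flags))
    return current  # the previous pattern, in case it has to be restored


# Stand-in for lxml.html.clean (hypothetical, for demonstration only).
fake_module = types.SimpleNamespace(_javascript_scheme_re=re.compile(r'data:'))
old = patch_private_regex(
    fake_module, '_javascript_scheme_re',
    r'data:(?!image/[^;,]+;base64,)', re.I,
)
print(old.pattern, '->', fake_module._javascript_scheme_re.pattern)
```

With lxml itself the call would take lxml.html.clean as the first argument. If a later release drops the name, the application then fails at start-up instead of quietly cleaning with the library's own pattern.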

To be on the safe side, you could create a URL handler on your web server that returns the image contents from the database by id. In the HTML document the image would then be <img src="http://myserver/showimg?id=123213">. This involves adding lots of additional moving parts (running a web server etc.), and it won't work if it is undesirable for the whole world to have access to the images.
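Such a handler could look roughly like the sketch below, using only the standard library. Everything specific here is an assumption for illustration: an SQLite database with a table images(id, mime_type, data), the file name images.db, and port 8000. It has no access control at all, so the caveat about the whole world seeing the images applies as is.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def parse_image_id(path):
    """Extract the integer id from a path like /showimg?id=123213, else None."""
    url = urlparse(path)
    if url.path != "/showimg":
        return None
    ids = parse_qs(url.query).get("id", [])
    if len(ids) != 1 or not ids[0].isdecimal():
        return None
    return int(ids[0])


def load_image(conn, image_id):
    """Return (mime_type, data) for the id, or None if no such row exists."""
    return conn.execute(
        "SELECT mime_type, data FROM images WHERE id = ?", (image_id,)
    ).fetchone()


def make_handler(conn):
    """Build a request handler class bound to the given database connection."""
    class ShowImgHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            image_id = parse_image_id(self.path)
            found = load_image(conn, image_id) if image_id is not None else None
            if found is None:
                self.send_error(404)
                return
            mime_type, data = found
            self.send_response(200)
            self.send_header("Content-Type", mime_type)
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)
    return ShowImgHandler


def serve(db_path="images.db", port=8000):
    """Serve images until interrupted (single-threaded, local interface only)."""
    conn = sqlite3.connect(db_path)
    HTTPServer(("127.0.0.1", port), make_handler(conn)).serve_forever()
```

Calling serve() starts the server. The cleaned HTML then only ever contains ordinary http: image URLs, so the cleaner needs no patching.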

Old answer:

It should be possible to configure the Cleaner to keep these tags, but I cannot reproduce your case: it works for me. I'm using Python 2.7.2 and lxml 2.2.8 on win-32. Could you please clarify which Python and lxml versions you have?

I tried to run your example, and the contents of the second image tag were not removed.

