html - Python scraping pdf from URL


I want to scrape text from the URL "http://www.nycgo.com/venues/thalia-restaurant#menu". The text I'm interested in is in the 'menu' tab on the page. I tried BeautifulSoup to get all the text on the page, but the return value of the following code misses all the text in the menu.

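    import urllib2
    from bs4 import BeautifulSoup as bs  # assumed import; the post does not show how bs was defined

    html = urllib2.urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
    html = html.read()
    soup = bs(html)
    print soup.get_text()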

It seems the content of the menu is part of the HTML on the page when I inspect the elements of the menu content. I did notice that when physically browsing the page, it takes several seconds for the menu to load. I'm not sure if that's why the code above fails to get the menu content.

Any insight is appreciated.

While soup.get_text() will return all of the text in the HTML document (the webpage), the problem here is that the menu is embedded in the page as a PDF, which Beautiful Soup cannot access. The actual PDF file is defined in the page's JavaScript as follows:

    {
        name: "menu",
        show: Boolean(1),
        url: "/assets/files/programs/rw/2016w/thalia-restaurant.pdf"
    }

The simplest way to extract this is to use regular expressions. While that is normally a bad idea for parsing HTML, here you're looking for one very specific thing: a file path, wrapped in "quotes" and ending in .pdf. The following code will find it and extract the URL:

    import re
    from urllib import urlopen

    html = urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
    html_doc = html.read()

    # Find the first double-quoted value that ends in .pdf
    match = re.search(b'\"(.*?\.pdf)\"', html_doc)
    pdf_url = "http://www.nycgo.com" + match.group(1).decode('utf8')

Now pdf_url is:

u'http://www.nycgo.com/assets/files/programs/rw/2016w/thalia-restaurant.pdf' 
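A hedged variant of the same lookup for Python 3 is sketched below. It uses a tighter pattern that cannot run past a closing quote, and `urljoin` so the host is not hard-coded. The function name `find_pdf_url` and the sample markup are illustrative assumptions, not taken from the page itself.

```python
import re
from urllib.parse import urljoin

# Match a double-quoted value that ends in .pdf; [^"]* keeps the match
# inside a single pair of quotes.
PDF_PATTERN = re.compile(rb'"([^"]*\.pdf)"')


def find_pdf_url(html_doc, base_url):
    """Return the absolute URL of the first quoted .pdf path, or None."""
    match = PDF_PATTERN.search(html_doc)
    if match is None:
        return None
    return urljoin(base_url, match.group(1).decode("utf8"))


# Hypothetical snippet shaped like the JavaScript object quoted earlier.
sample = b'{ name: "menu", url: "/assets/files/programs/rw/2016w/thalia-restaurant.pdf" }'
pdf_url = find_pdf_url(sample, "http://www.nycgo.com/venues/thalia-restaurant")
print(pdf_url)
```

Returning `None` when nothing matches lets the caller decide what to do if the page layout changes, rather than failing with an `AttributeError` on `match.group`.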

However, extracting the text from a PDF is a little trickier. You can download the file first:

    from urllib import urlretrieve
    urlretrieve(pdf_url, "download.pdf")
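On Python 3 the download step could be sketched as follows. The helper names `looks_like_pdf` and `download_pdf` are hypothetical, and the check relies on the fact that PDF files normally begin with the marker `%PDF-`, which helps catch the case where the server returns an HTML error page instead.

```python
from urllib.request import urlopen


def looks_like_pdf(data):
    """Return True if the bytes begin with the usual PDF marker."""
    return data.startswith(b"%PDF-")


def download_pdf(url, path):
    """Fetch url and save it to path; raise ValueError if it is not a PDF."""
    with urlopen(url) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError("response does not look like a PDF: %r" % data[:20])
    with open(path, "wb") as handle:
        handle.write(data)
    return path
```

It would be called as `download_pdf(pdf_url, "download.pdf")`, which needs network access and assumes the file is still hosted at that address.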

Then extract the text using the convert_pdf_to_txt function described in the answer to a related question:

    text = convert_pdf_to_txt("download.pdf")
    print(text)

This returns:

new city  restaurant week  winter 2016  monday - friday 828 eighth avenue new york city, 10019  tel: 212.399.4444  www.restaurantthalia.com  lunch $25 first course creamy polenta fricassee of truffle mushrooms  ... 
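The `convert_pdf_to_txt` function itself is not shown in this post. A minimal stand-in might look like the sketch below, assuming the third-party pdfminer.six package is installed; the referenced answer may well use a different approach. The `tidy` helper is an added, hypothetical convenience for printing the result on one line.

```python
def convert_pdf_to_txt(path):
    """Return the text of the PDF at path using pdfminer.six (assumed installed)."""
    # Imported here so this module still loads where pdfminer.six is absent.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def tidy(text):
    """Collapse every run of whitespace, including newlines, to one space."""
    return " ".join(text.split())
```

With the file downloaded, `print(tidy(convert_pdf_to_txt("download.pdf")))` would print the extracted menu text on a single line.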
