Python - urlopen() not working after using urljoin
I am trying to open a URL and search its HTML for a word using urlopen() (it is just a crawler). It does not work when I call it after urljoin. Is there a way I can do this?
Here is the code:
    while len(urls) > 0:
        htmltext = urlopen(urls[0]).read()
        soup = BeautifulSoup(htmltext)
        for tag in soup.findAll('a', href=True):
            tag['href'] = urljoin(url1, tag['href'])
            #in_code = tag['href'].read()
            in_code = urlopen(tag['href'])
            print(in_code)
            #print(tag['href'])
            htmlcode = tag['href'].find('student')
            if htmlcode > 0:
                file.write(tag['href'] + '\n')
        urls.pop()
    file.close()

This is the error I get:
    C:\Python27\crawler>webcrw.py
    Traceback (most recent call last):
      File "C:\Python27\crawler\webcrw.py", line 21, in <module>
        in_code = urlopen(tag['href'])
      File "C:\Python27\lib\urllib.py", line 86, in urlopen
        return opener.open(url)
      File "C:\Python27\lib\urllib.py", line 207, in open
        return getattr(self, name)(url)
      File "C:\Python27\lib\urllib.py", line 344, in open_http
        h.endheaders(data)
      File "C:\Python27\lib\httplib.py", line 954, in endheaders
        self._send_output(message_body)
      File "C:\Python27\lib\httplib.py", line 814, in _send_output
        self.send(msg)
      File "C:\Python27\lib\httplib.py", line 776, in send
        self.connect()
      File "C:\Python27\lib\httplib.py", line 757, in connect
        self.timeout, self.source_address)
      File "C:\Python27\lib\socket.py", line 553, in create_connection
        for res in getaddrinfo(host, port, 0, SOCK_STREAM):
    IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
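
For reference, "getaddrinfo failed" means the host name in the URL passed to urlopen() could not be resolved. It might help to check each joined link before opening it and to skip links whose fetch fails. Below is a minimal sketch of that idea, not a confirmed fix for the code above. The helper names resolve_http_link and page_contains are hypothetical, and the opener argument stands in for whatever urlopen function the script already imports. It also searches the downloaded page body for the word, which is an assumption about the intent, since the posted code calls find() on the link string itself.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2.7, as in the traceback


def resolve_http_link(base_url, href):
    """Join href onto base_url (hypothetical helper).

    Returns the absolute URL, or None when the result is not an
    http/https URL with a host part (for example mailto: links).
    """
    absolute = urljoin(base_url, href)
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return absolute


def page_contains(url, word, opener):
    """Fetch url with opener and report whether word occurs in the body.

    Returns False when the fetch raises IOError, which covers failed
    host lookups such as the getaddrinfo error in the traceback.
    """
    try:
        body = opener(url).read()
    except IOError:
        return False
    if isinstance(body, bytes):
        # Decode leniently so the substring test works on text.
        body = body.decode("utf-8", "replace")
    return word in body
```

Inside the loop this would be used roughly as link = resolve_http_link(url1, tag['href']), followed by a call to page_contains(link, 'student', urlopen) only when link is not None. Printing each link before opening it should also show which host fails to resolve.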