python - Scrapy: Define items dynamically
As I started to learn Scrapy, I came across a requirement to build the item attributes dynamically. I'm scraping a webpage that has a table structure, and I want to form the item and its field attributes while crawling. I have gone through the example "Scraping data without having to explicitly define each field to be scraped" but couldn't make much of it.
Should I be writing an item pipeline to capture the info dynamically? I have also looked at the item loader function, but if anyone can explain it in detail, that would be really helpful.
Just use a single field as an arbitrary data placeholder. Then, when you want to get the data out, instead of saying for field in item, you say for field in item['row']. You don't need pipelines or loaders to accomplish this task, but both are used extensively for good reason: they are worth learning.
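To make the read-out step concrete, here is a minimal sketch of looping over item['row']. A plain dict stands in for the scraped item (Scrapy items support the same item['row'] lookup), and describe_row is a hypothetical helper name, not part of Scrapy.

```python
def describe_row(item):
    """Return 'name = value' lines for every dynamic field stored under item['row']."""
    lines = []
    # Iterate over the placeholder dict rather than over the item's declared fields.
    for field in sorted(item['row']):
        lines.append('%s = %r' % (field, item['row'][field]))
    return lines


# A plain dict used here as a stand-in for a scraped item (assumption for illustration).
item = {'row': {'foo': 'bar', 'digit': 'my finger', 'appendage': 'hand'}}
for line in describe_row(item):
    print(line)
```

The field names are sorted only so the listing is stable from run to run; nothing in the approach requires it.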
Spider:

    from scrapy.item import Item, Field
    from scrapy.spider import BaseSpider


    class TableItem(Item):
        # Single catch-all field; it holds a dict of dynamically named values
        row = Field()


    class TestSider(BaseSpider):
        name = "tabletest"
        start_urls = ('http://scrapy.org?finger', 'http://example.com/toe')

        def parse(self, response):
            item = TableItem()
            # Values shared by every row
            row = dict(
                foo='bar',
                baz=[123, 'test'],
            )
            row['url'] = response.url
            # Keys can differ from page to page; no Field declaration is needed
            if 'finger' in response.url:
                row['digit'] = 'my finger'
                row['appendage'] = 'hand'
            else:
                row['foot'] = 'might toe'
            item['row'] = row
            return item

Output:
    stav@maia:/srv/stav/scrapie/oneoff$ scrapy crawl tabletest
    2013-03-14 06:55:52-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: oneoff)
    2013-03-14 06:55:52-0600 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'oneoff.spiders', 'SPIDER_MODULES': ['oneoff.spiders'], 'USER_AGENT': 'chromium oneoff 24.0.1312.56 ubuntu 12.04 (24.0.1312.56-0ubuntu0.12.04.1)', 'BOT_NAME': 'oneoff'}
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled item pipelines:
    2013-03-14 06:55:53-0600 [tabletest] INFO: Spider opened
    2013-03-14 06:55:53-0600 [tabletest] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Crawled (200) <GET http://scrapy.org?finger> (referer: None)
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Scraped from <200 http://scrapy.org?finger>
        {'row': {'appendage': 'hand', 'baz': [123, 'test'], 'digit': 'my finger', 'foo': 'bar', 'url': 'http://scrapy.org?finger'}}
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://example.com/toe>
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example> from <GET http://www.iana.org/domains/example/>
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example> (referer: None)
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Scraped from <200 http://www.iana.org/domains/example>
        {'row': {'baz': [123, 'test'], 'foo': 'bar', 'foot': 'might toe', 'url': 'http://www.iana.org/domains/example'}}
    2013-03-14 06:55:53-0600 [tabletest] INFO: Closing spider (finished)
    2013-03-14 06:55:53-0600 [tabletest] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1066,
         'downloader/request_count': 4,
         'downloader/request_method_count/GET': 4,
         'downloader/response_bytes': 3833,
         'downloader/response_count': 4,
         'downloader/response_status_count/200': 2,
         'downloader/response_status_count/302': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 3, 14, 12, 55, 53, 848735),
         'item_scraped_count': 2,
         'log_count/DEBUG': 13,
         'log_count/INFO': 4,
         'response_received_count': 2,
         'scheduler/dequeued': 4,
         'scheduler/dequeued/memory': 4,
         'scheduler/enqueued': 4,
         'scheduler/enqueued/memory': 4,
         'start_time': datetime.datetime(2013, 3, 14, 12, 55, 53, 99635)}
    2013-03-14 06:55:53-0600 [tabletest] INFO: Spider closed (finished)
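For the table case in the question, the row dict could be built from the table's own header cells instead of hard-coded keys. The following is only a sketch: it assumes the header texts and cell texts have already been extracted from the page as lists of strings, and build_row along with the sample column names are hypothetical.

```python
def build_row(headers, cells):
    """Pair each column header with the cell in the same position."""
    row = {}
    for header, cell in zip(headers, cells):
        # Normalise the header text into a dict key, e.g. 'Page URL' -> 'page_url'
        key = header.strip().lower().replace(' ', '_')
        row[key] = cell.strip()
    return row


# Hypothetical header and cell texts, as they might come out of one table row
headers = ['Digit', 'Appendage', 'Page URL']
cells = [' my finger ', 'hand', 'http://scrapy.org?finger']
row = build_row(headers, cells)
# In the spider's parse(), this dict would be assigned to item['row'].
```

Because zip stops at the shorter sequence, a row with fewer cells than headers simply yields fewer keys, which fits the idea of fields that vary from row to row.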