python - Scrapy: Define items dynamically


As I started to learn Scrapy, I came across a requirement to build item attributes dynamically. I'm scraping a webpage that has a table structure, and I want to form the item and its field attributes while crawling. I have gone through the example "Scraping data without having to explicitly define each field to be scraped" but couldn't make much of it.

Should I be writing an item pipeline to capture the info dynamically? I have also looked at the item loader function; if anyone can explain it in detail, that would be really helpful.

Just use a single Field as an arbitrary data placeholder. Then, when you want to get the data out, instead of saying for field in item, you say for field in item['row']. You don't need pipelines or loaders to accomplish this task, but both are used extensively for good reason: they are worth learning.
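As a small sketch of reading the data back out: Scrapy items support dict-style access, so a plain dict stands in for the item here (an assumption, made so the snippet runs without Scrapy installed). The helper name `iter_row_fields` is hypothetical and not part of Scrapy.

```python
def iter_row_fields(item):
    """Yield (name, value) pairs from the arbitrary-data placeholder field."""
    # 'row' is the single declared field; its value is an ordinary dict.
    for name, value in sorted(item['row'].items()):
        yield name, value


# Stand-in for a scraped item; an item with a 'row' field is read the same way.
item = {'row': {'foo': 'bar', 'digit': 'my finger', 'appendage': 'hand'}}

for name, value in iter_row_fields(item):
    print('%s = %s' % (name, value))
```

Sorting the keys is only there to make the output order predictable; it can be dropped if order does not matter.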

Spider:

    from scrapy.item import Item, Field
    from scrapy.spider import BaseSpider  # Scrapy 0.17; later releases use scrapy.Spider


    class TableItem(Item):
        # Single field holding an arbitrary dict of scraped data
        row = Field()


    class TestSider(BaseSpider):
        name = "tabletest"
        start_urls = ('http://scrapy.org?finger', 'http://example.com/toe')

        def parse(self, response):
            item = TableItem()

            row = dict(
                foo='bar',
                baz=[123, 'test'],
            )
            row['url'] = response.url

            # The keys can differ from one page to the next
            if 'finger' in response.url:
                row['digit'] = 'my finger'
                row['appendage'] = 'hand'
            else:
                row['foot'] = 'might toe'

            item['row'] = row

            return item

Output:

    stav@maia:/srv/stav/scrapie/oneoff$ scrapy crawl tabletest
    2013-03-14 06:55:52-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: oneoff)
    2013-03-14 06:55:52-0600 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'oneoff.spiders', 'SPIDER_MODULES': ['oneoff.spiders'], 'USER_AGENT': 'chromium oneoff 24.0.1312.56 ubuntu 12.04 (24.0.1312.56-0ubuntu0.12.04.1)', 'BOT_NAME': 'oneoff'}
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled item pipelines:
    2013-03-14 06:55:53-0600 [tabletest] INFO: Spider opened
    2013-03-14 06:55:53-0600 [tabletest] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2013-03-14 06:55:53-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Crawled (200) <GET http://scrapy.org?finger> (referer: None)
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Scraped from <200 http://scrapy.org?finger>
        {'row': {'appendage': 'hand',
                 'baz': [123, 'test'],
                 'digit': 'my finger',
                 'foo': 'bar',
                 'url': 'http://scrapy.org?finger'}}
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://example.com/toe>
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example> from <GET http://www.iana.org/domains/example/>
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example> (referer: None)
    2013-03-14 06:55:53-0600 [tabletest] DEBUG: Scraped from <200 http://www.iana.org/domains/example>
        {'row': {'baz': [123, 'test'],
                 'foo': 'bar',
                 'foot': 'might toe',
                 'url': 'http://www.iana.org/domains/example'}}
    2013-03-14 06:55:53-0600 [tabletest] INFO: Closing spider (finished)
    2013-03-14 06:55:53-0600 [tabletest] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1066,
         'downloader/request_count': 4,
         'downloader/request_method_count/GET': 4,
         'downloader/response_bytes': 3833,
         'downloader/response_count': 4,
         'downloader/response_status_count/200': 2,
         'downloader/response_status_count/302': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 3, 14, 12, 55, 53, 848735),
         'item_scraped_count': 2,
         'log_count/DEBUG': 13,
         'log_count/INFO': 4,
         'response_received_count': 2,
         'scheduler/dequeued': 4,
         'scheduler/dequeued/memory': 4,
         'scheduler/enqueued': 4,
         'scheduler/enqueued/memory': 4,
         'start_time': datetime.datetime(2013, 3, 14, 12, 55, 53, 99635)}
    2013-03-14 06:55:53-0600 [tabletest] INFO: Spider closed (finished)
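The spider in this answer hard-codes its row for demonstration. For the table described in the question, the field names could instead come from the header cells. The following is a minimal sketch under assumptions: `build_row` is a hypothetical helper, and the header and cell strings are made-up sample data; in a real spider they would be extracted from the response with selectors.

```python
def build_row(headers, cells):
    """Pair each header with its cell so field names are decided at crawl time."""
    # Normalize header text into usable keys, e.g. 'Unit Price ' -> 'unit_price'.
    keys = [h.strip().lower().replace(' ', '_') for h in headers]
    # zip() stops at the shorter sequence, so a ragged row does not raise.
    return dict(zip(keys, (c.strip() for c in cells)))


# Made-up sample data standing in for text pulled out of header and data cells.
headers = ['Name', 'Unit Price ', 'In Stock']
cells = [' Widget', '9.99', 'yes ']

row = build_row(headers, cells)
print(row)
```

The resulting dict would then be assigned to item['row'], so the declared item still needs only the one placeholder field.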
