IBM Bluemix - loading web URL data into Apache Spark in a web app


I am trying to retrieve information from the EPA for our web app, which needs to utilize IBM Bluemix and Apache Spark. The information we are gathering from the EPA is this:

https://aqs.epa.gov/api and ftp://ftp.cdc.noaa.gov/datasets/ncep.reanalysis.dailyavgs/surface/

But we are not only gathering historical data; we also want to keep the data current by inserting new data into our web app every hour. Concerning this, I have a few questions:

1) Do we need to set up HDFS to store the data? Or can we just retrieve the data by URL and store it in a DataFrame? IBM Bluemix is said to provide 5 GB of storage, so how does one utilize that to store the historical data and also the data updated every hour?

2) If we are going to update the data every hour by inserting new data into the data storage / DataFrame, should we still use Spark Streaming? If yes, how do we use Spark Streaming with URL data? A lot of the resources I see online are only useful if one has HDFS or a formal database.

What I am doing now is importing the URLs through pandas:

    import urllib2

    url = "https://aqs.epa.gov/api/rawdata?user=sogun3@gmail.com&pw=baycrane57&format=json&param=44201&bdate=20110501&edate=20110501&state=37&county=063"
    content = urllib2.urlopen(url).read()
    print content

However, if I use this method, it means Spark needs to be running 24-7 to ensure the updated data is utilized. How does one configure Spark to run 24-7? Or is there a better method to process the data and put it nicely in a DataFrame so that the data can be accessed later?

Also, in the web app, can one still use IPython for data processing? Or is IPython only for interacting with the data and understanding it experimentally?

Thanks a lot!

You have options ;-) If you need to read the source EPA data and process it before use in your web app, you can use the Spark service to ETL (extract, transform, load) the source data from the EPA web site, manipulate or wrangle the data into the shape and size you want, and save it to a storage service such as Bluemix Object Storage. Your web app would then read the data, in the format you want, directly from Object Storage.

However, if the source EPA data is largely in the format you want to use in your web app, you can create RDDs directly from the web site and pull in the data as and when you need it. These datasets look small from a quick peek, so I don't think you need to worry about Spark pulling them directly into memory to work on them; i.e. there is no need to try to store them locally with Spark in the Bluemix service cluster. Besides, there is no HDFS provided with the Spark service; as mentioned earlier, use an external storage service.

Re: "IBM Bluemix is said to provide 5 GB of storage": that is intended for storing personal and 3rd-party Spark libraries and such.
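As a rough sketch of the "create RDDs directly from the web site" option, something like the following could work. It is written for Python 3 (`urllib.request` instead of `urllib2`). The helper names `build_url`, `parse_records`, `fetch_records` and `to_rdd` are made up for illustration, and the `"Data"` key is only an assumption about where the records sit in the JSON response, so check it against what the EPA API actually returns. `sc` is an existing SparkContext, such as the one a Spark notebook normally provides.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://aqs.epa.gov/api/rawdata"  # endpoint taken from the question


def build_url(params, base_url=BASE_URL):
    """Return the query URL for a dict of query parameters."""
    return base_url + "?" + urlencode(params)


def parse_records(payload, records_key="Data"):
    """Turn a JSON response body into a list of records.

    Assumption: the body is either a JSON list, or an object holding the
    records in a list under `records_key`. Adjust to the real layout.
    """
    doc = json.loads(payload)
    if isinstance(doc, list):
        return doc
    return list(doc.get(records_key, []))


def fetch_records(params):
    """Download one query result and parse it (needs network access)."""
    with urlopen(build_url(params), timeout=60) as resp:
        return parse_records(resp.read().decode("utf-8"))


def to_rdd(sc, records):
    """Distribute the parsed records using an existing SparkContext."""
    return sc.parallelize(records)
```

In a notebook you would then call something like `to_rdd(sc, fetch_records({...}))` with the same query parameters as in the question. The download happens on the driver, which should be fine for datasets this small.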

Re: "Spark needs to be running 24-7". The Spark service runs 24x7. Your Spark code running on the service will run for as long as you program it to run ;-)
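To make "as long as you program it to run" concrete, here is a minimal sketch of an hourly loop in plain Python. `run_hourly` is a hypothetical helper, not part of Spark or Bluemix, and `job` stands for whatever function pulls and processes the new EPA data. The `max_runs` and `sleep` parameters are only there so the loop can be tried out without waiting an hour.

```python
import time


def run_hourly(job, interval_seconds=3600, max_runs=None, sleep=time.sleep):
    """Call `job()` repeatedly, pausing `interval_seconds` between calls.

    With `max_runs=None` the loop runs until the process is stopped;
    a number limits the loop. Returns how many times `job` was called.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:  # keep the loop alive if one pull fails
            print("job failed:", exc)
        runs += 1
        # Skip the pause after the final run of a limited loop.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

For example, `run_hourly(pull_epa_data)` in a notebook cell would keep pulling once an hour until the cell is interrupted, where `pull_epa_data` is your own function.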

IPython (or Jupyter notebooks) is intended as a REPL for the web. So, yes, it is interactive. In your case, you can write your Spark code in an IPython notebook and have it run for as long as necessary, pulling and processing the EPA data for your web app and storing it in Object Storage. Your web app can then pull the data it needs from Object Storage. It is said that in the future there will be APIs provided for the Spark service, at which point your web app could talk directly to the Spark service; in the meantime, you can make it work with notebooks.

