html - How to scrape data from a table using a loop to get all td data using Python
I'm trying to scrape data from a website, and I'm having a hard time getting the data. I can get the player names, but that's it at this point. I've been trying different things and keep coming up short. Here is a sample of the code I'm trying to go through. Note that there are 2 tables (one for each team), and the class of each player row alternates between "even" and "odd". An example of the HTML file is below, followed by my Python script. I've labeled the parts I want. I'm using Python 2.7.
    <table id="nbagiteamstats" cellpadding="0" cellspacing="0">
      <thead class="nbagiclippers">
        <tr>
          <th colspan="17">los angeles clippers (1-0)</th> <!-- want the team name -->
        </tr>
      </thead>
      <tbody>
        <tr colspan="17">
          <td colspan="17" class="nbagiboxcat"><span>field goals</span><span>rebounds</span></td>
        </tr>
        <tr>
          <td class="nbagiteamhdrstatsnobord" colspan="1"> </td>
          <td class="nbagiteamhdrstats">pos</td>
          <td class="nbagiteamhdrstats">min</td>
          <td class="nbagiteamhdrstats">fgm-a</td>
          <td class="nbagiteamhdrstats">3pm-a</td>
          <td class="nbagiteamhdrstats">ftm-a</td>
          <td class="nbagiteamhdrstats">+/-</td>
          <td class="nbagiteamhdrstats">off</td>
          <td class="nbagiteamhdrstats">def</td>
          <td class="nbagiteamhdrstats">tot</td>
          <td class="nbagiteamhdrstats">ast</td>
          <td class="nbagiteamhdrstats">pf</td>
          <td class="nbagiteamhdrstats">st</td>
          <td class="nbagiteamhdrstats">to</td>
          <td class="nbagiteamhdrstats">bs</td>
          <td class="nbagiteamhdrstats">ba</td>
          <td class="nbagiteamhdrstats">pts</td>
        </tr>
        <tr class="odd">
          <td id="nbagiboxnme" class="b"><a href="/playerfile/paul_pierce/index.html">p. pierce</a></td> <!-- want the player name -->
          <td class="nbagiposition">f</td> <!-- want the position name -->
          <td>14:16</td> <!-- want -->
          <td>1-4</td> <!-- want -->
          <td>1-2</td> <!-- want -->
          <td>2-2</td> <!-- want -->
          <td>+12</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>0</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>3</td> <!-- want -->
          <td>2</td> <!-- want -->
          <td>0</td> <!-- want -->
          <td>0</td> <!-- want -->
          <td>0</td> <!-- want -->
          <td>5</td> <!-- want -->
        </tr>
        <tr class="even">
          <td id="nbagiboxnme" class="b"><a href="/playerfile/blake_griffin/index.html">b. griffin</a></td> <!-- want -->
          <td class="nbagiposition">f</td> <!-- want -->
          <td>26:19</td> <!-- want -->
          <td>5-14</td> <!-- want -->
          <td>0-1</td> <!-- want -->
          <td>1-1</td> <!-- want -->
          <td>+14</td> <!-- want -->
          <td>0</td> <!-- want -->
          <td>5</td> <!-- want -->
          <td>5</td> <!-- want -->
          <td>2</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>11</td> <!-- want -->
        </tr>
        <tr class="odd">
          <td id="nbagiboxnme" class="b"><a href="/playerfile/deandre_jordan/index.html">d. jordan</a></td> <!-- want -->
          <td class="nbagiposition">c</td> <!-- want -->
          <td>26:27</td> <!-- want -->
          <td>6-7</td> <!-- want -->
          <td>0-0</td> <!-- want -->
          <td>3-5</td> <!-- want -->
          <td>+19</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>11</td> <!-- want -->
          <td>12</td> <!-- want -->
          <td>0</td> <!-- want -->
          <td>1</td> <!-- want -->
          <td>0</td> <!-- want -->
          <td>2</td> <!-- want -->
          <td>3</td> <!-- want -->
          <td>0</td> <!-- want -->
          <td>15</td> <!-- want -->
        </tr>
        <!-- and so on; the rows keep alternating class: odd, even, odd -->
        <!-- note there are two tables, one for each team -->
        <!-- this is the table id: <table id="nbagiteamstats" cellpadding="0" cellspacing="0"> -->
This is long, but I wanted to give an example of the classes switching. Here is my Python script. I plan to use a dictionary to save the data once I scrape it successfully.
    import urllib
    import urllib2
    from bs4 import BeautifulSoup
    import re

    gamesforday = ['/games/20151002/denlac/gameinfo.html']
    for game in gamesforday:
        url = "http://www.nba.com/" + game
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        for tr in soup.find_all('table id="nbagiteamstats'):
            tds = tr.find_all('td')
            print tds
Here is a solution. Note that I have a different version of BeautifulSoup, not the one that comes from bs4, but the logic should not be far off. I'm still on Python 2.7 (on Windows in my case).
You will need to fix some nuances for the player sections not displayed above, but I think you'll be able to handle that part :-)
    import urllib
    import urllib2
    # from bs4 import BeautifulSoup
    from BeautifulSoup import BeautifulSoup
    import re

    gamesforday = ['/games/20151002/denlac/gameinfo.html']
    for game in gamesforday:
        url = "http://www.nba.com/" + game
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        # fetch the tables we are interested in
        tables = soup.findAll(id="nbagiteamstats")
        for table in tables:
            team_name = table.thead.tr.th.text
            # keep only the rows (tr) whose class is odd or even
            rows = [x for x in table.findAll('tr') if x.get('class', None) in ['odd', 'even']]
            for player in rows:
                # find the row's column based on 'id'
                player_name = player.find('td', attrs={'id': 'nbagiboxnme'}).text
                # find the row's column based on 'class'
                player_position = player.find('td', attrs={'class': 'nbagiposition'}).text
                # collect every td for which no class is defined
                player_numbers = [x.text for x in player.findAll('td', attrs={'class': None})]
                print player_name, player_position, player_numbers
With bs4 (BeautifulSoup4, as I have learned) some modifications had to be made. You still have to handle a few things, but this extracts most of the data you want:
    import urllib
    import urllib2
    from bs4 import BeautifulSoup
    import re

    gamesforday = ['/games/20151002/denlac/gameinfo.html']
    for game in gamesforday:
        url = "http://www.nba.com/" + game
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page, "html.parser")
        # fetch the tables we are interested in
        tables = soup.find_all(id="nbagiteamstats")
        for table in tables:
            team_name = table.thead.tr.th.text
            # collect the odd rows, then the even rows (tr)
            rows = table.find_all(attrs={'class': 'odd'})
            rows.extend(table.find_all(attrs={'class': 'even'}))
            for player in rows:
                # find the row's column based on 'id'
                player_name = player.find('td', attrs={'id': 'nbagiboxnme'}).text
                # find the row's column based on 'class'
                player_position = player.find('td', attrs={'class': 'nbagiposition'}).text
                # collect every td for which no class is defined
                player_numbers = [x.text for x in player.findAll('td', attrs={'class': None})]
                print player_name, player_position, player_numbers
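For the dictionary mentioned in the question, one possible layout is team name, then player name, then stat label. The sketch below is only an illustration under a few assumptions: `STAT_LABELS` and `add_player` are names I made up (they are not part of BeautifulSoup), the labels are copied by hand from the header row of the sample table, and every player row is assumed to carry all 15 unclassed cells. It takes the same three values that the loop prints.

```python
# Column labels copied from the header row of the sample table, minus the
# player-name and "pos" cells, which the scraping loop reads separately.
# (Assumption: both team tables use the same column order.)
STAT_LABELS = ['min', 'fgm-a', '3pm-a', 'ftm-a', '+/-', 'off', 'def',
               'tot', 'ast', 'pf', 'st', 'to', 'bs', 'ba', 'pts']


def add_player(box_score, team_name, player_name, player_position,
               player_numbers):
    """Store one scraped row under box_score[team_name][player_name].

    Hypothetical helper: pairs each label with the cell text at the same
    index. zip() stops at the shorter sequence, so a short row simply
    ends up with fewer keys.
    """
    stats = dict(zip(STAT_LABELS, player_numbers))
    stats['pos'] = player_position
    box_score.setdefault(team_name, {})[player_name] = stats
    return box_score


# Example call using the first player row of the sample HTML.
box_score = {}
add_player(box_score, 'los angeles clippers (1-0)', 'p. pierce', 'f',
           ['14:16', '1-4', '1-2', '2-2', '+12', '1', '0', '1', '1',
            '3', '2', '0', '0', '0', '5'])
```

Inside the scraping loop, the `print` line could be replaced with a call such as `add_player(box_score, team_name, player_name, player_position, player_numbers)`. The values stay as strings here; converting something like `'1-4'` into numbers is left out on purpose.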