SQL task in Google BigQuery.
This blog post will take a minute to read and a few minutes to hesitate. Maybe several hours (like it was for me).
Getting straight to business i ll formulate the task (question asked during one of my interviews).
- Open dataset from Google BigQuery called Baseball, you can find it here or just google it (google baseball open dataset).
-
You will (most likely) see three tables:
- games_post_wide
- games_wide
- schedules
-
Now the question:
Find top pitchers with the highest pitch speed for each team through out all seasons and their best pitch speed.
So you should get a three-column table:
- Team (of the pitcher)
- Pitcher (name and second name)
- Speed (best pitch eg top speed)
That s pretty fast, yeah ? I told you this post will take just a few minutes to read.
Now a little review of my take on this. For me it was a challeging task since i just started getting used to UI and SQL in Google BigQuery. Additionally (but neccessary) getting familiar with the dataset (table + structure) took me some time. After getting the structure into my head i applied the question on top of it and mapped out the top level logic of the query. One last thing was to make UNNEST function work (since PostgreSQL and BigQuery are a bit different).
Here below i post the code of my query. If you are a pro - take a minute to review and send your comments here . I d appreciate it, really. If you feel like my solution could help you learn something please post a comment of appreciation here as well. You can easily plug my script and run the query to get the results.
SELECT _team ,_pitcher ,_speed FROM ( -- #rank of the player by his speed (ignoring ties) SELECT _team ,_pitcher ,_speed ,row_number() OVER ( PARTITION BY _team ORDER BY _speed DESC ) _rnk FROM ( -- #finalizing the table with player-team link SELECT _team ,_pitcher ,max(_speed) _speed FROM ( -- #glueing tables with players and teams SELECT _pitcher ,coalesce(_home_team, _away_team) _team ,_speed FROM ( -- #setting up the table with players SELECT venueId ,pitcherId ,pitcherFirstName || ' ' || pitcherLastName _pitcher ,max(pitchSpeed) _speed FROM ( -- #full table comprising all seasons SELECT venueId ,pitcherId ,pitcherFirstName ,pitcherLastName ,pitchSpeed FROM `bigquery - PUBLIC - data.baseball.games_wide` UNION ALL SELECT venueId ,pitcherId ,pitcherFirstName ,pitcherLastName ,pitchSpeed FROM `bigquery - PUBLIC - data.baseball.games_post_wide` ) _all WHERE pitchSpeed != 0 GROUP BY venueId ,pitcherId ,pitcherFirstName || ' ' || pitcherLastName ) _grp -- #joining table with player-team links LEFT JOIN ( SELECT pitcherId ,venueId -- we need to check here if the pitcher is in the list of the homing team ,CASE WHEN pitcherId IN unnest( [homeFielder1 ,homeFielder2 ,homeFielder3 ,homeFielder4 ,homeFielder5 ,homeFielder6 ,homeFielder7 ,homeFielder8 ,homeFielder9 ,homeFielder10 ,homeFielder11 ,homeFielder12 ,homeBatter1 ,homeBatter2 ,homeBatter3 ,homeBatter4 ,homeBatter5 ,homeBatter6 ,homeBatter7 ,homeBatter8 ,homeBatter9] ) THEN homeTeamName ELSE NULL END _home_team -- we need to check here if the pitcher is in the list of the guest team ,CASE WHEN pitcherId IN unnest( [awayFielder1 ,awayFielder2 ,awayFielder3 ,awayFielder4 ,awayFielder5 ,awayFielder6 ,awayFielder7 ,awayFielder8 ,awayFielder9 ,awayFielder10 ,awayFielder11 ,awayFielder12 ,awayBatter1 ,awayBatter2 ,awayBatter3 ,awayBatter4 ,awayBatter5 ,awayBatter6 ,awayBatter7 ,awayBatter8 ,awayBatter9] ) THEN awayTeamName ELSE NULL END _away_team FROM ( SELECT * FROM `bigquery - PUBLIC - data.baseball.games_wide` UNION ALL SELECT * FROM `bigquery - PUBLIC - data.baseball.games_post_wide` ) t ) _team ON _team.pitcherId = _grp.pitcherId AND _team.venueId = _grp.venueId ) _total GROUP BY _team ,_pitcher ) _filter WHERE 1 = 1 AND _team IS NOT NULL AND _pitcher IS NOT NULL AND _pitcher != '' AND _speed != 0 ) _rank WHERE 1 = 1 AND _rnk = 1 ORDER BY _speed DESC
Now let s break down the code:
- The main idea of my take was to use subquery instead of CTE
- I start with UNIONing two tables to get all seasons data (below #full comment)
- Then i join a table with players to see if they play for Home or Guest team (below #joining comment)
- UNNEST is used to get all possible positions where a pitcher can play
- In the above query i get the max speed for each peatcher and gettting their full names (below #setting comment)
- One step above i use COALESCE to get jsut the team name without Home/Guest attribute (below #glueing comment)
- In another step i GROUP the table (below #finalizing COMMENT)
- Pre last step i use ROW_NUMBER to rank pitchers (below #rank comment)
- Finally (in the very upper SELECT) i return results
Thanks for reading. May the force be with you.