Menu

SQL task in Google BigQuery.

This blog post will take a minute to read and a few minutes to hesitate. Maybe several hours (like it was for me).

Getting straight to business i ll formulate the task (question asked during one of my interviews).

  • Open dataset from Google BigQuery called Baseball, you can find it here or just google it (google baseball open dataset).
  • You will (most likely) see three tables:
    • games_post_wide
    • games_wide
    • schedules
  • Now the question: Find top pitchers with the highest pitch speed for each team through out all seasons and their best pitch speed. So you should get a three-column table:
    • Team (of the pitcher)
    • Pitcher (name and second name)
    • Speed (best pitch eg top speed)

That s pretty fast, yeah ? I told you this post will take just a few minutes to read.

Now a little review of my take on this. For me it was a challeging task since i just started getting used to UI and SQL in Google BigQuery. Additionally (but neccessary) getting familiar with the dataset (table + structure) took me some time. After getting the structure into my head i applied the question on top of it and mapped out the top level logic of the query. One last thing was to make UNNEST function work (since PostgreSQL and BigQuery are a bit different).

Here below i post the code of my query. If you are a pro - take a minute to review and send your comments here . I d appreciate it, really. If you feel like my solution could help you learn something please post a comment of appreciation here as well. You can easily plug my script and run the query to get the results.

SELECT _team
        ,_pitcher
        ,_speed
    FROM (
        -- #rank of the player by his speed (ignoring ties)
        SELECT _team
            ,_pitcher
            ,_speed
            ,row_number() OVER (
                PARTITION BY _team ORDER BY _speed DESC
                ) _rnk
        FROM (
            -- #finalizing the table with player-team link
            SELECT _team
                ,_pitcher
                ,max(_speed) _speed
            FROM (
                -- #glueing tables with players and teams
                SELECT _pitcher
                    ,coalesce(_home_team, _away_team) _team
                    ,_speed
                FROM (
                    -- #setting up the table with players
                    SELECT venueId
                        ,pitcherId
                        ,pitcherFirstName || '  ' || pitcherLastName _pitcher
                        ,max(pitchSpeed) _speed
                    FROM (
                        -- #full table comprising all seasons
                        SELECT venueId
                            ,pitcherId
                            ,pitcherFirstName
                            ,pitcherLastName
                            ,pitchSpeed
                        FROM `bigquery - PUBLIC - data.baseball.games_wide`
                        
                        UNION ALL
                        
                        SELECT venueId
                            ,pitcherId
                            ,pitcherFirstName
                            ,pitcherLastName
                            ,pitchSpeed
                        FROM `bigquery - PUBLIC - data.baseball.games_post_wide`
                        ) _all
                    WHERE pitchSpeed != 0
                    GROUP BY venueId
                        ,pitcherId
                        ,pitcherFirstName || '  ' || pitcherLastName
                    ) _grp
                -- #joining table with player-team links
                LEFT JOIN (
                    SELECT pitcherId
                        ,venueId
                        -- we need to check here if the pitcher is in the list of the homing team
                        ,CASE 
                            WHEN pitcherId IN unnest(
                                                    [homeFielder1		
                                                    ,homeFielder2			
                                                    ,homeFielder3			
                                                    ,homeFielder4			
                                                    ,homeFielder5			
                                                    ,homeFielder6			
                                                    ,homeFielder7			
                                                    ,homeFielder8			
                                                    ,homeFielder9			
                                                    ,homeFielder10			
                                                    ,homeFielder11			
                                                    ,homeFielder12			
                                                    ,homeBatter1			
                                                    ,homeBatter2			
                                                    ,homeBatter3			
                                                    ,homeBatter4			
                                                    ,homeBatter5			
                                                    ,homeBatter6			
                                                    ,homeBatter7			
                                                    ,homeBatter8			
                                                    ,homeBatter9]
                                )
                                THEN homeTeamName
                            ELSE NULL
                            END _home_team
                        -- we need to check here if the pitcher is in the list of the guest team
                        ,CASE 
                            WHEN pitcherId IN unnest(
                                                    [awayFielder1		
                                                    ,awayFielder2		
                                                    ,awayFielder3		
                                                    ,awayFielder4		
                                                    ,awayFielder5		
                                                    ,awayFielder6		
                                                    ,awayFielder7		
                                                    ,awayFielder8		
                                                    ,awayFielder9		
                                                    ,awayFielder10		
                                                    ,awayFielder11		
                                                    ,awayFielder12		
                                                    ,awayBatter1		
                                                    ,awayBatter2		
                                                    ,awayBatter3		
                                                    ,awayBatter4		
                                                    ,awayBatter5		
                                                    ,awayBatter6		
                                                    ,awayBatter7		
                                                    ,awayBatter8		
                                                    ,awayBatter9]
                                )
                                THEN awayTeamName
                            ELSE NULL
                            END _away_team
                    FROM (
                        SELECT *
                        FROM `bigquery - PUBLIC - data.baseball.games_wide`
                        
                        UNION ALL
                        
                        SELECT *
                        FROM `bigquery - PUBLIC - data.baseball.games_post_wide`
                        ) t
                    ) _team ON _team.pitcherId = _grp.pitcherId
                    AND _team.venueId = _grp.venueId
                ) _total
            GROUP BY _team
                ,_pitcher
            ) _filter
        WHERE 1 = 1
            AND _team IS NOT NULL
            AND _pitcher IS NOT NULL
            AND _pitcher != ''
            AND _speed != 0
        ) _rank
    WHERE 1 = 1
        AND _rnk = 1
    ORDER BY _speed DESC
    

Now let s break down the code:

  • The main idea of my take was to use subquery instead of CTE
  • I start with UNIONing two tables to get all seasons data (below #full comment)
  • Then i join a table with players to see if they play for Home or Guest team (below #joining comment)
  • UNNEST is used to get all possible positions where a pitcher can play
  • In the above query i get the max speed for each peatcher and gettting their full names (below #setting comment)
  • One step above i use COALESCE to get jsut the team name without Home/Guest attribute (below #glueing comment)
  • In another step i GROUP the table (below #finalizing COMMENT)
  • Pre last step i use ROW_NUMBER to rank pitchers (below #rank comment)
  • Finally (in the very upper SELECT) i return results
I feel (almost sure) this query is not optimal. Once again if you managed to read down to this line i beleive you may have a suggestion/comment - feel free to comment in the post.
Thanks for reading. May the force be with you.