SQL SELF JOIN Tutorial Example

Hi everyone. So I need help understanding the following example in the SELF JOIN lesson.

SELECT DISTINCT japan_investments.company_name,
japan_investments.company_permalink
FROM tutorial.crunchbase_investments_part1 japan_investments
JOIN tutorial.crunchbase_investments_part1 gb_investments
  ON japan_investments.company_name = gb_investments.company_name
  AND gb_investments.investor_country_code = 'GBR'
  AND gb_investments.funded_at > japan_investments.funded_at
WHERE japan_investments.investor_country_code = 'JPN'
ORDER BY 1

I'm not following how SQL is going through and comparing the rows. For example, from my understanding:

AND gb_investments.funded_at > japan_investments.funded_at 

Can never be true since both of those entries are in the same column/row position. In other words, gb_investments.funded_at  should equal japan_investments.funded_at

Also, I don't see how the WHERE statement can work since all of the rows have already been filtered country_code = 'GBR'. Indeed, if you print all the rows to this query, there are no rows with 'JPN' as the country_code. 

1reply Oldest first
  • Oldest first
  • Newest first
  • Active threads
  • Popular
  • What this code is doing is joining a table to it's self, thus making a Cartesian product of all rows ( https://en.wikipedia.org/wiki/Cartesian_product ) with two sets of values for each output row access from the japan_investments side and gb_investments set of input values.

    The join first limits only for companies with the same name, and that the GB side is from GBR and the gb side was funded after the JP side. Then finally the JP side is from JPN.

    Now the all these clause could have been done in the ON or the WHERE because it's an INNER JOIN that are equivalent (in this case). 

    But the real thing to note is we have access to both the "left" and "right" version of the table values via their aliases, thus can apply filters to the matching rows.

    create table investments (name text, country text, funded_at timestamp);
    insert into investments values ('A','GBR', '2018-05-01'), -- not funed in JPN
        ('B','GBR', '2018-05-01'),('B','JPN', '2018-05-02'), -- funded in JPN after GBR
        ('C','GBR', '2018-05-01'),('C','JPN', '2018-04-01'), -- funced in GBR after JPN (the only answer)
        ('D','JPN', '2018-04-03'); -- funced in GBR after JPN (the only answer)
    
    
    select distinct japan_investments.name
    --,japan_investments.company_permalink
    FROM investments japan_investments
    JOIN investments gb_investments
      ON japan_investments.name = gb_investments.name
      AND gb_investments.country = 'GBR'
      AND gb_investments.funded_at > japan_investments.funded_at
    WHERE japan_investments.country = 'JPN'
    ORDER BY 1

    which gives only the answer C. Given how I inserted the values into the base table in company name pairs you can visualize how the data is getting compared after the  japan_investments.name = gb_investments.name clause has run.

    Reply Like
reply to topic
Like Follow
  • 3 wk agoLast active
  • 1Replies
  • 180Views
  • 3 Following

Create A New Discussion

Share your thoughts