
SQL Interview Questions and Answers Asked in FAANG

Jordan M.

Preparing for an SQL interview at FAANG (Meta, Amazon, Apple, Netflix, Google) requires far more than knowing basic commands or syntax. These companies expect strong analytical thinking, performance awareness and the ability to write efficient SQL for massive datasets.

This article covers the most frequently asked SQL interview questions and answers from FAANG companies like Meta, Amazon, Apple, Netflix and Google. Whether you're a beginner or an experienced data professional, you’ll learn essential SQL concepts, real-world examples and problem-solving strategies used in high-level tech interviews. Perfect for Data Analysts, Data Engineers, Backend Developers and anyone preparing for advanced SQL interview rounds.

1. Histogram of Tweets

You are given a table that contains Twitter tweet data. Your task is to write a SQL query that creates a histogram of how many tweets each user posted in the year 2024.

In simple terms:

  • Count how many tweets each user made in 2024.
  • Then group users by that tweet count.
  • Finally, show how many users fall into each group.

The output should have:

  • tweet_bucket : the number of tweets a user posted in 2024.
  • users_num : the number of users who posted that many tweets.

tweets Table

Column Name  Type
tweet_id     integer
user_id      integer
msg          string
tweet_date   timestamp

Example Input

tweet_id  user_id  msg                                                     tweet_date
214252    111      Am considering taking Tesla private at $420…           09/05/2025 00:00:00
739252    111      Despite the constant negative press covfefe            09/11/2024 00:00:00
846402    111      Following @NickSinghTech on Twitter changed my life!   02/04/2024 00:00:00
241425    254      If the salary is so competitive why won't you tell me… 07/11/2024 00:00:00
231574    148      I no longer have a manager. I can't be managed         08/03/2024 00:00:00

Expected Output

tweet_bucket  users_num
1             2
2             1

Explanation

  • User 111 tweeted two times in 2024.
  • Users 254 and 148 each tweeted once in 2024.

So the “1 tweet” bucket has 2 users and the “2 tweets” bucket has 1 user.

Answer

Try to write the query yourself before checking the answer.

First, count tweets per user in an inner subquery; then build the histogram in the outer query.

SELECT
  tweet_count AS tweet_bucket,
  COUNT(*) AS users_num
FROM (
    -- Count how many tweets each user made in 2024
    SELECT
      user_id,
      COUNT(*) AS tweet_count
    FROM tweets
    WHERE tweet_date >= '2024-01-01'
      AND tweet_date <  '2025-01-01'
    GROUP BY user_id
) AS user_tweets
GROUP BY tweet_count
ORDER BY tweet_count;
  • The inner query counts how many tweets each user made in 2024.
  • The outer query groups those counts and produces the histogram.
Optimized Query Using a CTE

To optimize this query, I focus on two things: reducing the amount of data processed and ensuring the database uses the most efficient access path.

First, I filter the tweets table to only include data from 2024. This dramatically shrinks the dataset early, which helps avoid unnecessary scans on irrelevant years. To support this, I would add a composite or partial index on (user_id, tweet_date) filtered for 2024. This allows the database to jump directly to the needed rows instead of scanning the full table.

Next, I aggregate at the lowest possible level per user. This produces a much smaller intermediate result containing just user_id and their tweet counts. Once I have this compact dataset, generating the histogram simply requires grouping by the tweet count and counting how many users fall into each bucket.

In a high-scale environment like Twitter or Amazon, these steps ensure the query remains performant even when the tweets table contains billions of rows.

WITH user_tweet_counts AS (
    SELECT
        user_id,
        COUNT(*) AS tweet_count
    FROM tweets
    WHERE tweet_date >= '2024-01-01'
      AND tweet_date <  '2025-01-01'
    GROUP BY user_id
)
SELECT
    tweet_count AS tweet_bucket,
    COUNT(*) AS users_num
FROM user_tweet_counts
GROUP BY tweet_count
ORDER BY tweet_count;
  • Filter early: We only look at rows from 2024 in the inner step (CTE).
  • Use index effectively: The (user_id, tweet_date) index with the 2024 filter helps both filtering and grouping.
  • Small intermediate result: We first reduce data to user_id + tweet_count, then build the histogram from that smaller set.
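
As a concrete sketch of the partial index suggested above, assuming PostgreSQL (which supports partial indexes; the index name is illustrative):

-- Partial composite index covering only 2024 rows, so the planner can
-- satisfy both the date filter and the per-user grouping from the index
CREATE INDEX idx_tweets_2024_user
    ON tweets (user_id, tweet_date)
    WHERE tweet_date >= '2024-01-01'
      AND tweet_date <  '2025-01-01';

On engines without partial indexes (for example MySQL), a plain composite index on the same two columns is the closest equivalent.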

2. Identify Candidates with All Required Data Science Skills

You’re given a table that lists each candidate along with the skills they have. A company is hiring for a Data Science position and the role requires candidates who are proficient in all three of the following skills: Python, Tableau and PostgreSQL.

Write an SQL query to find all the candidates who have all three of these required skills. Make sure the results are ordered by the candidate ID in ascending order.

candidates Table

Column Name   Type
candidate_id  integer
skill         varchar

Example Input

candidate_id  skill
123           Python
123           Tableau
123           PostgreSQL
234           R
234           PowerBI
234           SQL Server
345           Python
345           Tableau

Expected Output

candidate_id
123

Explanation
  • Candidate 123 has all three required skills (Python, Tableau, PostgreSQL) -> included.
  • Candidate 345 is missing PostgreSQL -> excluded.
  • Candidate 234 does not have any of the required Data Science skills -> excluded.
  • So only 123 satisfies all the skill requirements.

Answer

First, try to write the query yourself. Before looking at the answer, think through: How do I filter only the required skills? How do I make sure a candidate has all three skills, not just one or two? How do I group and count properly?

To solve this, I first focus on the requirement: we only want candidates who have all three specific skills, Python, Tableau and PostgreSQL.

The data is stored in a row-per-skill format, so I filter the table to only those three skills using a WHERE ... IN clause. Then I group by candidate_id and use HAVING COUNT(DISTINCT skill) = 3 to ensure the candidate has all three required skills, not just one or two. Finally, I order by candidate_id as requested.

On a large dataset, I’d also consider adding an index on (skill, candidate_id) so the database can quickly locate relevant rows and perform the grouping efficiently.

SELECT
  candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(DISTINCT skill) = 3
ORDER BY candidate_id;
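
If the candidates table is large, the supporting index mentioned above might look like this (a sketch assuming PostgreSQL syntax; the index name is illustrative):

-- Leading on skill lets the database seek directly to the three
-- required skill values and read candidate_id from the index itself
CREATE INDEX idx_candidates_skill ON candidates (skill, candidate_id);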

3. Find the Third Transaction for Each Uber User

You are given a table named transactions that records every spend event made by Uber users. Each row represents one transaction.

Your task is to write an SQL query that returns the third transaction for every user, based on transaction date order (oldest -> newest). Only users who have at least three transactions should appear in the output.

transactions Table

Column Name       Type
user_id           integer
spend             decimal
transaction_date  timestamp

Example Input

user_id  spend  transaction_date
111      88.50  01/08/2024 12:30:00
111      54.00  01/10/2024 12:05:00
121      46.00  01/18/2024 12:10:00
145      24.99  01/26/2024 12:20:00
111      95.50  02/05/2024 12:50:33

Expected Output

user_id  spend  transaction_date
111      95.50  02/05/2024 12:50:33

Answers

1. Beginner-Friendly Approach (ROW_NUMBER Subquery)

This approach uses a window function to assign a ranking number to each transaction based on the date. For each individual user, their oldest transaction becomes 1, next becomes 2, next becomes 3 and so on.

Once every transaction is numbered, we simply pick the rows where the rank is 3, because that means it's the third transaction for that user.

SELECT
  user_id,
  spend,
  transaction_date
FROM (
    SELECT
        user_id,
        spend,
        transaction_date,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY transaction_date
        ) AS rn
    FROM transactions
) AS t
WHERE rn = 3;
  • PARTITION BY user_id : Restarts the ranking for each user, so each user’s transactions are ranked separately.
  • ORDER BY transaction_date : This sorts the user’s transactions in chronological order, oldest first.
  • ROW_NUMBER() : Assigns numbers 1, 2, 3, … to transactions within each user’s group.
  • Outer query filters WHERE rn = 3 : This returns only the third transaction for each user.

2. Using CTE (Common Table Expression)

The CTE version applies the same logic but organizes it inside a WITH clause, which makes the query easier to read, maintain and extend, especially when working with large datasets or more complex requirements.

WITH ranked AS (
    SELECT
        user_id,
        spend,
        transaction_date,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY transaction_date
        ) AS rn
    FROM transactions
)
SELECT user_id, spend, transaction_date
FROM ranked
WHERE rn = 3;
  • Use of a CTE (WITH ranked AS …) : Makes it clearer what “ranking per user” means and keeps the final SELECT clean.
  • Window function stays the same : ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) is still the best way to label transaction order.
  • Filtering on rn = 3 : The final result now cleanly selects exactly the third transaction.
  • Index Optimization : Adding an index on (user_id, transaction_date) helps the database efficiently locate all rows for a specific user and sort them by date. This significantly reduces the work required for sorting and partitioning, especially when dealing with large datasets.

In this case, whether I use a subquery or a CTE, the logic and execution plan are effectively the same. The main performance factor is the window function over (PARTITION BY user_id ORDER BY transaction_date) and whether we support it with a good index like (user_id, transaction_date). So in terms of speed, both approaches are equal; the choice between a CTE and a subquery mainly depends on which one makes the query clearer and easier to maintain.
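
As a sketch, that supporting index could be created like this (PostgreSQL syntax assumed; the index name is illustrative):

-- Matches PARTITION BY user_id ORDER BY transaction_date, so rows can
-- be read already grouped and ordered, which can avoid an extra sort
CREATE INDEX idx_transactions_user_date
    ON transactions (user_id, transaction_date);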

4. Find AWS Server Utilization Time

Amazon Web Services (AWS) runs on large fleets of servers. Each server can be started and stopped multiple times, and management wants to understand how long, in total, the servers were running.

You are given a table server_utilization that logs when each server starts or stops.

server_utilization Table

Column Name     Type
server_id       integer
status_time     timestamp
session_status  string

Example Input

server_id  status_time          session_status
1          08/02/2022 10:00:00  start
1          08/04/2022 10:00:00  stop
2          08/17/2022 10:00:00  start
2          08/24/2022 10:00:00  stop

Example Output

total_uptime_days
9

Server 1 ran for 2 days (Aug 2 to Aug 4) and server 2 ran for 7 days (Aug 17 to Aug 24), so the total uptime is 9 days.

Answers

Try to write the query yourself before seeing the answer. Hints to think about:

  • For each start, you need the next status_time (the corresponding stop).
  • You can get the “next row” using a window function like LEAD().
  • For each start -> stop pair, compute the time difference.
  • Sum the durations of all sessions across all servers.
  • Convert the total time into full days.

Once you’ve tried, compare with the solution below.

WITH sessions AS (
    SELECT
        server_id,
        status_time,
        session_status,
        LEAD(status_time) OVER (
            PARTITION BY server_id
            ORDER BY status_time
        ) AS next_status_time
    FROM server_utilization
),
running_intervals AS (
    SELECT
        server_id,
        status_time AS start_time,
        next_status_time AS stop_time
    FROM sessions
    WHERE session_status = 'start'
      AND next_status_time IS NOT NULL
)
SELECT
    SUM(TIMESTAMPDIFF(DAY, start_time, stop_time)) AS total_uptime_days
FROM running_intervals;
Explanation

The first part of the query creates a temporary result called sessions. Inside this step, each event is ordered by time for its server and the LEAD() function is used to look ahead and retrieve the timestamp of the next event. This means that if you are looking at a start event, LEAD() provides the exact time when the same server later stopped. As a result, each start event becomes paired with its corresponding stop event, producing rows shaped like:

server_id  status_time  session_status  next_status_time

In the next part, the query filters this list and keeps only the rows where the event is a “start” and there is a valid next event. This produces a clean list of running intervals. For example, if a server started on August 2nd at 10 AM and stopped on August 4th at 10 AM, this becomes one complete interval from which the uptime can be calculated.

The final step calculates the duration of each interval using TIMESTAMPDIFF(DAY, start_time, stop_time), which returns how many full days passed between the start and stop times. After calculating the durations of all these intervals, the query sums all of them into a single number that represents the total number of days the fleet of servers was operating.

Overall, the query works by identifying start–stop pairs, turning them into running intervals, converting those intervals into days and summing everything together to produce the final total uptime in days.
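
A portability note: TIMESTAMPDIFF is MySQL syntax. In PostgreSQL, a rough equivalent of the final SELECT (assuming the same CTEs as above) subtracts the timestamps and extracts the day component of the resulting interval:

SELECT
    -- stop_time - start_time yields an interval;
    -- EXTRACT(DAY FROM ...) returns its whole-day component
    SUM(EXTRACT(DAY FROM stop_time - start_time)) AS total_uptime_days
FROM running_intervals;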

We can rewrite the query using a single CTE while keeping the exact same logic. In this simplified version, the CTE immediately renames status_time as start_time and the LEAD(status_time) value as end_time, while still keeping session_status so we know which rows are start events. That way, when we come out of the CTE, we already have everything we need: the server ID, the start time, the next end time and the session status.

Because of that, the final SELECT can now operate directly on sessions. We simply filter on session_status = 'start' and compute SUM(TIMESTAMPDIFF(DAY, start_time, end_time)) to get the total uptime in days. Logically, this is the same as the earlier multi-CTE version: we are still pairing each start with the next event using LEAD(), still considering only start rows and still summing the duration between start and end times.

WITH sessions AS (
    SELECT
        server_id,
        status_time AS start_time,
        LEAD(status_time) OVER (
            PARTITION BY server_id
            ORDER BY status_time
        ) AS end_time,
        session_status
    FROM server_utilization
)
SELECT
    SUM(TIMESTAMPDIFF(DAY, start_time, end_time)) AS total_uptime_days
FROM sessions
WHERE session_status = 'start';

5. Find Facebook Pages With No Likes

Facebook hosts millions of Pages that users can follow by clicking the “Like” button. Each page can receive zero, one or many likes from different users. As part of an analytics task, you are required to identify which Pages currently have no likes at all.

pages Table

Column Name  Type
page_id      integer
page_name    varchar

pages Example Input

page_id  page_name
20001    SQL Solutions
20045    Brain Exercises
20701    Tips for Data Analysts

page_likes Table

Column Name  Type
user_id      integer
page_id      integer
liked_date   datetime

page_likes Example Input

user_id  page_id  liked_date
111      20001    04/08/2022 00:00:00
121      20045    03/12/2022 00:00:00
156      20001    07/25/2022 00:00:00

Example Output

page_id
20701

Answers

Try to write the query yourself before checking the answer. Think about: How do we find pages that do not appear in the page_likes table? Should we use a LEFT JOIN or NOT EXISTS? How do we ensure the result is sorted? Once you think it through, compare with the solutions below.

SELECT
    p.page_id
FROM pages p
LEFT JOIN page_likes pl
    ON p.page_id = pl.page_id
WHERE pl.page_id IS NULL
ORDER BY p.page_id;
  • LEFT JOIN keeps all pages from the pages table.
  • For pages that have no matching rows in page_likes, the columns from page_likes become NULL.
  • We check WHERE pl.page_id IS NULL, which gives us exactly the pages that nobody has liked.
Alternative Solution: Using NOT IN

Select all pages whose page_id does not appear in the list of pages that have received likes. This filters out every page that appears in page_likes.

SELECT page_id
FROM pages
WHERE page_id NOT IN (
    SELECT page_id FROM page_likes
)
ORDER BY page_id;

This approach checks whether a page’s ID is not present in the list of pages that have received likes.

Although it looks straightforward, it has a major weakness: if the subquery returns even one NULL, the entire comparison fails and no rows are returned at all, a classic SQL pitfall. This makes NOT IN unreliable unless you manually filter out NULLs in the subquery, which is why most developers avoid it for anti-joins.
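
If NOT IN must be used, the safeguard mentioned above is to filter NULLs out of the subquery explicitly. A minimal sketch:

SELECT page_id
FROM pages
WHERE page_id NOT IN (
    SELECT page_id
    FROM page_likes
    WHERE page_id IS NOT NULL  -- guard against the NULL pitfall
)
ORDER BY page_id;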

Alternative Solution: Using NOT EXISTS (Very Common & Safe)

Use an anti-join with NOT EXISTS to return pages that have no matching like records.

SELECT
    p.page_id
FROM pages p
WHERE NOT EXISTS (
    SELECT 1
    FROM page_likes pl
    WHERE pl.page_id = p.page_id
)
ORDER BY p.page_id;

It is highly reliable because it correctly handles NULLs and stops searching as soon as a matching row is found, making it efficient.

This pattern is widely used at FAANG companies because it clearly expresses intent and scales well with large datasets, especially when page_likes.page_id is indexed.
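
That supporting index can be as simple as this sketch (PostgreSQL syntax assumed; the index name is illustrative):

-- Lets NOT EXISTS probe page_likes by page_id instead of scanning it
CREATE INDEX idx_page_likes_page_id ON page_likes (page_id);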

Final Recommendation
  • Best: NOT EXISTS - safest, cleanest and best for large datasets.
  • Good: LEFT JOIN … IS NULL - also correct but slightly less clear.
  • Avoid: NOT IN - breaks easily with NULLs and is less efficient.