SQL Interview Questions and Answers Asked in FAANG
Preparing for an SQL interview at FAANG (Meta, Amazon, Apple, Netflix, Google) requires far more than knowing basic commands or syntax. These companies expect strong analytical thinking, performance awareness and the ability to write efficient SQL for massive datasets.
This article covers the most frequently asked SQL interview questions and answers from FAANG companies like Meta, Amazon, Apple, Netflix and Google. Whether you're a beginner or an experienced data professional, you’ll learn essential SQL concepts, real-world examples and problem-solving strategies used in high-level tech interviews. Perfect for Data Analysts, Data Engineers, Backend Developers and anyone preparing for advanced SQL interview rounds.
1. Histogram of Tweets
You are given a table that contains Twitter tweet data. Your task is to write a SQL query that creates a histogram of how many tweets each user posted in the year 2024.
In simple terms:
- Count how many tweets each user made in 2024.
- Then group users by that tweet count.
- Finally, show how many users fall into each group.
The output should have:
- tweet_bucket: the number of tweets a user posted in 2024.
- users_num: the number of users who posted that many tweets.
| Column Name | Type |
|---|---|
| tweet_id | integer |
| user_id | integer |
| msg | string |
| tweet_date | timestamp |
Example Input
| tweet_id | user_id | msg | tweet_date |
|---|---|---|---|
| 214252 | 111 | Am considering taking Tesla private at $420… | 09/05/2025 00:00:00 |
| 739252 | 111 | Despite the constant negative press covfefe | 09/11/2024 00:00:00 |
| 846402 | 111 | Following @NickSinghTech on Twitter changed my life! | 02/04/2024 00:00:00 |
| 241425 | 254 | If the salary is so competitive why won't you tell me… | 07/11/2024 00:00:00 |
| 231574 | 148 | I no longer have a manager. I can't be managed | 08/03/2024 00:00:00 |
Expected Output
| tweet_bucket | users_num |
|---|---|
| 1 | 2 |
| 2 | 1 |
Explanation
- User 111 tweeted two times in 2024.
- Users 254 and 148 each tweeted once in 2024.
So the “1 tweet” bucket has 2 users and the “2 tweets” bucket has 1 user.
Answer
Try to write the query yourself before checking the answer.
First, count tweets per user in an inner subquery; then build the histogram in the outer query.
```sql
SELECT
tweet_count AS tweet_bucket,
COUNT(*) AS users_num
FROM (
-- Count how many tweets each user made in 2024
SELECT
user_id,
COUNT(*) AS tweet_count
FROM tweets
WHERE tweet_date >= '2024-01-01'
AND tweet_date < '2025-01-01'
GROUP BY user_id
) AS user_tweets
GROUP BY tweet_count
ORDER BY tweet_count;
```
- The inner query counts how many tweets each user made in 2024.
- The outer query groups those counts and produces the histogram.
To optimize this query, I focus on two things: reducing the amount of data processed and ensuring the database uses the most efficient access path.
First, I filter the tweets table to only include data from 2024. This dramatically shrinks the dataset early, which helps avoid unnecessary scans on irrelevant years. To support this, I would add a composite or partial index on (user_id, tweet_date) filtered for 2024. This allows the database to jump directly to the needed rows instead of scanning the full table.
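A minimal sketch of what that index could look like, assuming PostgreSQL (which supports partial indexes); the index name is illustrative:

```sql
-- Partial composite index covering only 2024 rows, so the planner can
-- satisfy both the date filter and the per-user grouping from the index
CREATE INDEX idx_tweets_2024_user
    ON tweets (user_id, tweet_date)
    WHERE tweet_date >= '2024-01-01'
      AND tweet_date < '2025-01-01';
```

On engines without partial indexes (MySQL, for example), a plain composite index with tweet_date as the leading column still supports the 2024 range filter.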
Next, I aggregate at the lowest possible level per user. This produces a much smaller intermediate result containing just user_id and their tweet counts. Once I have this compact dataset, generating the histogram simply requires grouping by the tweet count and counting how many users fall into each bucket.
In a high-scale environment like Twitter or Amazon, these steps ensure the query remains performant even when the tweets table contains billions of rows.
```sql
WITH user_tweet_counts AS (
SELECT
user_id,
COUNT(*) AS tweet_count
FROM tweets
WHERE tweet_date >= '2024-01-01'
AND tweet_date < '2025-01-01'
GROUP BY user_id
)
SELECT
tweet_count AS tweet_bucket,
COUNT(*) AS users_num
FROM user_tweet_counts
GROUP BY tweet_count
ORDER BY tweet_count;
```
- Filter early: We only look at rows from 2024 in the inner step (CTE).
- Use index effectively: The (user_id, tweet_date) index with the 2024 filter helps both filtering and grouping.
- Small intermediate result: We first reduce data to user_id + tweet_count, then build the histogram from that smaller set.
2. Identify Candidates with All Required Data Science Skills
You’re given a table that lists each candidate along with the skills they have. A company is hiring for a Data Science position and the role requires candidates who are proficient in all three of the following skills: Python, Tableau and PostgreSQL.
Write an SQL query to find all the candidates who have all three of these required skills. Make sure the results are ordered by the candidate ID in ascending order.
| Column Name | Type |
|---|---|
| candidate_id | integer |
| skill | varchar |
Example Input
| candidate_id | skill |
|---|---|
| 123 | Python |
| 123 | Tableau |
| 123 | PostgreSQL |
| 234 | R |
| 234 | PowerBI |
| 234 | SQL Server |
| 345 | Python |
| 345 | Tableau |
Expected Output
| candidate_id |
|---|
| 123 |
Explanation
- Candidate 123 has all three required skills: Python, Tableau, PostgreSQL -> included.
- Candidate 345 is missing PostgreSQL -> excluded.
- Candidate 234 does not have any of the required Data Science skills -> excluded.
- So only 123 satisfies all the skill requirements.
Answer
First, try to write the query yourself. Before looking at the answer, think through: How do I filter only the required skills? How do I make sure a candidate has all three skills, not just one or two? How do I group and count properly?
To solve this, I first focus on the requirement: we only want candidates who have all three specific skills: Python, Tableau and PostgreSQL.
The data is stored in a row-per-skill format, so I filter the table to only those three skills using a WHERE ... IN clause. Then I group by candidate_id and use HAVING COUNT(DISTINCT skill) = 3 to ensure the candidate has all three required skills, not just one or two. Finally, I order by candidate_id as requested.
On a large dataset, I’d also consider adding an index on (skill, candidate_id) so the database can quickly locate relevant rows and perform the grouping efficiently.
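A minimal sketch of that supporting index (standard CREATE INDEX syntax; the index name is illustrative), shown here before the main query:

```sql
-- Lets the engine jump straight to the three relevant skill rows and
-- read candidate_id from the index while grouping
CREATE INDEX idx_candidates_skill_candidate
    ON candidates (skill, candidate_id);
```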
```sql
SELECT
candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(DISTINCT skill) = 3
ORDER BY candidate_id;
```
3. Find the Third Transaction for Each Uber User
You are given a table named transactions that records every spend event made by Uber users. Each row represents one transaction.
Your task is to write an SQL query that returns the third transaction for every user, based on transaction date order (oldest -> newest). Only users who have at least three transactions should appear in the output.
| Column Name | Type |
|---|---|
| user_id | integer |
| spend | decimal |
| transaction_date | timestamp |
Example Input
| user_id | spend | transaction_date |
|---|---|---|
| 111 | 88.50 | 01/08/2024 12:30:00 |
| 111 | 54.00 | 01/10/2024 12:05:00 |
| 121 | 46.00 | 01/18/2024 12:10:00 |
| 145 | 24.99 | 01/26/2024 12:20:00 |
| 111 | 95.50 | 02/05/2024 12:50:33 |
Expected Output
| user_id | spend | transaction_date |
|---|---|---|
| 111 | 95.50 | 02/05/2024 12:50:33 |
Answers
1. Beginner-Friendly Approach (ROW_NUMBER Subquery)
This approach uses a window function to assign a ranking number to each transaction based on the date. For each individual user, their oldest transaction becomes 1, next becomes 2, next becomes 3 and so on.
Once every transaction is numbered, we simply pick the rows where the rank is 3, because that means it's the third transaction for that user.
```sql
SELECT
user_id,
spend,
transaction_date
FROM (
SELECT
user_id,
spend,
transaction_date,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY transaction_date
) AS rn
FROM transactions
) AS t
WHERE rn = 3;
```
- PARTITION BY user_id: Restarts the ranking for each user, so each user’s transactions are ranked separately.
- ORDER BY transaction_date: Sorts each user’s transactions in chronological order, oldest first.
- ROW_NUMBER(): Assigns numbers 1, 2, 3, … to transactions within each user’s group.
- The outer query filters WHERE rn = 3: this returns only the third transaction for each user.
2. Using CTE (Common Table Expression)
The CTE version applies the same logic but organizes it inside a WITH clause, which makes the query easier to read, maintain and extend, especially when working with large datasets or more complex requirements.
```sql
WITH ranked AS (
SELECT
user_id,
spend,
transaction_date,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY transaction_date
) AS rn
FROM transactions
)
SELECT user_id, spend, transaction_date
FROM ranked
WHERE rn = 3;
```
- Use of a CTE (WITH ranked AS …): Makes it clearer what “ranking per user” means and keeps the final SELECT clean.
- Window function stays the same: ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) is still the best way to label transaction order.
- Filtering on rn = 3: The final result cleanly selects exactly the third transaction.
- Index optimization: Adding an index on (user_id, transaction_date) helps the database efficiently locate all rows for a specific user and sort them by date. This significantly reduces the work required for sorting and partitioning, especially when dealing with large datasets.
In this case, whether I use a subquery or a CTE, the logic and execution plan are effectively the same. The main performance factor is the window function over (PARTITION BY user_id ORDER BY transaction_date) and whether we support it with a good index like (user_id, transaction_date). So in terms of speed, both approaches are equal; the choice between a CTE and a subquery mainly depends on which one makes the query clearer and easier to maintain.
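If you want to confirm that on your own system, you could create the supporting index and compare the plans of both versions; a rough sketch, assuming PostgreSQL-style EXPLAIN (the index name is illustrative):

```sql
-- Supporting index for the window function's PARTITION BY + ORDER BY
CREATE INDEX idx_transactions_user_date
    ON transactions (user_id, transaction_date);

-- Inspect the plan for the CTE version; running the same command on the
-- subquery version should produce an essentially identical plan
EXPLAIN ANALYZE
WITH ranked AS (
    SELECT
        user_id,
        spend,
        transaction_date,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY transaction_date
        ) AS rn
    FROM transactions
)
SELECT user_id, spend, transaction_date
FROM ranked
WHERE rn = 3;
```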
4. Find AWS Server Utilization Time
Amazon Web Services (AWS) runs on large fleets of servers. Each server can be started and stopped multiple times, and management wants to understand how long, in total, the servers were running.
You are given a table server_utilization that logs when each server starts or stops.
| Column Name | Type |
|---|---|
| server_id | integer |
| status_time | timestamp |
| session_status | string |
Example Input
| server_id | status_time | session_status |
|---|---|---|
| 1 | 08/02/2022 10:00:00 | start |
| 1 | 08/04/2022 10:00:00 | stop |
| 2 | 08/17/2022 10:00:00 | start |
| 2 | 08/24/2022 10:00:00 | stop |
Expected Output
| total_uptime_days |
|---|
| 9 |
Answers
Try to write the query yourself before seeing the answer. Hints to think about: For each start, you need the next status_time (the corresponding stop). You can get the “next row” using a window function like LEAD(). For each start -> stop pair, compute the time difference. Sum all sessions’ durations across all servers. Convert the total time into full days. Once you’ve tried, compare with the solution below.
```sql
WITH sessions AS (
SELECT
server_id,
status_time,
session_status,
LEAD(status_time) OVER (
PARTITION BY server_id
ORDER BY status_time
) AS next_status_time
FROM server_utilization
),
running_intervals AS (
SELECT
server_id,
status_time AS start_time,
next_status_time AS stop_time
FROM sessions
WHERE session_status = 'start'
AND next_status_time IS NOT NULL
)
SELECT
SUM(TIMESTAMPDIFF(DAY, start_time, stop_time)) AS total_uptime_days
FROM running_intervals;
```
The first part of the query creates a temporary result called sessions. Inside this step, each event is ordered by time for its server and the LEAD() function is used to look ahead and retrieve the timestamp of the next event. This means that if you are looking at a start event, LEAD() provides the exact time when the same server later stopped. As a result, each start event becomes paired with its corresponding stop event, something like:
| server_id | status_time | session_status | next_status_time |
|---|---|---|---|
| 1 | 08/02/2022 10:00:00 | start | 08/04/2022 10:00:00 |
| 1 | 08/04/2022 10:00:00 | stop | NULL |
| 2 | 08/17/2022 10:00:00 | start | 08/24/2022 10:00:00 |
| 2 | 08/24/2022 10:00:00 | stop | NULL |
In the next part, the query filters this list and keeps only the rows where the event is a “start” and there is a valid next event. This produces a clean list of running intervals. For example, if a server started on August 2nd at 10 AM and stopped on August 4th at 10 AM, this becomes one complete interval from which the uptime can be calculated.
The final step calculates the duration of each interval using TIMESTAMPDIFF(DAY, start_time, stop_time), which returns how many full days passed between the start and stop times. After calculating the durations of all these intervals, the query sums all of them into a single number that represents the total number of days the fleet of servers was operating.
Overall, the query works by identifying start–stop pairs, turning them into running intervals, converting those intervals into days and summing everything together to produce the final total uptime in days.
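One portability note: TIMESTAMPDIFF is MySQL syntax. If you were running this on PostgreSQL instead, a rough equivalent (an assumption, not part of the original problem) would be to replace the final SELECT, keeping the running_intervals CTE from above, with something like:

```sql
-- PostgreSQL has no TIMESTAMPDIFF; timestamp subtraction yields an
-- interval, and DATE_PART('day', ...) extracts its whole-day component
SELECT
    SUM(DATE_PART('day', stop_time - start_time)) AS total_uptime_days
FROM running_intervals;
```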
We can rewrite the query using a single CTE while keeping the exact same logic. In this simplified version, the CTE immediately renames status_time as start_time and the LEAD(status_time) value as end_time, while still keeping session_status so we know which rows are start events. That way, when we come out of the CTE, we already have everything we need: the server id, the start time, the next end time and the session status.
Because of that, the final SELECT can operate directly on sessions. We simply filter on session_status = 'start' and compute SUM(TIMESTAMPDIFF(DAY, start_time, end_time)) to get the total uptime in days. Logically, this is the same as the earlier multi-CTE version: we are still pairing each start with the next event using LEAD, still only considering start rows and still summing the duration between start and end times.
```sql
WITH sessions AS (
SELECT
server_id,
status_time AS start_time,
LEAD(status_time) OVER (
PARTITION BY server_id
ORDER BY status_time
) AS end_time,
session_status
FROM server_utilization
)
SELECT
SUM(TIMESTAMPDIFF(DAY, start_time, end_time)) AS total_uptime_days
FROM sessions
WHERE session_status = 'start';
```
5. Find Facebook Pages With No Likes
Facebook hosts millions of Pages that users can follow by clicking the “Like” button. Each page can receive zero, one or many likes from different users. As part of an analytics task, you are required to identify which Pages currently have no likes at all.
pages Table
| Column Name | Type |
|---|---|
| page_id | integer |
| page_name | varchar |
Example Input
| page_id | page_name |
|---|---|
| 20001 | SQL Solutions |
| 20045 | Brain Exercises |
| 20701 | Tips for Data Analysts |
page_likes Table
| Column Name | Type |
|---|---|
| user_id | integer |
| page_id | integer |
| liked_date | datetime |
Example Input
| user_id | page_id | liked_date |
|---|---|---|
| 111 | 20001 | 04/08/2022 00:00:00 |
| 121 | 20045 | 03/12/2022 00:00:00 |
| 156 | 20001 | 07/25/2022 00:00:00 |
Expected Output
| page_id |
|---|
| 20701 |
Answers
Try to write the query yourself before checking the answer. Think about: How do we find pages that do not appear in the page_likes table? Should we use a LEFT JOIN or NOT EXISTS? How do we ensure the result is sorted? Once you think it through, compare with the solutions below.
```sql
SELECT
p.page_id
FROM pages p
LEFT JOIN page_likes pl
ON p.page_id = pl.page_id
WHERE pl.page_id IS NULL
ORDER BY p.page_id;
```
- LEFT JOIN keeps all pages from the pages table.
- For pages that have no matching rows in page_likes, the columns from page_likes become NULL.
- We check for `WHERE pl.page_id IS NULL`, which gives us the pages that nobody has liked.
Select all pages whose page_id does not appear in the list of pages that have received likes. This filters out every page that appears in page_likes.
```sql
SELECT page_id
FROM pages
WHERE page_id NOT IN (
SELECT page_id FROM page_likes
)
ORDER BY page_id;
```
This approach checks whether a page’s ID is not present in the list of pages that have received likes.
Although it looks straightforward, it has a major weakness: if the subquery returns even one NULL, the entire comparison fails and no rows are returned. This makes NOT IN unreliable unless you manually filter out NULLs in the subquery. Because of this pitfall, most developers avoid using NOT IN for anti-joins.
Note: If the subquery returns even one NULL, the entire result becomes empty; this is a classic SQL pitfall.
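If you still want to keep the NOT IN style, a minimal sketch of the NULL-safe variant (same tables as above) looks like this:

```sql
-- Excluding NULLs in the subquery avoids the "NOT IN returns nothing" trap
SELECT page_id
FROM pages
WHERE page_id NOT IN (
    SELECT page_id
    FROM page_likes
    WHERE page_id IS NOT NULL
)
ORDER BY page_id;
```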
Use an anti-join with NOT EXISTS to return pages that have no matching like records.
```sql
SELECT
p.page_id
FROM pages p
WHERE NOT EXISTS (
SELECT 1
FROM page_likes pl
WHERE pl.page_id = p.page_id
)
ORDER BY p.page_id;
```
It is highly reliable because it correctly handles NULLs and stops searching as soon as a matching row is found, making it efficient.
This pattern is widely used at FAANG companies because it clearly expresses intent and scales well with large datasets, especially when page_likes.page_id is indexed.
- Best: NOT EXISTS - safest, cleanest, best for large datasets
- Good: LEFT JOIN … IS NULL - also correct but slightly less clear
- Avoid: NOT IN - breaks easily with NULLs and is less efficient
