SQL Interview Questions and Answers Asked in FAANG
Preparing for an SQL interview at FAANG (Meta, Amazon, Apple, Netflix, Google) requires far more than knowing basic commands or syntax. These companies expect strong analytical thinking, performance awareness and the ability to write efficient SQL for massive datasets.
This article covers the most frequently asked SQL interview questions and answers from FAANG companies like Meta, Amazon, Apple, Netflix and Google. Whether you're a beginner or an experienced data professional, you’ll learn essential SQL concepts, real-world examples and problem-solving strategies used in high-level tech interviews. Perfect for Data Analysts, Data Engineers, Backend Developers and anyone preparing for advanced SQL interview rounds.
1. Histogram of Tweets
You are given a table that contains Twitter tweet data. Your task is to write a SQL query that creates a histogram of how many tweets each user posted in the year 2024.
In simple terms:
- Count how many tweets each user made in 2024.
- Then group users by that tweet count.
- Finally, show how many users fall into each group.
The output should have:
- tweet_bucket: the number of tweets a user posted in 2024.
- users_num: the number of users who posted that many tweets.
| Column Name | Type |
|---|---|
| tweet_id | integer |
| user_id | integer |
| msg | string |
| tweet_date | timestamp |
Example Input
| tweet_id | user_id | msg | tweet_date |
|---|---|---|---|
| 214252 | 111 | Am considering taking Tesla private at $420… | 09/05/2025 00:00:00 |
| 739252 | 111 | Despite the constant negative press covfefe | 09/11/2024 00:00:00 |
| 846402 | 111 | Following @NickSinghTech on Twitter changed my life! | 02/04/2024 00:00:00 |
| 241425 | 254 | If the salary is so competitive why won't you tell me… | 07/11/2024 00:00:00 |
| 231574 | 148 | I no longer have a manager. I can't be managed | 08/03/2024 00:00:00 |
Expected Output
| tweet_bucket | users_num |
|---|---|
| 1 | 2 |
| 2 | 1 |
Explanation
- User 111 tweeted two times in 2024.
- Users 254 and 148 each tweeted once in 2024.
So the “1 tweet” bucket has 2 users and the “2 tweets” bucket has 1 user.
Answer
Try to write the query yourself before checking the answer.
First, count tweets per user in an inner subquery; then build the histogram in the outer query.
```sql
SELECT
tweet_count AS tweet_bucket,
COUNT(*) AS users_num
FROM (
-- Count how many tweets each user made in 2024
SELECT
user_id,
COUNT(*) AS tweet_count
FROM tweets
WHERE tweet_date >= '2024-01-01'
AND tweet_date < '2025-01-01'
GROUP BY user_id
) AS user_tweets
GROUP BY tweet_count
ORDER BY tweet_count;
```
- The inner query counts how many tweets each user made in 2024.
- The outer query groups those counts and produces the histogram.
To optimize this query, I focus on two things: reducing the amount of data processed and ensuring the database uses the most efficient access path.
First, I filter the tweets table to only include data from 2024. This dramatically shrinks the dataset early, which helps avoid unnecessary scans on irrelevant years. To support this, I would add a composite or partial index on (user_id, tweet_date) filtered for 2024. This allows the database to jump directly to the needed rows instead of scanning the full table.
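A minimal sketch of what that index could look like, assuming PostgreSQL (which supports partial indexes); the index name is illustrative:

```sql
-- Partial composite index covering only 2024 rows, so the planner can
-- satisfy both the date filter and the per-user grouping from the index
CREATE INDEX idx_tweets_2024_user
    ON tweets (user_id, tweet_date)
    WHERE tweet_date >= '2024-01-01'
      AND tweet_date < '2025-01-01';
```

On engines without partial indexes (MySQL, for example), a plain composite index with tweet_date as the leading column still supports the 2024 range filter.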
Next, I aggregate at the lowest possible level per user. This produces a much smaller intermediate result containing just user_id and their tweet counts. Once I have this compact dataset, generating the histogram simply requires grouping by the tweet count and counting how many users fall into each bucket.
In a high-scale environment like Twitter or Amazon, these steps ensure the query remains performant even when the tweets table contains billions of rows.
```sql
WITH user_tweet_counts AS (
SELECT
user_id,
COUNT(*) AS tweet_count
FROM tweets
WHERE tweet_date >= '2024-01-01'
AND tweet_date < '2025-01-01'
GROUP BY user_id
)
SELECT
tweet_count AS tweet_bucket,
COUNT(*) AS users_num
FROM user_tweet_counts
GROUP BY tweet_count
ORDER BY tweet_count;
```
- Filter early: We only look at rows from 2024 in the inner step (CTE).
- Use index effectively: The (user_id, tweet_date) index with the 2024 filter helps both filtering and grouping.
- Small intermediate result: We first reduce data to user_id + tweet_count, then build the histogram from that smaller set.
2. Identify Candidates with All Required Data Science Skills
You’re given a table that lists each candidate along with the skills they have. A company is hiring for a Data Science position and the role requires candidates who are proficient in all three of the following skills: Python, Tableau and PostgreSQL.
Write an SQL query to find all the candidates who have all three of these required skills. Make sure the results are ordered by the candidate ID in ascending order.
| Column Name | Type |
|---|---|
| candidate_id | integer |
| skill | varchar |
Example Input
| candidate_id | skill |
|---|---|
| 123 | Python |
| 123 | Tableau |
| 123 | PostgreSQL |
| 234 | R |
| 234 | PowerBI |
| 234 | SQL Server |
| 345 | Python |
| 345 | Tableau |
Expected Output
| candidate_id |
|---|
| 123 |
Explanation
- Candidate 123 has all three required skills: Python, Tableau, PostgreSQL -> included.
- Candidate 345 is missing PostgreSQL -> excluded.
- Candidate 234 does not have any of the required Data Science skills -> excluded.
- So only 123 satisfies all the skill requirements.
Answer
First, try to write the query yourself. Before looking at the answer, think through: How do I filter only the required skills? How do I make sure a candidate has all three skills, not just one or two? How do I group and count properly?
To solve this, I first focus on the requirement: we only want candidates who have all three specific skills: Python, Tableau and PostgreSQL.
The data is stored in a row-per-skill format, so I filter the table to only those three skills using a WHERE ... IN clause. Then I group by candidate_id and use HAVING COUNT(DISTINCT skill) = 3 to ensure the candidate has all three required skills, not just one or two. Finally, I order by candidate_id as requested.
On a large dataset, I’d also consider adding an index on (skill, candidate_id) so the database can quickly locate relevant rows and perform the grouping efficiently.
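A minimal sketch of that supporting index (standard CREATE INDEX syntax; the index name is illustrative), shown here before the main query:

```sql
-- Lets the engine jump straight to the three relevant skill rows and
-- read candidate_id from the index while grouping
CREATE INDEX idx_candidates_skill_candidate
    ON candidates (skill, candidate_id);
```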
```sql
SELECT
candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(DISTINCT skill) = 3
ORDER BY candidate_id;
```
3. Find the Third Transaction for Each Uber User
You are given a table named transactions that records every spend event made by Uber users. Each row represents one transaction.
Your task is to write an SQL query that returns the third transaction for every user, based on transaction date order (oldest -> newest). Only users who have at least three transactions should appear in the output.
| Column Name | Type |
|---|---|
| user_id | integer |
| spend | decimal |
| transaction_date | timestamp |
Example Input
| user_id | spend | transaction_date |
|---|---|---|
| 111 | 88.50 | 01/08/2024 12:30:00 |
| 111 | 54.00 | 01/10/2024 12:05:00 |
| 121 | 46.00 | 01/18/2024 12:10:00 |
| 145 | 24.99 | 01/26/2024 12:20:00 |
| 111 | 95.50 | 02/05/2024 12:50:33 |
Expected Output
| user_id | spend | transaction_date |
|---|---|---|
| 111 | 95.50 | 02/05/2024 12:50:33 |
Answers
1. Beginner-Friendly Approach (ROW_NUMBER Subquery)
This approach uses a window function to assign a ranking number to each transaction based on the date. For each individual user, their oldest transaction becomes 1, next becomes 2, next becomes 3 and so on.
Once every transaction is numbered, we simply pick the rows where the rank is 3, because that means it's the third transaction for that user.
```sql
SELECT
user_id,
spend,
transaction_date
FROM (
SELECT
user_id,
spend,
transaction_date,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY transaction_date
) AS rn
FROM transactions
) AS t
WHERE rn = 3;
```
- PARTITION BY user_id: Restarts the ranking for each user, so each user’s transactions are ranked separately.
- ORDER BY transaction_date: Sorts each user’s transactions in chronological order, oldest first.
- ROW_NUMBER(): Assigns numbers 1, 2, 3, … to transactions within each user’s group.
- The outer query filters WHERE rn = 3: this returns only the third transaction for each user.
2. Using CTE (Common Table Expression)
The CTE version applies the same logic but organizes it inside a WITH clause, which makes the query easier to read, maintain and extend, especially when working with large datasets or more complex requirements.
```sql
WITH ranked AS (
SELECT
user_id,
spend,
transaction_date,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY transaction_date
) AS rn
FROM transactions
)
SELECT user_id, spend, transaction_date
FROM ranked
WHERE rn = 3;
```
- Use of a CTE (WITH ranked AS …): Makes it clearer what “ranking per user” means and keeps the final SELECT clean.
- Window function stays the same: ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) is still the best way to label transaction order.
- Filtering on rn = 3: The final result cleanly selects exactly the third transaction.
- Index optimization: Adding an index on (user_id, transaction_date) helps the database efficiently locate all rows for a specific user and sort them by date. This significantly reduces the work required for sorting and partitioning, especially when dealing with large datasets.
In this case, whether I use a subquery or a CTE, the logic and execution plan are effectively the same. The main performance factor is the window function over (PARTITION BY user_id ORDER BY transaction_date) and whether we support it with a good index like (user_id, transaction_date). So in terms of speed, both approaches are equal; the choice between a CTE and a subquery mainly depends on which one makes the query clearer and easier to maintain.
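If you want to confirm that on your own system, you could create the supporting index and compare the plans of both versions; a rough sketch, assuming PostgreSQL-style EXPLAIN (the index name is illustrative):

```sql
-- Supporting index for the window function's PARTITION BY + ORDER BY
CREATE INDEX idx_transactions_user_date
    ON transactions (user_id, transaction_date);

-- Inspect the plan for the CTE version; running the same command on the
-- subquery version should produce an essentially identical plan
EXPLAIN ANALYZE
WITH ranked AS (
    SELECT
        user_id,
        spend,
        transaction_date,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY transaction_date
        ) AS rn
    FROM transactions
)
SELECT user_id, spend, transaction_date
FROM ranked
WHERE rn = 3;
```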
4. Find AWS Server Utilization Time
Amazon Web Services (AWS) runs on large fleets of servers. Each server can be started and stopped multiple times, and management wants to understand how long, in total, the servers were running.
You are given a table server_utilization that logs when each server starts or stops.
| Column Name | Type |
|---|---|
| server_id | integer |
| status_time | timestamp |
| session_status | string |
Example Input
| server_id | status_time | session_status |
|---|---|---|
| 1 | 08/02/2022 10:00:00 | start |
| 1 | 08/04/2022 10:00:00 | stop |
| 2 | 08/17/2022 10:00:00 | start |
| 2 | 08/24/2022 10:00:00 | stop |
Expected Output
| total_uptime_days |
|---|
| 9 |
Answers
Try to write the query yourself before seeing the answer. Hints to think about: For each start, you need the next status_time (the corresponding stop). You can get the “next row” using a window function like LEAD(). For each start -> stop pair, compute the time difference. Sum all sessions’ durations across all servers. Convert the total time into full days. Once you’ve tried, compare with the solution below.
```sql
WITH sessions AS (
SELECT
server_id,
status_time,
session_status,
LEAD(status_time) OVER (
PARTITION BY server_id
ORDER BY status_time
) AS next_status_time
FROM server_utilization
),
running_intervals AS (
SELECT
server_id,
status_time AS start_time,
next_status_time AS stop_time
FROM sessions
WHERE session_status = 'start'
AND next_status_time IS NOT NULL
)
SELECT
SUM(TIMESTAMPDIFF(DAY, start_time, stop_time)) AS total_uptime_days
FROM running_intervals;
```
The first part of the query creates a temporary result called sessions. Inside this step, each event is ordered by time for its server and the LEAD() function is used to look ahead and retrieve the timestamp of the next event. This means that if you are looking at a start event, LEAD() provides the exact time when the same server later stopped. As a result, each start event becomes paired with its corresponding stop event, something like:
| server_id | status_time | session_status | next_status_time |
|---|---|---|---|
| 1 | 08/02/2022 10:00:00 | start | 08/04/2022 10:00:00 |
| 1 | 08/04/2022 10:00:00 | stop | NULL |
| 2 | 08/17/2022 10:00:00 | start | 08/24/2022 10:00:00 |
| 2 | 08/24/2022 10:00:00 | stop | NULL |
In the next part, the query filters this list and keeps only the rows where the event is a “start” and there is a valid next event. This produces a clean list of running intervals. For example, if a server started on August 2nd at 10 AM and stopped on August 4th at 10 AM, this becomes one complete interval from which the uptime can be calculated.
The final step calculates the duration of each interval using TIMESTAMPDIFF(DAY, start_time, stop_time), which returns how many full days passed between the start and stop times. After calculating the durations of all these intervals, the query sums all of them into a single number that represents the total number of days the fleet of servers was operating.
Overall, the query works by identifying start–stop pairs, turning them into running intervals, converting those intervals into days and summing everything together to produce the final total uptime in days.
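One portability note: TIMESTAMPDIFF is MySQL syntax. If you were running this on PostgreSQL instead, a rough equivalent (an assumption, not part of the original problem) would be to replace the final SELECT, keeping the running_intervals CTE from above, with something like:

```sql
-- PostgreSQL has no TIMESTAMPDIFF; timestamp subtraction yields an
-- interval, and DATE_PART('day', ...) extracts its whole-day component
SELECT
    SUM(DATE_PART('day', stop_time - start_time)) AS total_uptime_days
FROM running_intervals;
```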
We can rewrite the query using a single CTE while keeping the exact same logic. In this simplified version, the CTE immediately renames status_time as start_time and the LEAD(status_time) value as end_time, while still keeping session_status so we know which rows are start events. That way, when we come out of the CTE, we already have everything we need: the server id, the start time, the next end time and the session status.
Because of that, the final SELECT can operate directly on sessions. We simply filter on session_status = 'start' and compute SUM(TIMESTAMPDIFF(DAY, start_time, end_time)) to get the total uptime in days. Logically, this is the same as the earlier multi-CTE version: we are still pairing each start with the next event using LEAD, still only considering start rows and still summing the duration between start and end times.
```sql
WITH sessions AS (
SELECT
server_id,
status_time AS start_time,
LEAD(status_time) OVER (
PARTITION BY server_id
ORDER BY status_time
) AS end_time,
session_status
FROM server_utilization
)
SELECT
SUM(TIMESTAMPDIFF(DAY, start_time, end_time)) AS total_uptime_days
FROM sessions
WHERE session_status = 'start';
```
5. Find Facebook Pages With No Likes
Facebook hosts millions of Pages that users can follow by clicking the “Like” button. Each page can receive zero, one or many likes from different users. As part of an analytics task, you are required to identify which Pages currently have no likes at all.
pages Table
| Column Name | Type |
|---|---|
| page_id | integer |
| page_name | varchar |
Example Input
| page_id | page_name |
|---|---|
| 20001 | SQL Solutions |
| 20045 | Brain Exercises |
| 20701 | Tips for Data Analysts |
page_likes Table
| Column Name | Type |
|---|---|
| user_id | integer |
| page_id | integer |
| liked_date | datetime |
Example Input
| user_id | page_id | liked_date |
|---|---|---|
| 111 | 20001 | 04/08/2022 00:00:00 |
| 121 | 20045 | 03/12/2022 00:00:00 |
| 156 | 20001 | 07/25/2022 00:00:00 |
Expected Output
| page_id |
|---|
| 20701 |
Answers
Try to write the query yourself before checking the answer. Think about: How do we find pages that do not appear in the page_likes table? Should we use a LEFT JOIN or NOT EXISTS? How do we ensure the result is sorted? Once you think it through, compare with the solutions below.
```sql
SELECT
p.page_id
FROM pages p
LEFT JOIN page_likes pl
ON p.page_id = pl.page_id
WHERE pl.page_id IS NULL
ORDER BY p.page_id;
```
- LEFT JOIN keeps all pages from the pages table.
- For pages that have no matching rows in page_likes, the columns from page_likes become NULL.
- We check for `WHERE pl.page_id IS NULL`, which gives us the pages that nobody has liked.
Select all pages whose page_id does not appear in the list of pages that have received likes. This filters out every page that appears in page_likes.
```sql
SELECT page_id
FROM pages
WHERE page_id NOT IN (
SELECT page_id FROM page_likes
)
ORDER BY page_id;
```
This approach checks whether a page’s ID is not present in the list of pages that have received likes.
Although it looks straightforward, it has a major weakness: if the subquery returns even one NULL, the entire comparison fails and no rows are returned. This makes NOT IN unreliable unless you manually filter out NULLs in the subquery. Because of this pitfall, most developers avoid using NOT IN for anti-joins.
Note: If the subquery returns even one NULL, the entire result becomes empty; this is a classic SQL pitfall.
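If you still want to keep the NOT IN style, a minimal sketch of the NULL-safe variant (same tables as above) looks like this:

```sql
-- Excluding NULLs in the subquery avoids the "NOT IN returns nothing" trap
SELECT page_id
FROM pages
WHERE page_id NOT IN (
    SELECT page_id
    FROM page_likes
    WHERE page_id IS NOT NULL
)
ORDER BY page_id;
```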
Use an anti-join with NOT EXISTS to return pages that have no matching like records.
```sql
SELECT
p.page_id
FROM pages p
WHERE NOT EXISTS (
SELECT 1
FROM page_likes pl
WHERE pl.page_id = p.page_id
)
ORDER BY p.page_id;
```
It is highly reliable because it correctly handles NULLs and stops searching as soon as a matching row is found, making it efficient.
This pattern is widely used at FAANG companies because it clearly expresses intent and scales well with large datasets, especially when page_likes.page_id is indexed.
- Best: NOT EXISTS - safest, cleanest, best for large datasets
- Good: LEFT JOIN … IS NULL - also correct but slightly less clear
- Avoid: NOT IN - breaks easily with NULLs and is less efficient
