
Engineering the Next Generation of LinkedIn’s Feed

lizzy grant

LinkedIn previously relied on five separate systems just to decide which posts should appear in your Feed. One system was responsible for identifying trending content, another focused on collaborative filtering and a third handled embedding-based retrieval to match content more intelligently.

Each of these systems operated independently. They had their own infrastructure, separate teams managing them and distinct optimization strategies. While this setup functioned at scale, it created a fundamental problem. The systems were not designed to evolve together. Whenever the Feed team tried to improve one component, it often led to unintended issues in others, making overall optimization extremely difficult.

To address this, LinkedIn made a bold and unconventional decision. Instead of continuing to manage multiple systems, they removed all five and replaced them with a single unified retrieval model powered by large language models (LLMs). This significantly reduced architectural complexity, but it also introduced a new set of challenges that needed to be solved.

The transition raised several critical questions. One of the first was how to make a large language model understand structured profile data, such as user skills, job history and engagement signals. Another challenge was performance: specifically, how to make a transformer-based system deliver predictions in under 50 milliseconds while serving a platform with 1.3 billion users. Finally, there was the issue of training data, where most of the available signals are noisy, consisting largely of passive or ignored interactions.

This article explores how the LinkedIn engineering team approached rebuilding the Feed system from the ground up and the challenges they encountered along the way.

Disclaimer: It is important to note that this discussion is based on publicly shared insights from LinkedIn's engineering team. Please send feedback if you notice any inaccuracies.

The Challenge: Personalization for Over a Billion Users

Every time you open LinkedIn and scroll through your Feed, you are interacting with one of the largest recommendation systems ever built. Behind the scenes, the system must instantly decide which posts to show, balancing content from your connections, people you follow and the broader LinkedIn Economic Graph, which includes suggested and discovery-based content.

Each post you see is the result of a complex decision-making process. The system evaluates signals from your profile—such as your industry, skills, experience and location—and combines them with behavioral signals derived from your interactions over time. These interactions include what you read, like, comment on, revisit or simply scroll past. Just as importantly, the system tracks how these engagement patterns evolve over time.

All of this information is processed in real time to generate a personalized Feed tailored specifically to you. At any given moment, millions of posts are being ranked while the system carefully balances freshness and relevance across a constantly changing pool of content.

Limitations of the Traditional Approach

Historically, LinkedIn relied on multiple retrieval systems working in parallel. These included trending content pipelines, collaborative filtering models and embedding-based retrieval systems. Each system had its own infrastructure, index and optimization strategy.

While this approach enabled the platform to surface diverse content, it also introduced significant engineering complexity. Maintaining multiple systems made it difficult to optimize the Feed holistically. In addition, the ranking model treated each post impression independently, which meant it could not capture the sequential nature of how users actually consume content.

Through experimentation, LinkedIn converged on a new approach: a hybrid system combining a unified LLM-based retrieval layer with a sequential ranking model. This approach not only improves relevance but also uses GPU resources more efficiently.

Unified Retrieval Through Fine-Tuned LLMs

Previously, Feed retrieval depended on a heterogeneous architecture. Content came from multiple sources, including chronological feeds from your network, geographically trending posts, collaborative filtering systems and topic-based retrieval pipelines. While effective, this setup required maintaining separate systems for each retrieval strategy.

The new approach replaces this complexity with a single unified system powered by LLM-generated embeddings. These embeddings are designed to capture a deep understanding of both content and user interests, going far beyond simple keyword matching.

For example, a user interested in electrical engineering who engages with posts about small modular reactors might not be well served by traditional keyword-based systems. However, an LLM-based system understands that these topics are semantically related through concepts like energy systems, infrastructure and power optimization. This deeper understanding comes from the world knowledge encoded in the model.

This capability is particularly valuable in cold-start scenarios. When a new user joins LinkedIn with minimal data, the system can infer likely interests based on profile information such as job title and skills. Unlike traditional systems, it does not need to wait for extensive interaction history to begin delivering relevant content.

At the same time, LinkedIn maintains a strong focus on responsible AI. Models are regularly audited to ensure fairness, so that content from different creators competes equally and the Feed experience remains consistent across audiences.

The key insight here is that LLM embeddings can generalize beyond observed behavior. They allow the system to infer latent interests, making them especially powerful for new users and niche content areas.

From Structured Data to Effective Prompts

One of the major engineering challenges in using LLMs at scale is converting structured data into a format they can understand. LinkedIn addressed this by building a prompt library that transforms structured features into templated text sequences.

For posts, these prompts include information such as format, author details, company, industry, engagement metrics and the content itself. For members, the prompts incorporate profile data, skills, work history, education and a chronological sequence of previously engaged posts.
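To make the templating idea concrete, here is a minimal sketch of how structured post features might be flattened into a text prompt. The field names and layout are illustrative assumptions, not LinkedIn's actual prompt library.

```python
# Hypothetical sketch of a prompt template that renders structured post
# features as a text sequence an LLM can consume. Field names here are
# illustrative assumptions.

def post_prompt(post: dict) -> str:
    """Render a post's structured features as a templated text sequence."""
    lines = [
        f"format: {post['format']}",
        f"author: {post['author']} ({post['author_title']})",
        f"industry: {post['industry']}",
        f"content: {post['content']}",
    ]
    return "\n".join(lines)

example = {
    "format": "article",
    "author": "A. Engineer",
    "author_title": "Staff Engineer",
    "industry": "Energy",
    "content": "Why small modular reactors matter for grid resilience.",
}
print(post_prompt(example))
```

A member prompt would follow the same pattern, appending a chronological list of previously engaged posts after the profile fields.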

A critical discovery emerged when handling numerical features. Initially, raw values like views: 12345 were passed directly into the model. However, LLMs treated these numbers as generic tokens, resulting in poor performance. There was almost no correlation between engagement counts and embedding similarity.

To solve this, numerical values were converted into percentile buckets and wrapped in special tokens. For example, instead of views: 12345, the system used a representation like <view_percentile>71</view_percentile>. This allowed the model to understand relative magnitude: whether a post's popularity was low, average or high.
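The bucketing step can be sketched in a few lines. This is an illustrative reconstruction, not LinkedIn's code; the reference population and token name are assumptions.

```python
# Illustrative sketch: convert a raw count into a percentile bucket
# wrapped in a special token, so the model sees relative magnitude
# instead of an arbitrary numeric string.
import bisect

def percentile_token(value: float, sorted_population: list[float],
                     name: str = "view_percentile") -> str:
    """Map a raw value to its percentile within a reference population."""
    rank = bisect.bisect_right(sorted_population, value)
    pct = int(100 * rank / len(sorted_population))
    return f"<{name}>{pct}</{name}>"

views = sorted([10, 50, 120, 800, 2_000, 9_500, 12_345, 40_000, 90_000, 250_000])
print(percentile_token(12_345, views))  # -> <view_percentile>70</view_percentile>
```

The exact percentile depends on the reference population; in production that population would be computed over the live corpus rather than a fixed sample.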

This change dramatically improved performance, increasing correlation by 30 times and boosting retrieval quality. The same approach was applied to other numerical signals such as engagement rates and recency.

The insight is clear: LLMs do not inherently understand numbers, but with proper encoding, they can learn meaningful representations of magnitude.

Training Dual Encoders at Scale

The retrieval system is built using a dual encoder architecture, where a shared LLM processes both user and content prompts to generate embeddings. These embeddings are then compared using cosine similarity.
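The scoring contract of a dual encoder is simple to sketch. In the snippet below a stub `encode` function stands in for the shared LLM; that stub, and the embedding dimension, are assumptions for illustration only.

```python
# Minimal dual-encoder scoring sketch in NumPy. A real system would run
# a shared LLM over both prompts; `encode` below is a deterministic stub
# standing in for that model (an assumption for illustration).
import numpy as np

def encode(prompt: str, dim: int = 8) -> np.ndarray:
    """Stub encoder: unit-norm pseudo-embedding derived from the text."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

member = encode("member: electrical engineer, interests: energy systems")
post = encode("post: small modular reactors and grid infrastructure")
score = cosine(member, post)
assert -1.0 <= score <= 1.0
```

Because both sides map into the same embedding space, retrieval reduces to a nearest-neighbor search over precomputed post vectors.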

Training this system involves contrastive learning with both easy and hard negatives. Easy negatives are randomly sampled posts that were never shown to the user, providing basic contrast. Hard negatives are more interesting: they are posts that were shown but did not receive engagement.

These hard negatives help the model learn subtle distinctions between content that is somewhat relevant and content that is truly valuable. Even adding a small number of hard negatives resulted in noticeable improvements in retrieval performance.
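An InfoNCE-style objective captures the idea: the engaged post must score higher than both kinds of negatives. The shapes, temperature and noise levels below are illustrative assumptions, not published hyperparameters.

```python
# Sketch of a contrastive (InfoNCE-style) objective mixing easy and
# hard negatives, in NumPy. Values are illustrative assumptions.
import numpy as np

def info_nce(user, positive, negatives, temperature=0.07):
    """Cross-entropy over similarities: the positive vs. all negatives."""
    cands = np.vstack([positive[None, :], negatives])  # (1+N, d)
    sims = cands @ user / temperature                  # (1+N,)
    sims -= sims.max()                                 # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])                           # positive is index 0

rng = np.random.default_rng(0)
d = 16
user = rng.standard_normal(d)
positive = user + 0.1 * rng.standard_normal(d)         # engaged post
easy_negs = rng.standard_normal((4, d))                # never shown
hard_negs = user + 0.5 * rng.standard_normal((2, d))   # shown, no engagement
loss = info_nce(user, positive, np.vstack([easy_negs, hard_negs]))
assert loss > 0
```

Note how the hard negatives are deliberately constructed close to the user vector: they force the model to separate "somewhat relevant" from "actually engaging", which random negatives never would.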

Another important optimization was filtering the user's interaction history. Initially, all viewed posts were included, but this introduced noise and increased computational cost. By focusing only on positively engaged posts, the system improved both efficiency and model quality.

This change reduced memory usage, increased training throughput and accelerated experimentation, ultimately leading to better-performing models.
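The history-filtering step itself is conceptually simple. The interaction schema and the set of actions counted as positive below are assumptions for illustration.

```python
# Sketch of the history-filtering step: keep only positively engaged
# posts before building the member prompt. The schema and the set of
# positive actions are illustrative assumptions.
POSITIVE = {"like", "comment", "share", "save"}

def filter_history(interactions: list[dict]) -> list[dict]:
    """Drop passive views so prompts stay shorter and less noisy."""
    return [i for i in interactions if i["action"] in POSITIVE]

history = [
    {"post": "p1", "action": "view"},
    {"post": "p2", "action": "like"},
    {"post": "p3", "action": "scroll_past"},
    {"post": "p4", "action": "comment"},
]
kept = filter_history(history)
assert [i["post"] for i in kept] == ["p2", "p4"]
```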

Online Serving: Freshness at Scale

To deliver real-time recommendations, the system operates through three continuously running pipelines.

  • The first pipeline generates prompts based on real-time activity, such as new posts or user interactions. These prompts are stored for quick access and also sent for embedding generation.

  • The second pipeline uses GPU clusters to generate embeddings from these prompts. Updates are batched intelligently to balance efficiency and freshness.

  • The third pipeline indexes these embeddings using GPU-accelerated nearest neighbor search. When a user opens their Feed, the system retrieves the most relevant posts in under 50 milliseconds, even from millions of candidates.
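The retrieval contract of the third pipeline can be sketched with a brute-force nearest-neighbor search. Production would use a GPU-accelerated ANN index rather than the exhaustive scan below; the corpus size and dimensions are arbitrary.

```python
# Brute-force nearest-neighbor sketch in NumPy, standing in for the
# GPU-accelerated index described above. The retrieval contract is the
# same: given a user embedding, return the top-k most similar posts.
import numpy as np

rng = np.random.default_rng(42)
index = rng.standard_normal((10_000, 32)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)  # unit vectors

def retrieve(user_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k posts with highest cosine similarity."""
    sims = index @ (user_emb / np.linalg.norm(user_emb))
    return np.argsort(-sims)[:k]

user = rng.standard_normal(32).astype(np.float32)
top = retrieve(user, k=5)
assert len(top) == 5
```

Exhaustive search is O(corpus size) per query; approximate indexes trade a small recall loss for the sub-50-millisecond budget described above.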

This architecture demonstrates that freshness and latency are not conflicting goals. By decoupling the system into independent pipelines, each component can optimize for its own requirements while maintaining overall system responsiveness.

Ranking: Understanding the User Journey

While retrieval selects candidate posts, ranking determines what the user actually sees. Traditional ranking models evaluate each post independently, predicting engagement probabilities.

However, this approach ignores the sequential nature of user behavior. In reality, interactions form a narrative over time: a professional journey shaped by evolving interests.

To address this, LinkedIn developed a Generative Recommender (GR) model that treats user interactions as a sequence. Instead of isolated predictions, it processes a timeline of past engagements to understand patterns and trajectories.

This allows the system to recognize connections between different topics and anticipate future interests, creating a more coherent and personalized Feed experience.

Teaching Transformers to Recommend

The ranking model uses a transformer architecture with causal attention, ensuring that each interaction is interpreted in the context of previous ones. Posts and user actions are interleaved into a single sequence, capturing both what the user saw and how they responded.

The self-attention mechanism allows the model to dynamically weigh different parts of the sequence. Recent interactions may carry more weight, but older interactions can become relevant again depending on context.
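A toy single-head implementation shows what the causal mask does: each position in the interleaved sequence can attend only to itself and earlier positions. Dimensions are illustrative.

```python
# Toy causal self-attention step in NumPy: each position attends only
# to itself and earlier positions, mirroring how the ranking model
# reads an interleaved sequence of posts and actions.
import numpy as np

def causal_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention with a lower-triangular (causal) mask."""
    T, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)                  # (T, T) similarities
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                           # hide future positions
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

seq = np.random.default_rng(1).standard_normal((6, 4))
out = causal_attention(seq)
assert out.shape == (6, 4)
# The first position can only see itself, so its output is its input.
assert np.allclose(out[0], seq[0])
```

A production transformer adds learned query/key/value projections, multiple heads and stacked layers, but the masking principle is identical.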

To manage computational cost, certain features such as affinity scores and aggregated counts are added after the transformer stage using a technique called late fusion. This ensures that only features benefiting from sequential modeling are included in the attention mechanism.
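Late fusion amounts to letting static features bypass the transformer and join its output just before prediction. The feature names and shapes below are assumptions for illustration.

```python
# Late-fusion sketch: sequence-independent features (affinity scores,
# aggregated counts) skip the transformer and are concatenated with its
# output before the final layer. Names and shapes are assumptions.
import numpy as np

def late_fusion(seq_repr: np.ndarray, static_feats: np.ndarray,
                w: np.ndarray, b: float) -> float:
    """Combine the sequential representation with static features in a
    final linear layer, keeping the attention stack small."""
    fused = np.concatenate([seq_repr, static_feats])
    return float(fused @ w + b)

rng = np.random.default_rng(2)
seq_repr = rng.standard_normal(8)   # output of the transformer
static = rng.standard_normal(3)     # e.g. affinity score, counts, recency
w = rng.standard_normal(11)
logit = late_fusion(seq_repr, static, w, 0.0)
assert isinstance(logit, float)
```

The payoff is cost: attention is quadratic in sequence length, so every feature kept out of the sequence shrinks the most expensive part of the model.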

The final prediction layer uses a multi-gate mixture-of-experts architecture, enabling the model to optimize for multiple engagement signals simultaneously while sharing a common representation.
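A minimal multi-gate mixture-of-experts head looks like this: shared experts, with one softmax gate per engagement signal. The two-task setup and all sizes are illustrative assumptions.

```python
# Minimal multi-gate mixture-of-experts head in NumPy: shared experts,
# one softmax gate per task (e.g. like vs. comment probability).
# Sizes and the two-task setup are illustrative assumptions.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mmoe(x, expert_ws, gate_ws, task_ws):
    """Each task mixes the same shared experts with its own gate."""
    experts = np.stack([np.tanh(W @ x) for W in expert_ws])  # (E, h)
    outs = []
    for gw, tw in zip(gate_ws, task_ws):
        gates = softmax(gw @ x)          # (E,) mixture weights per task
        mixed = gates @ experts          # (h,) task-specific blend
        outs.append(float(tw @ mixed))   # task logit
    return outs

rng = np.random.default_rng(3)
d, h, E, tasks = 6, 4, 3, 2
x = rng.standard_normal(d)
expert_ws = [rng.standard_normal((h, d)) for _ in range(E)]
gate_ws = [rng.standard_normal((E, d)) for _ in range(tasks)]
task_ws = [rng.standard_normal(h) for _ in range(tasks)]
logits = mmoe(x, expert_ws, gate_ws, task_ws)
assert len(logits) == tasks
```

Because the experts are shared, the tasks regularize each other, while the per-task gates keep their objectives from interfering.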

Engineering for Production Scale

Deploying such a system at LinkedIn's scale requires overcoming significant infrastructure challenges. Transformer models demand high computational power, particularly GPUs, which must be used efficiently to remain cost-effective.

On the training side, LinkedIn optimized data pipelines, implemented custom CUDA kernels and improved evaluation workflows to accelerate experimentation. On the serving side, they designed a disaggregated architecture separating CPU and GPU workloads.

One key optimization involves computing the user's interaction context once and reusing it across multiple candidate posts. This reduces redundant computation and improves throughput.
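The reuse optimization can be illustrated with a stand-in for the expensive context step. `build_context` below is a mean-pool placeholder for the transformer; the shapes are arbitrary.

```python
# Sketch of the reuse optimization: build the user's interaction
# context once, then apply it to every candidate, instead of
# recomputing it per candidate. `build_context` is a stand-in
# (mean-pooling) for the expensive transformer pass.
import numpy as np

def build_context(history: np.ndarray) -> np.ndarray:
    """Expensive in production (a transformer); mean-pool here."""
    return history.mean(axis=0)

def score_candidates(history: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    context = build_context(history)  # computed once per request
    return candidates @ context       # reused across all candidates

rng = np.random.default_rng(4)
history = rng.standard_normal((50, 16))      # past interactions
candidates = rng.standard_normal((200, 16))  # posts to rank
scores = score_candidates(history, candidates)
assert scores.shape == (200,)
```

With hundreds of candidates per request, amortizing the context computation turns the dominant cost from "candidates times history" into "candidates plus history".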

Additional innovations, such as custom attention kernels and optimized inference pipelines, ensure that the system can deliver predictions at scale while maintaining low latency.

Conclusion

When you open LinkedIn and discover a highly relevant post, perhaps from someone outside your network, it is the result of a deeply sophisticated system working behind the scenes. Transformer models process thousands of interactions, embeddings capture semantic meaning and optimized infrastructure delivers results in milliseconds.

The new Feed is more than just an incremental improvement. It represents a fundamental shift in how recommendation systems are designed, combining semantic understanding with large-scale engineering to create a more personalized and meaningful experience.

As LinkedIn continues to evolve, these innovations will play a central role in helping professionals discover insights, ideas and opportunities that move their careers forward.
