Monday, 23 Dec 2024

Entity-oriented search: The evolution of information retrieval, explained

Learn how modern search engines use entities, context and knowledge graphs to truly understand queries beyond just matching keywords.

We rarely stop to think about the lightning speed of modern information access. Try picturing a time when answers lived only in libraries – it seems archaic now.

Search tools have become so powerful that they grasp the meaning behind your questions, not just the individual words. This capability is the result of an evolution from keyword to entity-oriented search. While it may seem complex, today we are going to break it down.

Think of a simplified world where websites are replaced by books, and answers are found by a team of 1 million dedicated workers. This analogy will help us understand the systems powering entity search, giving you a newfound appreciation for the speed and accuracy we enjoy today.

Through this exercise, you’ll understand:

  • Why search engines started using entities: What problems did they solve?
  • The inner workings of a knowledge graph: How does a search engine populate and use information from the knowledge graph? How can this augment your search results?
  • How can topical authority further augment returned results? 
  • Practical SEO strategies: How to optimize your content for this new landscape.

Let’s build an entity-based search engine: Your library

Imagine you are responsible for a vast library with thousands of books and access to a million diligent workers. Unlike in a normal library, customers want answers to their questions and are not looking for books to read from front to back.

Customers constantly approach with questions (queries), eager for answers. Your mission is to find the information they need as quickly as possible.

For your library to be successful, you’ll need to return better answers that save customers time than other libraries.

Version 1 of your library: Returning based on titles

Let’s imagine someone asks, “how fast is the fastest animal”?

If you were a traditional library you’d begin by scanning titles, hoping for a similarity match. The customer would likely receive a stack of books and it would be their job to read through the books and try to find the answer.

This process may take hours. Not to mention, there could be better books that just don’t get returned because their titles are too unrelated.

Introducing the inverted index

You decide this process is too slow and that this might be a task for your workforce. To accelerate things, you enlist your million-strong workforce to create a comprehensive index.

Instead of focusing on whole books or titles like your original index, they catalog each individual page. Each worker meticulously records every word on a page, along with its location.

The result is what is called an inverted index. The structure looks like this:

Inverted index - Useful application

Now, when a customer asks, “What is the fastest animal?” your team consults the index, pinpoints “fastest” and “animal,” delivering a list of relevant pages and any page that is in both lists.

This mirrors a traditional search engine – we’re finding keywords, but we do not yet understand the deeper meanings.

Now, the customer is getting a list of hundreds to thousands of pages that may contain the answer. This saves the customer much time as they can jump to relevant pages to hopefully find their answer.

Isolating entities: Beyond keywords

Our inverted indexes were a major leap forward, saving time for both your team and customers.

Word of your improved system spreads, and soon, patrons are lining up at the door.

However, complaints start to arise about irrelevant results and factual errors. Striving for excellence, we recognize the need to address these concerns.

Issues

A word like “apple” leads to an overwhelming response – recipes, science, you name it, are all returned. How can we address this?

This is a tricky problem, and we will need to train your workforce on a few different approaches.

The first approach that might make sense is to train the workforce to grasp context to distinguish (disambiguate) between multiple meanings of a word. For example, if “Apple” is followed by “computer” or “iPhone,” it signifies a different entity than when it’s near “pie” or “tree.”

While using contextual clues is a powerful approach, it’s deceptively difficult. Your workforce needs to learn how to identify the subtle cues that reveal an entity’s true meaning within the surrounding text. This is challenging, requiring a nuanced understanding of language and subject matter expertise that machines may take years to replicate.

To effectively employ context in distinguishing word meanings, we must first construct a robust foundation that empowers our workforce to reorganize the index.

Here are the three steps we will achieve and discuss below:

  • The librarian’s guidebook: We need a clear system to help your workers understand context. They must be able to identify different meanings of the same word and file books accordingly by looking at the surrounding words. This means we need a detailed catalog of which surrounding words suggest which entities. To achieve this, we will need to start writing down surrounding words and the entities we think are associated, then compare this to the knowledge graph we build next.
  • Charting the collection: A visual map of these entities and their relationships will be invaluable. Your workers will use this chart to make connections, improving the quality of the books they suggest to patrons. By identifying an entity and traversing its attributes, we can use this information later to augment our whole process.
  • Reorganizing the shelves: Lastly, once we have a knowledge graph, a detailed map of which surrounding words give clues to an entity’s identity, we will need to revamp your library and index. Instead of only relying on traditional terms, we’ll group books by “entities” – the key people, places, things and ideas they discuss.

Step 1: Building the guidebook

Your workforce will be trained on the following three steps to help build clues as to which entity is used in the text:

  • Surrounding words: Just as search engines analyze nearby words, your workforce will look at the sentences around “apple.” Is it similar to words like “pie,” “baking,” or “recipe”? This suggests the culinary apple.
  • Book genre: The book’s overall category offers powerful clues. If it’s a history textbook, “apple” might refer to a historical figure (like Isaac Newton and his apple-inspired discovery). In a science fiction novel, it could even be a futuristic planet!
  • Sentence structure: The workforce will learn to pay attention to how “apple” is used. Is it a noun (“The apple fell.”) or an adjective (“Her cheeks were apple-red.”)? This helps them distinguish between the fruit and other meanings.

Over time, these observations form the foundation of your guidebook. It could include:

  • A list of words with multiple meanings, like “apple.”
  • Common phrases and contexts that signal a specific meaning (e.g., “apple pie” = food).
  • Links to subject-specific dictionaries for in-depth research.

Just like search engines, this system isn’t perfect. The workforce will still encounter ambiguity, but the guidebook dramatically increases their ability to identify the correct entity based on context.

This guidebook can then be used to identify new entities and link existing text to pre-existing entities (called entity-linking).

Step 2: Creating a knowledge base (hint: we won’t build this from scratch) 

Embracing existing knowledge

Building a comprehensive knowledge base from scratch would be a mammoth task. Fortunately, resources like encyclopedias provide a valuable foundation.

Just like Google, we can leverage existing knowledge sources like DBpedia. DBpedia offers well-structured categories and attributes (think of these as specialized tags), giving us a head start in organizing your library’s knowledge.

A key decision to make about your knowledge graph is what are the ontologies. We will try to develop ontologies that correspond to the types of queries we see coming into your library.

Ontologies - The knowledge graph

Entity linking: The art of connection

Next, your tireless workers must transform raw, unstructured information, such as the words on a page into linked knowledge. They’ll re-analyze the library’s books and incoming content, using contextual clues to identify and connect entities to DBpedia’s structure.

Example: Let’s say a page describes a cheetah’s incredible running speed. Your workers might:

  • Recognize “cheetah” as an entity of type “animal.”
  • Link it to DBpedia’s cheetah entry, enriching it with its scientific name, habitat information, etc.
  • Create a “top speed” attribute, assigning the value found on the page.

Let’s quickly go through an example of the entity linking process:

Entity linking process - application

Step 3: The knowledge graph takes shape

Each entity and relationship your team identifies becomes a node and edge in your growing knowledge graph – a visual map of connected information!

This structured format allows us to move beyond simple keyword matching and truly understand the meaning behind text. With the knowledge graph, we can augment our index with entities, not just terms.

Unlike plain text, entities have rich attributes associated with them. This deeper understanding will empower us to analyze unstructured text more effectively, interpret user queries more accurately, and provide highly relevant answers.

Augmenting your search results with entities

Now that your workers have built this massive graph of relationships of information, the next question is how can we use this knowledge graph to augment your answering process?

This is where we begin observing the benefits of building this huge graph.

  • Finally, we’ve solved the “apple” dilemma. Your inverted index can now accommodate multiple meanings of “apple.” We’ll assign each entity a set of aliases, helping us recognize how people refer to “apple” in various contexts. This means even if an author doesn’t use the exact search term, we can still potentially return their relevant content if they use an alias.
  • Using the same method of identifying mapping to entities, we can better understand the question coming in. For example, if someone searches “what year was apple founded,” based on contextual clues, we can link “apple” to the company. Now the returned answers only refer to the company instance of “apple.”
  • Entity traversal to understand customer searches: When a customer asks a question, we first identify the key entities within it. Then, we explore the knowledge graph to pinpoint the precise type of entity they’re interested in. This goes far beyond just matching a city name; we can distinguish between cities, historical figures, or other entities that share the same name. By understanding the entity type and its associated attributes, we gain a deeper insight into the customer’s true intent. This allows us to deliver results that ar
    • re not just textually relevant but genuinely answer the deeper meaning behind the search.
    Example - Searching for 'cities in Ohio'
    • Query expansion: Finally, we can enhance incoming queries with synonyms, attributes, and variations. Previously, if a page didn’t include the exact search terms, it wouldn’t appear in results – even if it was highly relevant. Customers might have missed fantastic content just because they didn’t use the right words. Query expansion helps us bridge this gap, surfacing a wider range of relevant pages.
    Query expansion for improved results

    What this means for SEO

    This highlights a major concept often misunderstood in SEO. Google doesn’t just hunt for exact keywords. It can understand that your page addresses a topic even if the precise keyword isn’t present.

    While it’s still wise to include variations, thanks to entity understanding, well-written pages can organically rank for related terms you haven’t explicitly targeted.

    Further augmenting search results with topical authority: Understanding books and what they are good for

    Imagine a customer asking, “What year did Steve Jobs found Apple?” Your system excels at identifying “Apple” as the company.

    However, it might mistakenly prioritize the book “10 Secret Hacks to Growing Your Business,” simply because it briefly mentions “Steve Jobs founding Apple” on page 93.

    Since we can’t fact-check every book, we might be concerned that a book about business hacks may not be a reliable source of information on Apple. This could hurt your reputation.

    We want customers to find books that spark their interest in further reading about their chosen topic. To solve this, we’ll develop a system that classifies and organizes your books by theme. This way, we can match users’ questions with thematically relevant books.

    Our workforce will analyze both the title and table of contents to determine the book’s focus. We’ll also use your knowledge graph to verify that the topics are accurately related to the user’s search, ensuring the results we provide are relevant and helpful.

    By carefully classifying books using their table of contents, we can pinpoint the specific categories that best serve particular search topics. This lets us prioritize reliable sources of information, giving a boost to books with a proven track record of expertise.

    Linking this back to a search engine, this is the foundation for concepts such as topical authority.

    Identity crisis alert

    Our new system could stumble when encountering books with overly broad topic coverage in their table of contents. For now, we’ll label these “uncategorized” and avoid boosting them in search results, ensuring we don’t mislead customers.

    SEO best practice on entities

    Dealing with new information

    Our indexing team has built a powerful system, and customers love the improved results.

    However, millennials are frustrated when searching for books defining the term “cap” – your system doesn’t recognize this slang usage. It seems Gen Z authors are driving this new language trend, and we need to ensure your system keeps pace with evolving information.

    Knowledge is constantly changing. Therefore, we’ve formed a team dedicated to identifying truly new information – scientific discoveries, groundbreaking inventions, or emerging celebrities.

    Their mission is twofold:

    • Add new entities to your existing knowledge graph.
    • Define new relationships as needed, ensuring your knowledge graph accurately reflects reality.

    Create a structured language for your authors, like schema markup

    Our final step is implementing a new paradigm that will help our library as we progress into the future. Our workers are fantastic, but a million salaries are a burden.

    Let’s empower authors to streamline the process. We’ll create a structured language, similar to Schema markup, that authors can use to clearly communicate key information.

    At the front of every book they can create tables that clearly identify different types of information that are in the book. This will allow our workforce to save time and determine what pages are available without reading them in depth. It will also allow our team to return tables of information to customers instead of pages.

    This shift away from plain text (unstructured data) will make your indexing team’s job much easier, freeing them up to tackle the influx of those exciting new Gen-Z books.

    This saves us time, so we also reward authors who use it with enhanced content and preference on the stack we send to customers. Now, we’ve completed your entity-oriented library!

    Key SEO takeaways from your newfound understanding

    We transformed a traditional library into a lightning-fast information retrieval system. Had we done this 30 years ago, we might be billionaires.

    This simplified example shows how we evolved from basic title matching to a system that truly understands the user’s intent. We even developed a structured language (think of it like schema markup) to streamline information processing. This lets your team quickly grasp a book’s core content, potentially improving how we rank results.

    While we haven’t touched on the complex topic of page scoring (the rank order in which we should send documents back to customers), we’ve achieved something remarkable. We can now pinpoint the most relevant documents, even if they don’t use an exact search term.

    Let’s distill your newfound knowledge into actionable SEO takeaways:

    • Beyond keywords: Google’s knowledge graph understands synonyms and attributes. Optimize with natural language and include terms your audience actually uses, but don’t feel bound by a rigid keyword list.
    • Context is king: Help Google grasp the full scope of your content. Provide clear attributes – whether through well-organized tables or structured data like Schema markup – giving it maximum context for understanding.
    • Schema markup saves search engines like Google time. Using entity schema markup can help disambiguate the words on your page and clarify the important entities, giving Google more trust and likely rewarding your page.