Enhancing Information Retrieval via Semantic and Relevance Matching

Suryakant Pandey
11 min read · May 8, 2021
Information retrieval

Introduction

Data is the gold of the 21st century. Every day we create quintillions of bytes of data. Information is the processed, refined form of data that carries logical meaning. Whether we search on a search engine, look for a product on an e-commerce website, or hunt for articles, products, or people, information retrieval (IR) is everywhere and has become an integral part of our daily lives.

Understanding Information retrieval

Information retrieval, in the field of computer science, is defined as the process of obtaining, from a large collection of documents, the relevant documents that satisfy an information need. In the most common case, the information need is expressed as a query string (e.g. a Google search) and the retrieved information is also text (e.g. Google's search results).

Sample IR system

The core task in IR is to first find documents that match a query (the retrieval stage) and then rank the matched documents (the ranking stage). Matching happens between the query and each document in the collection; since the collection can be very large (billions of documents), the matching logic has to be efficient. Ranking is then done based on the content relevance of each document for the query, the document's performance metrics, and the user context.

In this article, we are going to talk about algorithms that aim to find the most relevant documents for a query based on both their textual content and their semantic meaning. We will discuss how query-document matching has evolved and how matching signals can be used in ranking to increase the relevance of results.

Classification of Matching algorithms

The below diagram captures the important matching algorithms in IR. At a high level, we have the following approaches.

  • Bag of words models (textual matching is done as a set-intersection operation, treating both the query and the documents as bags of words)
  • Word embedding based models (matching happens after translating the text into an embedding space, where it is done algebraically; popular embedding techniques include word2vec and fastText. These embeddings are learnt in such a way that semantically similar texts are nearby in the embedding space and unrelated ones are far apart)

The next level of classification is based on whether the algorithm is neural-network based or not. Neural approaches are further classified into representation based (the query and documents are first translated to the embedding space individually, without knowledge of each other, and then matched; exact match signals risk being lost) and interaction based (the interaction between query terms and document terms is modelled explicitly, so exact match signals are preserved).

Classification of Matching algorithms in IR

Now we will discuss each of them through its key points. We aim to cover the breadth of the field, omit implementation details, and distil each paper's novel idea into a short takeaway.

Bag of Words Based

Basic Bag of Words Model

  • This has been the state of the art for the last 50–60 years and works quite well
  • Treat the query and documents as bags of words. For each term in the corpus, build an offline hashmap from the term to the list of documents that contain it (a simple description of an inverted index). During online matching, use this hashmap to find the documents common to all query terms.
Term-Document matrix to enable efficient retrieval
  • Effective (because exact matches are important relevance signals) and efficient (because of the inverted index, we don't need to iterate over all documents)
  • A shallow way of understanding human language, disregarding meaning and context. Eg: a document for a non-Chinese phone will match the query chinese phone.
  • Vocabulary mismatch. Eg: people will not match person
  • Takeaways: The basic bag of words model is effective and efficient for common use cases
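As a concrete illustration, the inverted index and set-intersection matching described above can be sketched in a few lines of Python. This is a toy version (corpus, tokenisation and retrieval function are all made up for the example); real systems add proper tokenisation, index compression, and scoring.

```python
from collections import defaultdict

# Toy corpus: doc id -> text
docs = {
    1: "chinese phone with dual sim",
    2: "non chinese phone",
    3: "running shoes for men",
}

# Offline: build term -> set of doc ids (the inverted index)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def retrieve(query):
    """Return docs containing ALL query terms (set intersection)."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

print(retrieve("chinese phone"))  # {1, 2} -- note doc 2 is the
                                  # irrelevant "non chinese phone"
```

Note how the query "chinese phone" also retrieves the "non chinese phone" document, which is exactly the shallowness described in the bullet above.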

Pseudo-relevance feedback

  • References: Manning book, Robust PSR (2006)
  • Aims to increase recall when only a few documents are returned for a query
  • The assumption is that the top-ranked documents are relevant to the query. Recall is increased by returning results using the query plus the top documents
  • The top documents may contain some non-relevant content, which can take the results in an altogether different direction. To avoid that, the model estimates the potentially different amount of relevant information in each feedback document, in an iterative manner (diagram below)
  • Takeaways: Increase recall using documents already known to be relevant for a query
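A minimal sketch of the idea: expand the query with the most frequent new terms from the assumed-relevant top documents. The per-document relevance weighting from the 2006 paper is omitted, and the example texts are made up.

```python
from collections import Counter

def expand_query(query, top_docs, k=2):
    """Pseudo-relevance feedback: add the k most frequent non-query
    terms from the assumed-relevant top documents to the query."""
    q_terms = set(query.split())
    counts = Counter(t for d in top_docs for t in d.split() if t not in q_terms)
    return query.split() + [t for t, _ in counts.most_common(k)]

# Top-ranked results for the original query, assumed relevant
top_docs = ["cheap flight tickets online", "flight tickets booking online"]
print(expand_query("flight tickets", top_docs))
# ['flight', 'tickets', 'online', 'cheap']
```

The expanded query now also retrieves documents mentioning "online" or "cheap" that never contained the original terms, which is where the recall gain comes from.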

Sequential Dependence Model

  • References: SDM (2014)
  • Represents a text not only by its terms but also by pairs of terms that co-occur within a given distance
  • If terms that are in proximity in the query are also in proximity in a document, then there is strong evidence in favour of relevance.
  • Using a Markov model, it models how likely a query term qi is to describe the document, as well as the importance of term proximity
  • Takeaways: Matching of word n-grams is an important signal. Terms appearing contiguously within a query provide different (stronger) evidence about the information need than a set of non-contiguous query terms
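The proximity intuition can be sketched as three toy SDM-style features: unigram matches, ordered bigram matches, and query-term pairs co-occurring within a window. The Markov random field weighting of these features is omitted, and the function and texts are made up for illustration.

```python
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def sdm_features(query, doc, window=4):
    """Toy sequential-dependence features: unigram matches, ordered
    bigram matches, and query-term pairs co-occurring within a window."""
    q, d = query.split(), doc.split()
    unigram = sum(1 for t in q if t in d)
    ordered = sum(1 for b in bigrams(q) if b in bigrams(d))
    positions = {t: i for i, t in enumerate(d)}
    unordered = sum(
        1 for a, b in bigrams(q)
        if a in positions and b in positions
        and abs(positions[a] - positions[b]) <= window
    )
    return unigram, ordered, unordered

print(sdm_features("new york pizza", "best pizza in new york city"))
# (3, 1, 2): all terms match, "new york" matches contiguously,
# and both query bigrams fall inside the proximity window
```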

Composite Match Autocompletion (COMMA)

  • References: COMMA (2014)
  • This algorithm performs semantic matching by matching on multiple fields such as category and facets
  • Three filters are applied for each query:
    1) documents whose titles match the query (syntactic matching)
    2) documents whose categories match the query (semantic matching)
    3) documents whose facets match the query (semantic matching)
  • The final set is the union of all three. Category and facet matching is syntactic matching on prefix terms only (giving pseudo-semantic matching).
  • Ranking is done on syntactic relevance and semantic relevance (category match, number of facets matched, facet salience)
  • Takeaways: Category-match and facet-match signals help show more relevant docs at the top. Even when not all terms of the query match the document title, a category match and a maximal facet match may still hold, which helps increase recall

Latent Semantic Analysis

  • References: Blog, LSI (1990)
  • If the word pair (snow, winter) occurs together more frequently than the pair (dog, winter), then it carries a stronger semantic association. This is the underlying intuition of the algorithm.
  • Each document and each term gets associated with topics, and matching happens in the topic space
  • Singular value decomposition (SVD) of the term-document matrix is then performed to obtain the topics
SVD factorisation to learn topics
  • Takeaways: LSI learns topics in an unsupervised manner, so there can be precision issues. However, it can augment the basic bag of words model: documents that contain query-term matches as well as a match in the topic space are considered more relevant.
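A small sketch of the SVD step with NumPy on a toy term-document matrix (the vocabulary and counts are made up so that two latent topics are visible):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents)
terms = ["snow", "winter", "ski", "dog", "puppy"]
A = np.array([
    [2, 1, 0],   # snow
    [1, 2, 0],   # winter
    [1, 1, 0],   # ski
    [0, 0, 2],   # dog
    [0, 0, 1],   # puppy
], dtype=float)

# SVD: A = U S Vt; truncating to k components gives the topic space
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_topics = U[:, :k] * S[:k]   # each term as a k-dim topic vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Terms from the same topic end up close in the latent space
print(cosine(term_topics[0], term_topics[1]))  # snow vs winter: high
print(cosine(term_topics[0], term_topics[3]))  # snow vs dog: near zero
```

The "weather" terms cluster on one latent dimension and the "pet" terms on the other, which is exactly the topic structure LSI exploits at matching time.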

Latent Dirichlet Allocation

  • References: Blog, LDA (2003)
  • Learns topics for documents and words, in a different way than LSI
  • Each word is initialised with a random topic; each iteration then converges towards the final assignment
  • LDA generally achieves better accuracy than LSI

Semantic Similarity with In-Session Queries

  • References: Semantic similarity with InSession Queries (2018)
  • Considers the semantic similarity between candidate documents and the documents clicked previously in the same session, on the basis of word2vec (words that share common contexts in the corpus are located close to one another in the embedding space)
  • Takeaways: Within a session, context can be used to better understand the user's intent and match documents accordingly
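A toy sketch of the idea, with hand-made stand-ins for word2vec vectors (a real system would use trained embeddings; the vocabulary and values here are invented):

```python
import math

# Toy "pretrained" word vectors (stand-ins for word2vec embeddings)
vectors = {
    "laptop":  [0.9, 0.1, 0.0],
    "charger": [0.8, 0.2, 0.1],
    "macbook": [0.85, 0.15, 0.05],
    "garden":  [0.0, 0.1, 0.9],
    "hose":    [0.1, 0.0, 0.95],
}

def embed(text):
    """Average the word vectors of a text (skips unknown words)."""
    vecs = [vectors[w] for w in text.split() if w in vectors]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Documents clicked earlier in the session reveal the user's intent
session_clicks = embed("laptop charger")
print(cosine(session_clicks, embed("macbook")))      # high: same intent
print(cosine(session_clicks, embed("garden hose")))  # low: different intent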

Content relevance model based on engagement rate

Twitter Content Relevance

MERCURE System

  • References: MERCURE (1994)
  • A 3-layered network architecture that also captures inter-term dependencies. One of the earliest neural networks in the field of IR

Deep Structured Semantic Model (DSSM)

  • References : DSSM (2013), CDSSM (2014)
  • Neural network based semantic matching of query and documents
  • Input to the neural network: a high-dimensional term vector (a one-hot encoding of the terms it contains). To reduce the dimensionality of the term vectors (from ~500K to ~30K), one-hot encoding of character trigrams is used instead of whole words.
  • Output: a concept vector in a low-dimensional semantic feature space
Deep semantic similarity model
  • Both the features and the model are learnt automatically
  • Historical query-document click data is used for training
  • Using a convolutional neural network (CDSSM) gives a further advantage
  • Takeaways: DSSM automatically learns semantic matching features. However, exact match signals (which are very useful for relevance matching) are lost before matching takes place. This shortcoming is overcome by interaction-based neural models (e.g. DRMM), which model the interaction between query terms and document terms to retain exact match signals
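The word-hashing step of DSSM (breaking a word into character trigrams after padding with '#') is easy to sketch:

```python
def char_trigrams(word):
    """DSSM word hashing: pad the word with '#' and slide a 3-char window."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(char_trigrams("good"))  # ['#go', 'goo', 'ood', 'od#']
```

The vocabulary of trigrams is far smaller than the vocabulary of words, which is what shrinks the input layer from ~500K to ~30K dimensions, and the DSSM paper reports that collisions (distinct words with identical trigram sets) are rare.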

Deep Relevance Matching Model

  • References: DRMM (2017)
  • A neural network designed to capture relevance matching
  • Semantic matching (meaning should be the same; treats query and document alike) vs. relevance matching (exact match signals, query term importance, and the fact that queries are usually much shorter than documents)
  • The matching-score ordering of documents is generally: exact match > soft match > weak soft match
  • For each query term, it calculates the cosine similarity with every term of the document in the embedding space. Eg: given the query term “car” and a document with terms (car, rent, truck, bump, injunction, runway), the corresponding local interactions based on cosine similarity are (1, 0.2, 0.7, 0.3, −0.1, 0.1)
  • It then builds a histogram by classifying each cosine similarity into one of the bins {[−1, −0.5), [−0.5, 0), [0, 0.5), [0.5, 1), [1, 1]}. The bin [1, 1] captures exact match signals. For the previous example, we obtain the matching histogram [0, 1, 3, 1, 1]
  • This histogram pooling aims to separate exact matches from soft matches, and strong soft matches from weak ones
  • Takeaways: This captures the relevance of a document for a query. The limitation is that one term of the query is matched against exactly one term of the document, which might not capture all vocabulary-match scenarios (Eg: USA vs United States of America, where one term of the LHS matches three terms of the RHS)
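The matching-histogram step can be reproduced directly from the "car" example above:

```python
def matching_histogram(similarities):
    """DRMM count-based histogram over the bins
    [-1,-0.5), [-0.5,0), [0,0.5), [0.5,1), [1,1]."""
    hist = [0] * 5
    for s in similarities:
        if s == 1.0:
            hist[4] += 1      # exact match bin
        elif s >= 0.5:
            hist[3] += 1      # strong soft match
        elif s >= 0.0:
            hist[2] += 1      # weak soft match
        elif s >= -0.5:
            hist[1] += 1
        else:
            hist[0] += 1
    return hist

# Cosine similarities of query term "car" against the document terms
# (car, rent, truck, bump, injunction, runway)
print(matching_histogram([1, 0.2, 0.7, 0.3, -0.1, 0.1]))  # [0, 1, 3, 1, 1]
```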

Doc2Query

  • References: Doc2Query (2019)
  • Generates potential queries from documents using neural machine translation and indexes those queries as document-expansion terms
  • Useful for long passages and FAQ-style searches
Doc2Query

K-NRM (Kernel-based Neural Ranking Model)

  • References: K-NRM (2017), K-NRM1 (2018), Conv-K-NRM(2019)
  • First, word embeddings are generated for the query and the document
  • Then query n-grams and document n-grams of varying lengths are cross-matched (this handles a shortcoming of DRMM)
  • Useful soft matches are distinguished from noisy ones using kernel pooling
K-NRM model architecture
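Kernel pooling can be sketched in plain Python: each RBF kernel centred at mu softly counts how many query-document similarities fall near mu. The kernel centres, width, and similarity values below are toy choices for illustration, reusing the similarities from the DRMM example.

```python
import math

def kernel_pooling(sim_matrix, mus=(-0.5, 0.0, 0.5, 1.0), sigma=0.1):
    """K-NRM kernel pooling: each RBF kernel softly counts how many
    query-document similarities fall near its centre mu."""
    features = []
    for mu in mus:
        # soft-TF per query term, then log-sum over query terms
        soft_tf = [
            sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            for row in sim_matrix
        ]
        features.append(sum(math.log(max(tf, 1e-10)) for tf in soft_tf))
    return features

# One query term vs six document terms
sims = [[1, 0.2, 0.7, 0.3, -0.1, 0.1]]
print(kernel_pooling(sims))
```

Where DRMM uses hard histogram bins, the kernels give a differentiable soft binning, so the pooling can be trained end to end with the embeddings.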

Semantic Product Search

  • Reference: Product Search ( 2019)
  • Generates embeddings for unigrams + bigrams + character trigrams for both query and document, and then performs the matching
  • Handles OOV (out-of-vocabulary) words using consistent hashing
Generating embedding for document and query
Neural network for semantic product search
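A simplified sketch of hashing OOV tokens into a fixed number of shared embedding bins. The paper uses consistent hashing; a single MD5 hash, the bin count, and the toy vocabulary here are illustrative stand-ins.

```python
import hashlib

NUM_BINS = 1000  # size of the shared hash-embedding table (toy value)

def token_id(token, vocab):
    """Map in-vocabulary tokens to their id; hash OOV tokens into a
    fixed set of bins so they still get a (shared) embedding."""
    if token in vocab:
        return vocab[token]
    h = int(hashlib.md5(token.encode()).hexdigest(), 16)
    return len(vocab) + h % NUM_BINS

vocab = {"phone": 0, "case": 1}
print(token_id("phone", vocab))          # 0: known token
print(token_id("zzznovelbrand", vocab))  # stable bin id for the OOV token
```

Hashing is deterministic, so the same unseen brand name always lands in the same bin and receives a consistent (if shared) embedding instead of being dropped.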

DocBert

  • References: DocBert (2019)
  • Uses contextualised representations of words
  • Tokens are mapped to embeddings. To further separate the query from the document, segment embeddings ‘Q’ (for query tokens) and ‘D’ (for document tokens) are added to the token embeddings. To capture word order, position embeddings are added as well. The resulting vectors then go through several layers of transformers
Bert Contextualised embedding
  • Takeaways: Not all occurrences of a term are treated alike; the representation depends on whether the term occurs in the document or the query, and on the position at which it appears. This helps to better model the context of a term.
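The input construction (token + segment + position embeddings summed per token) can be sketched with NumPy, using random toy lookup tables in place of the learned ones (tokens, dimensions, and values are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy embedding size

# Toy lookup tables (these would be learned in a real model)
token_emb = {t: rng.normal(size=dim) for t in ["best", "car", "rent", "[SEP]"]}
segment_emb = {"Q": rng.normal(size=dim), "D": rng.normal(size=dim)}
position_emb = [rng.normal(size=dim) for _ in range(16)]

def input_embedding(tokens, segments):
    """BERT-style input: token + segment + position embeddings, summed."""
    return np.stack([
        token_emb[t] + segment_emb[s] + position_emb[i]
        for i, (t, s) in enumerate(zip(tokens, segments))
    ])

x = input_embedding(["best", "car", "[SEP]", "rent", "car"],
                    ["Q", "Q", "Q", "D", "D"])
print(x.shape)  # (5, 8)
```

Note that the two occurrences of "car" get different input vectors: one carries the 'Q' segment at position 1, the other the 'D' segment at position 4, which is exactly why occurrences are not treated alike.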

Deep Contextualized Term Weighting framework (DeepCT)

  • References: DeepCT (2020), HDCT (2020)
  • Identifies important terms in a long text; useful for passage retrieval
  • Finds the most central words in a text based on meaning, not on word frequency
DeepCT in action
  • Takeaways: It finds the most central words in a passage, so that a document matches when its central words appear in the query. In the example above, the 2nd document is off-topic even though it contains more occurrences of the term DNA.

Summary

  1. The bag of words model is very effective (exact term matching is a very important relevance signal) and efficient (thanks to the inverted index, getting the documents that contain matching terms is very fast and does not require iterating over the whole collection)
  2. The bag of words model has certain limitations: it doesn't understand human language (a document for a non-Chinese phone will match the query chinese phone) and it suffers from vocabulary mismatch (a document about a puppy will not match a query about a dog). Term importance is decided by term frequency, not by semantic understanding.
  3. We can increase recall by using a mix of the top documents and the user-entered query as feedback to return more documents
  4. Modelling phrase importance and term importance in the document (P(Document|Phrase), P(Document|Term)) can increase relevance
  5. If terms that are in proximity in the query are also in proximity in the document, then there is strong evidence in favour of relevance.
  6. Matching the query not just against the document title but also against its category and facets, and deriving ranking signals from matches in these fields, can help increase relevance
  7. Semantic matching (same meaning; treats query and document alike) differs from relevance matching (exact match signals, query term importance, queries usually much shorter than documents)
  8. Use user engagement on a document for a query as a measure of relevance.
  9. The various neural retrieval models discussed (DSSM, DRMM, K-NRM, semantic product search) try to capture the following:
  • interaction between query and document terms
  • interaction between query and document character trigrams
  • interaction between query and document word n-grams of different lengths
  • signals for exact matches as well as soft matches
  • signals for the most important terms of the query and the document

Feedback

Questions? Comments? Contact at: LinkedIn, Instagram
