Publications

Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition Reproduce

Published in Accepted SIGIR-2025, 2025

Abstract

Dense retrievers utilize pre-trained backbone language models (e.g., BERT, LLaMA) that are fine-tuned via contrastive learning to perform the task of encoding text into sense representations that can be then compared via a shallow similarity operation, e.g. inner product. Recent research has questioned the role of fine-tuning vs. that of pre-training within dense retrievers, specifically arguing that retrieval knowledge is primarily gained during pre-training, meaning knowledge not acquired during pre-training cannot be sub-sequentially acquired via fine-tuning. We revisit this idea here as the claim was only studied in the context of a BERT-based encoder using DPR as representative dense retriever. We extend the previous analysis by testing other representation approaches (comparing the use of CLS tokens with that of mean pooling), backbone architectures (encoder-only BERT vs. decoder-only LLaMA), and additional datasets (MSMARCO in addition to Natural Questions). Our study confirms that in DPR tuning, pre-trained knowledge underpins retrieval performance, with fine-tuning primarily adjusting neuron activation rather than reorganizing knowledge. However, this pattern does not hold universally, such as in mean-pooled (Contriever) and decoder-based (LLaMA) models. We ensure full reproducibility and make our implementation publicly available at https://github.com/ielab/DenseRetriever-Knowledge-Acquisition.

Recommended citation: Zheng Yao, Shuai Wang and Guido Zuccon. 2025. Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition (Accepted SIGIR-2025). https://arxiv.org/abs/2505.07166

Reassessing Large Language Model Boolean Query Generation for Systematic Reviews Reproduce

Published in Accepted SIGIR-2025, 2025

Abstract

Systematic reviews are comprehensive literature reviews that address highly focused research questions and represent the highest form of evidence in medicine. A critical step in this process is the development of complex Boolean queries to retrieve relevant literature. Given the difficulty of manually constructing these queries, recent efforts have explored Large Language Models (LLMs) to assist in their formulation. One of the first studies,Wang et al., investigated ChatGPT for this task, followed by Staudinger et al., which evaluated multiple LLMs in a reproducibility study. However, the latter overlooked several key aspects of the original work, including (i) validation of generated queries, (ii) output formatting constraints, and (iii) selection of examples for chain-of-thought (Guided) prompting. As a result, its findings diverged significantly from the original study. In this work, we systematically reproduce both studies while addressing these overlooked factors. Our results show that query effectiveness varies significantly across models and prompt designs, with guided query formulation benefiting from well-chosen seed studies. Overall, prompt design and model selection are key drivers of successful query formulation. Our findings provide a clearer understanding of LLMs’ potential in Boolean query generation and highlight the importance of model- and prompt-specific optimisations. The complex nature of systematic reviews adds to challenges in both developing and reproducing methods but also highlights the importance of reproducibility studies in this domain.

Recommended citation: Shuai Wang, Harrisen Scells, Bevan Koopman and Guido Zuccon. 2025. Reassessing Large Language Model Boolean Query Generation for Systematic Reviews. (Accepted SIGIR-2025). https://arxiv.org/abs/2505.07155

2D Matryoshka Training for Information Retrieval Reproduce

Published in Accepted SIGIR-2025, 2025

Abstract

2D Matryoshka Training is an advanced embedding representation training approach designed to train an encoder model simultaneously across various layer-dimension setups. This method has demonstrated higher effectiveness in Semantic Text Similarity (STS) tasks over traditional training approaches when using sub-layers for embeddings. Despite its success, discrepancies exist between two published implementations, leading to varied comparative results with baseline models. In this reproducibility study, we implement and evaluate both versions of 2D Matryoshka Training on STS tasks and extend our analysis to retrieval tasks. Our findings indicate that while both versions achieve higher effectiveness than traditional Matryoshka training on sub-dimensions, and traditional full-sized model training approaches, they do not outperform models trained separately on specific sub-layer and sub-dimension setups. Moreover, these results generalize well to retrieval tasks, both in supervised (MSMARCO) and zero-shot (BEIR) settings. Further explorations of different loss computations reveals more suitable implementations for retrieval tasks, such as incorporating full-dimension loss and training on a broader range of target dimensions. Conversely, some intuitive approaches, such as fixing document encoders to full model outputs, do not yield improvements. Our reproduction code is available at https://github.com/ielab/2DMSE-Reproduce

Recommended citation: Shuai Wang, Shengyao Zhuang, Bevan Koopman and Guido Zuccon. 2025. 2D Matryoshka Training for Information Retrieval. (Accepted SIGIR-2025). https://arxiv.org/abs/2411.17299

Corpus Subsampling: Estimating the Effectiveness of Neural Retrieval Models on Large Corpora Long

Published in ECIR-2025, 2025

Abstract

Due to their low efficiency, neural retrieval models are usually evaluated on small corpora (e.g. MS MARCO or BEIR subsets) or in re-ranking scenarios using a more efficient first-stage retriever. To estimate their effectiveness on larger corpora independently of a first-stage retriever, we propose a new corpus subsampling strategy based on the top-k results of the pooled systems that contributed to the relevance judgments of a corpus. Our experiments on nine TREC tasks covering different corpus sizes show that using the top-1,000 or even only the top-100 pools provides a reliable effectiveness estimate for neural models. This reduces the required experimental resources for large corpora by a factor of up to 1,000 and enables a “green” IR evaluation.

Recommended citation: Maik Fröbe, Andrew Parry, Harrisen Scells, Shuai Wang, Shengyao Zhuang, Guido Zuccon, Martin Potthast and Matthias Hagen. 2025. Corpus Subsampling: Estimating the Effectiveness of Neural Retrieval Models on Large Corpora. In: Hauff, C., et al. Advances in Information Retrieval. ECIR 2025. Lecture Notes in Computer Science, vol 15572. Springer, Cham. https://doi.org/10.1007/978-3-031-88708-6_29. https://link.springer.com/chapter/10.1007/978-3-031-88708-6_29

Starbucks: Improved Training for 2D Matryoshka Embeddings Long

Published in Arxiv, 2024

Abstract

Effective approaches that can scale embedding model depth (i.e. layers) and embedding size allow for the creation of models that are highly scalable across different computational resources and task requirements. While the recently proposed 2D Matryoshka training approach can efficiently produce a single embedding model such that its sub-layers and sub-dimensions can measure text similarity, its effectiveness is significantly worse than if smaller models were trained separately. To address this issue, we propose Starbucks, a new training strategy for Matryoshka-like embedding models, which encompasses both the fine-tuning and pre-training phases. For the fine-tuning phase, we discover that, rather than sampling a random sub-layer and sub-dimensions for each training steps, providing a fixed list of layer-dimension pairs, from small size to large sizes, and computing the loss across all pairs significantly improves the effectiveness of 2D Matryoshka embedding models, bringing them on par with their separately trained counterparts. To further enhance performance, we introduce a new pre-training strategy, which applies masked autoencoder language modelling to sub-layers and sub-dimensions during pre-training, resulting in a stronger backbone for subsequent fine-tuning of the embedding model. Experimental results on both semantic text similarity and retrieval benchmarks demonstrate that the proposed pre-training and fine-tuning strategies significantly improved the effectiveness over 2D Matryoshka models, enabling Starbucks models to perform more efficiently and effectively than separately trained models.

Recommended citation: Shengyao Zhuang*, Shuai Wang*, Bevan Koopman and Guido Zuccon. 2024. Starbucks: Improved Training for 2D Matryoshka Embeddings. (Arxiv Preprint). https://arxiv.org/abs/2410.13230

Context Embeddings for Efficient Answer Generation in RAG Long

Published in WSDM2025, 2024

Abstract

Retrieval-Augmented Generation (RAG) allows overcoming the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer which slows down decoding time directly translating to the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method, reducing long contexts to only a handful of Context Embeddings speeding up the generation time by a large margin. Our method allows for different compression rates trading off decoding time for answer quality. Compared to earlier methods, COCOM allows for handling multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69 × while achieving higher performance compared to existing efficient context compression methods.

Recommended citation: David Rau*, Shuai Wang*, Hervé Déjean and Stéphane Clinchant. 2024. Context Embeddings for Efficient Answer Generation in RAG. (Accepted in WSDM2025). https://arxiv.org/abs/2407.09252

BERGEN: A Benchmarking Library for Retrieval-Augmented Generation Resource

Published in EMNLP2024, 2024

Abstract

Retrieval-Augmented Generation allows to enhance Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve an intricate number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets. Our open-source library BERGEN is available under https://github.com/naver/bergen

Recommended citation: David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina and Stéphane Clinchant. 2024. BERGEN: A Benchmarking Library for Retrieval-Augmented Generation. (Accepted in EMNLP2024 Findings). https://arxiv.org/abs/2407.01102

Zero-shot Generative Large Language Models for Systematic Review Screening Automation Long

Published in ECIR-2025, 2024

Abstract

We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot ranking methods based on LLMs have recently been proposed. Among many aspects, methods differ across (1) the ranking algorithm they implement, e.g., pointwise vs. listwise, (2) the backbone LLMs used, e.g., GPT3.5 vs. FLAN-T5, (3) the components and wording used in prompts, e.g., the use or not of role-definition (role-playing) and the actual words used to express this. It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts. This confusion risks to undermine future research. Through our large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between methods for zero-shot LLM ranking. However, so do the LLM backbones – but even more importantly, the choice of prompt components and wordings affect the ranking. In fact, in our experiments, we find that, at times, these latter elements have more impact on the ranker’s effectiveness than the actual ranking algorithms, and that differences among ranking methods become more blurred when prompt variations are considered.

Recommended citation: Shuoqi Sun, Shengyao Zhuang, Shuai Wang and Guido Zuccon. 2024. Zero-shot Generative Large Language Models for Systematic Review Screening Automation. (Accepted in ECIR 2025). https://arxiv.org/pdf/2406.14117

Large Language Models for Stemming: Promises, Pitfalls and Failures Short

Published in SIGIR-2024, 2024

Abstract

Text stemming is a natural language processing technique that is used to reduce words to their base form, also known as the root form. The use of stemming in IR has been shown to often improve the effectiveness of keyword-matching models such as BM25. However, traditional stemming methods, focusing solely on individual terms, overlook the richness of contextual information. Recognizing this gap, in this paper, we investigate the promising idea of using large language models (LLMs) to stem words by leveraging its capability of context understanding. With this respect, we identify three avenues, each characterised by different trade-offs in terms of computational cost, effectiveness and robustness : (1) use LLMs to stem the vocabulary for a collection, i.e., the set of unique words that appear in the collection (vocabulary stemming), (2) use LLMs to stem each document separately (contextual stemming), and (3) use LLMs to extract from each document entities that should not be stemmed, then use vocabulary stemming to stem the rest of the terms (entity-based contextual stemming). Through a series of empirical experiments, we compare the use of LLMs for stemming with that of traditional lexical stemmers such as Porter and Krovetz for English text. We find that while vocabulary stemming and contextual stemming fail to achieve higher effectiveness than traditional stemmers, entity-based contextual stemming can achieve a higher effectiveness than using Porter stemmer alone, under specific conditions.

Recommended citation: Shuai Wang, Shengyao Zhuang and Guido Zuccon. 2024. Large Language Models for Stemming: Promises, Pitfalls and Failures. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024). https://arxiv.org/abs/2402.11757

Evaluating Generative Ad Hoc Information Retrieval Long

Published in SIGIR-2024, 2024

Abstract

Recent advances in large language models have enabled the development of viable generative information retrieval systems. A generative retrieval system returns a grounded generated text in response to an information need instead of the traditional document ranking. Quantifying the utility of these types of responses is essential for evaluating generative retrieval systems. As the established evaluation methodology for ranking-based ad hoc retrieval may seem unsuitable for generative retrieval, new approaches for reliable, repeatable, and reproducible experimentation are required. In this paper, we survey the relevant information retrieval and natural language processing literature, identify search tasks and system architectures in generative retrieval, develop a corresponding user model, and study its operationalization. This theoretical analysis provides a foundation and new insights for the evaluation of generative ad hoc retrieval systems.

Recommended citation: Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guido Zuccon, Benno Stein, Matthias Hagen and Martin Potthast. 2024. Evaluating Generative Ad Hoc Information Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024). https://arxiv.org/abs/2311.04694

FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation Resource

Published in SIGIR-2024, 2024

Abstract

Federated search systems aggregate results from multiple search engines, selecting appropriate sources to enhance result quality and align with user intent. With the increasing uptake of Retrieval-Augmented Generation (RAG) pipelines, federated search can play a pivotal role in sourcing relevant information across heterogeneous data sources to generate informed responses. However, existing datasets, such as those developed in the past TREC FedWeb tracks, predate the RAG paradigm shift and lack representation of modern information retrieval challenges. To bridge this gap, we present FeB4RAG, a novel dataset specifically designed for federated search within RAG frameworks. This dataset, derived from 16 sub-collections of the widely used \beir benchmarking collection, includes 790 information requests (akin to conversational queries) tailored for chatbot applications, along with top results returned by each resource and associated LLM-derived relevance judgements. Additionally, to support the need for this collection, we demonstrate the impact on response generation of a high quality federated search system for RAG compared to a naive approach to federated search. We do so by comparing answers generated through the RAG pipeline through a qualitative side-by-side comparison. Our collection fosters and supports the development and evaluation of new federated search methods, especially in the context of RAG pipelines.

Recommended citation: Shuai Wang, Ekaterina Khramtsova, Shengyao Zhuang and Guido Zuccon. 2024. FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024). https://arxiv.org/abs/2402.11891

ReSLLM: Large Language Models are Strong Resource Selectors for Federated Search Short

Published in Arxiv, 2024

Abstract

Federated search, which involves integrating results from multiple independent search engines, will become increasingly pivotal in the context of Retrieval-Augmented Generation pipelines empowering LLM-based applications such as chatbots. These systems often distribute queries among various search engines, ranging from specialized (e.g., PubMed) to general (e.g., Google), based on the nature of user utterances. A critical aspect of federated search is resource selection - the selection of appropriate resources prior to issuing the query to ensure high-quality and rapid responses, and contain costs associated with calling the external search engines. However, current SOTA resource selection methodologies primarily rely on feature-based learning approaches. These methods often involve the labour intensive and expensive creation of training labels for each resource. In contrast, LLMs have exhibited strong effectiveness as zero-shot methods across NLP and IR tasks. We hypothesise that in the context of federated search LLMs can assess the relevance of resources without the need for extensive predefined labels or features. In this paper, we propose ReSLLM. Our ReSLLM method exploits LLMs to drive the selection of resources in federated search in a zero-shot setting. In addition, we devise an unsupervised fine tuning protocol, the Synthetic Label Augmentation Tuning (SLAT), where the relevance of previously logged queries and snippets from resources is predicted using an off-the-shelf LLM and then in turn used to fine-tune ReSLLM with respect to resource selection. Our empirical evaluation and analysis details the factors

Recommended citation: Shuai Wang, Shengyao Zhuang, Bevan Koopman and Guido Zuccon. 2024. ReSLLM: Large Language Models are Strong Resource Selectors for Federated Search. (Accepted in WWW2025). https://arxiv.org/pdf/2401.17645

Zero-shot Generative Large Language Models for Systematic Review Screening Automation Long

Published in ECIR-2024, 2023

Abstract

Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of using zero-shot large language models (LLMs) for automatic screening. We evaluate the effectiveness of eight different LLMs, and investigate a calibration technique that uses a predefined recall threshold to determine whether a publication should be included in a systematic review. Our comprehensive evaluation using five standard test collections shows that instruction fine-tuning plays an important role in screening, that calibration renders LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches

Recommended citation: Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan Koopman and Guido Zuccon. 2023. Zero-shot Generative Large Language Models for Systematic Review Screening Automation. In Proceedings of the 46th European Conference on Information Retrieval (ECIR 2024). https://link.springer.com/chapter/10.1007/978-3-031-56027-9_25

Generating Natural Language Queries for More Effective Systematic Review Screening Prioritisation Long

Published in SIGIR-AP-2023, 2023

Abstract

Screening prioritisation in medical systematic reviews aims to rank the set of documents retrieved by complex Boolean queries. The goal is to prioritise the most important documents so that subsequent review steps can be carried out more efficiently and effectively. The current state of the art uses the final title of the review to rank documents using BERT-based neural neural rankers. However, the final title is only formulated at the end of the review process, which makes this approach impractical as it relies on ex post facto in- formation. At the time of screening, only a rough working title is available, with which the BERT-based ranker achieves is signifi- cantly worse than the final title. In this paper, we explore alternative sources of queries for screening prioritisation, such as the Boolean query used to retrieve the set of documents to be screened, and queries generated by instruction-based generative large language models such as ChatGPT and Alpaca. Our best approach is not only practical based on the information available at screening time, but is similar in effectiveness with the final title.

Recommended citation: Shuai Wang, Harrisen Scells, Martin Potthast, Bevan Koopman and Guido Zuccon. 2023. Generating Natural Language Queries for More Effective Systematic Review Screening Prioritisation. In Proceedings of the international ACM SIGIR Conference on Information Retrieval in the Asia Pacific November 26-29, 2023 (SIGIR-AP 2023). https://arxiv.org/abs/2309.05238

Can ChatGPT Write a Good Boolean Query for Systematic Review Literature Search? Long

Published in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), 2023

Abstract

Systematic reviews are comprehensive reviews of the literature for a highly focused research question. These reviews are often treated as the highest form of evidence in evidence-based medicine, and are the key strategy to answer research questions in the medical field. To create a high-quality systematic review, complex Boolean queries are often constructed to retrieve studies for the review topic. However, it often takes a long time for systematic review researchers to construct a high quality systematic review Boolean query, and often the resulting queries are far from effective. Poor queries may lead to biased or invalid reviews, because they missed to retrieve key evidence, or to extensive increase in review costs, because they retrieved too many irrelevant studies. Recent advances in Transformer-based generative models have shown great potential to effectively follow instructions from users and generate answers based on the instructions being made. In this paper, we investigate the effectiveness of the latest of such models, ChatGPT, in generating effective Boolean queries for systematic review literature search. Through a number of extensive experiments on standard test collections for the task, we find that ChatGPT is capable of generating queries that lead to high search precision, although trading-off this for recall. Overall, our study demonstrates the potential of ChatGPT in generating effective Boolean queries for systematic review literature search. The ability of ChatGPT to follow complex instructions and generate queries with high precision makes it a valuable tool for researchers conducting systematic reviews, particularly for rapid reviews where time is a constraint and often trading-off higher precision for lower recall is acceptable.

Recommended citation: Shuai Wang, Harrisen Scells, Bevan Koopman and Guido Zuccon. 2023. Can ChatGPT Write a Good Boolean Query for Systematic Review Literature Search? In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023). https://dl.acm.org/doi/10.1145/3539618.3591703#:~:text=In%20this%20paper%2C%20we%20investigate,to%20generate%20effective%20Boolean%20queries.

Balanced Topic Aware Sampling for Effective Dense Retriever: A Reproducibility Study Reproduce

Published in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023), 2023

Abstract

Knowledge distillation plays a key role in boosting the effectiveness of rankers based on pre-trained language models (PLMs); this is achieved using an effective but inefficient large model to teach a more efficient student model. In the context of knowledge distillation for a student dense passage retriever, the balanced topic-aware sampling method has been shown to provide state-of-the-art effectiveness. This method intervenes on the creation of the training batches by creating batches that contain positive-negative pairs of passages from the same topic, and balance the pairwise margins of the positive and negative passages. In this paper we reproduce the balanced topic-aware sampling method; we do so for both the dataset used for evaluation in the original work (MS MARCO) and for a dataset in a different domain, that of product search (Amazon shopping queries dataset) to study whether the original results generalize to a different context. We show that while we could not replicate the exact results from the original paper, we do confirm the original findings in terms of trends: balanced topic-aware sampling indeed leads to highly effective dense retrievers. These results partially generalize to the other search task we investigate, product search: although we observe the improvements are less significant compared to MS MARCO. In addition to reproducing the original results and study how the method generalizes to a different dataset, we also investigate a key aspect that influences the effectiveness of the method: the use of a hard margin threshold for negative sampling. This aspect was not studied in the original paper. With respect to hard margins, we find that while setting different hard margin values significantly influences the effectiveness of the student model, this impact is dataset-dependent – and indeed it does depend on the score distributions exhibited by retrieval models on the dataset at hand. Our reproducibility code is available anonymously, and will be published later on.

Recommended citation: Shuai Wang, and Guido Zuccon. 2023. Balanced Topic Aware Sampling for Effective Dense Retriever: A Reproducibility Study. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023). https://dl.acm.org/doi/abs/10.1145/3539618.3591915

MeSH Suggester: A Library and System for MeSH Term Suggestion for Systematic Review Boolean Query Construction Short

Published in The 16th ACM International Web Search and Data Mining Conference, 2023

Abstract

Boolean query construction is often critical for medical systematic review literature search. To create an effective Boolean query, sys- tematic review researchers typically spend weeks coming up with effective query terms and combinations. One challenge to creating an effective systematic review Boolean query is the selection of effective MeSH Terms to include in the query. In our previous work, we created neural MeSH term suggestion methods and compared them to state-of-the-art MeSH term suggestion methods. We found neural MeSH term suggestion methods to be highly effective.

Recommended citation: Shuai Wang and Hang Li and Guido Zuccon. 2023. MeSH Suggester: A Library and System for MeSH Term Suggestion for Systematic Review Boolean Query Construction. In the 16th Web Search and Data Mining Conference WSDM 2023 (WSDM2023). https://dl.acm.org/doi/abs/10.1145/3539597.3573025

Neural Rankers for Effective Screening Prioritization in Medical Systematic Review Literature Search Long

Published in Australasian Document Computing Symposium (ADCS 2022, to appear), 2022

Abstract

Medical systematic reviews typically require that all the documents retrieved by a search are assessed. The reason is two-fold: the task aims for “total recall”; and documents retrieved using Boolean search are an unordered set and thus it is unclear how an assessor could examine only a subset. Screening prioritisation is the process of actually ranking the (unordered) set of retrieved documents, allowing assessors to begin the downstream processes of the systematic review creation earlier, leading to an earlier completion of the review; or even to avoid assesses documents ranked least relevant. Screening prioritisation requires an highly effective ranking methods. Pre-trained language models are the state-of-the-art on many IR tasks, but have not been applied to the specific task of systematic review screening prioritisation. In this paper, we apply several pre-trained language models on the systematic review document ranking task, both directly and fine-tuned. An empirical analysis compares how effective neural methods compare to traditional methods for this task. We also investigate different types of document representations for neural methods and their impact on ranking performance. Our results show that BERT-based rankers outperform the current state-of-the-art screening prioritisation methods. However, BERT rankers and existing methods can actually be complementary and thus further improvements may be achieved if used in conjunction.

Recommended citation: Shuai Wang and Harry Scells and Bevan Koopman and Guido Zuccon. 2022. Neural Rankers for Effective Screening Prioritization in Medical Systematic Review Literature Search. In Australasian Document Computing Symposium (ADCS 2022). https://ielab.io/publications/pdfs/shuai2022neuralsr.pdf

Automated MeSH Term Suggestion for Effective Query Formulation in Systematic Reviews Literature Search Journal

Published in Intelligent Systems with Applications (ISWA) Technology-Assisted Review Systems Special Issue, 2022

Abstract

Medical systematic review query formulation is a highly complex task done by trained information specialists. Complexity comes from the reliance on lengthy Boolean queries, which express a detailed research question. To aid query formulation, information specialists use a set of exemplar documents, called ‘seed studies’, prior to query formulation. Seed studies help verify the effectiveness of a query prior to the full assessment of retrieved studies. Beyond this use of seeds, specific IR methods can exploit seed studies for guiding both automatic query formulation and new retrieval models. One major limitation of work to date is that these methods exploit `pseudo seed studies’ through retrospective use of included studies (i.e., relevance assessments). However, we show pseudo seed studies are not representative of real seed studies used by information specialists. Hence, we provide a test collection with real world seed studies used to assist with the formulation of queries. To support our collection, we provide an analysis, previously not possible, on how seed studies impact retrieval and perform several experiments using seed-study based methods to compare the effectiveness of using seed studies versus pseudo seed studies. We make our test collection and the results of all of our experiments and analysis available at http://github.com/ielab/sysrev-seed-collection.High-quality medical systematic reviews require comprehensive literature searches to ensure the recommendations and outcomes are sufficiently reliable. Indeed, searching for relevant medical literature is a key phase in constructing systematic reviews and often involves domain (medical researchers) and search (information specialists) experts in developing the search queries. Queries in this context are highly complex, based on Boolean logic, include free-text terms and index terms from standardised terminologies (e.g., the Medical Subject Headings (MeSH) thesaurus), and are difficult and time-consuming to build. The use of MeSH terms, in particular, has been shown to improve the quality of the search results. However, identifying the correct MeSH terms to include in a query is difficult: information experts are often unfamiliar with the MeSH database and unsure about the appropriateness of MeSH terms for a query. Naturally, the full value of the MeSH terminology is often not fully exploited. This article investigates methods to suggest MeSH terms based on an initial Boolean query that includes only free-text terms. In this context, we devise lexical and pre-trained language models based methods. These methods promise to automatically identify highly effective MeSH terms for inclusion in a systematic review query. Our study contributes an empirical evaluation of several MeSH term suggestion methods. We further contribute an extensive analysis of MeSH term suggestions for each method and how these suggestions impact the effectiveness of Boolean queries.

Recommended citation: Shuai Wang and Harry Scells and Bevan Koopman and Guido Zuccon. 2022. Automated MeSH Term Suggestion for Effective Query Formulation in Systematic Reviews Literature Search. In Intelligent Systems with Applications (ISWA) Technology-Assisted Review Systems Special Issue. https://ielab.io/publications/pdfs/shuai2022meshjournal.pdf

To Interpolate or not to Interpolate: PRF, Dense and Sparse Retrievers Short

Published in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), 2022

Abstract

Current pre-trained language model approaches to information retrieval can be broadly divided into two categories: sparse retrievers (to which belong also non-neural approaches such as bag-of-words methods, e.g., BM25) and dense retrievers. Each of these categories appears to capture different characteristics of relevance. Previous work has investigated how relevance signals from sparse retrievers could be combined with those from dense retrievers via interpolation. Such interpolation would generally lead to higher retrieval effectiveness.

Recommended citation: Hang Li* and Shuai Wang* and Shengyao Zhuang and Ahmed Mourad and xueguang-ma and jimmy-lin and Guido Zuccon. 2022. To Interpolate or not to Interpolate: PRF, Dense and Sparse Retrievers. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022). https://ielab.io/files/li-sigir-2022-inter.pdf

From Little Things Big Things Grow: A Collection with Seed Studies for Medical Systematic Review Literature Search Resource

Published in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), 2022

Abstract

Recommended citation: Shuai Wang and Harry Scells and Justin Clark and Guido Zuccon and Bevan Koopman. 2022. From Little Things Big Things Grow: A Collection with Seed Studies for Medical Systematic Review Literature Search. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022). https://ielab.io/publications/pdfs/shuai2022seedcollection.pdf

SDR for Systematic Reviews: A Reproducibility Study Reproduce

Published in Proceedings of the 44th European Conference on Information Retrieval (ECIR 2022), 2022

Abstract

Screening or assessing studies is critical to the quality and outcomes of a systematic review. Typically, a Boolean query retrieves the set of studies to screen. As the set of studies retrieved is unordered, screening all retrieved studies is usually required for high-quality systematic reviews. Screening prioritisation, or in other words, ranking the set of studies, enables downstream activities of a systematic review to begin in parallel. We investigate a method that exploits seed studies – potentially relevant studies used to seed the query formulation process – for screening prioritisation. Our investigation aims to reproduce this method to determine if it is generalisable on recently published datasets and determine the impact of using multiple seed studies on effectiveness. We show that while we could reproduce the original methods, we could not replicate their results exactly. However, we believe this is due to minor differences in document pre-processing, not deficiencies with the original methodology. Our results also indicate that our reproduced screening prioritisation method, (1) is generalisable across datasets of similar and different topicality compared to the original implementation, (2) that when using multiple seed studies, the effectiveness of the method increases using our techniques to enable this, (3) and that the use of multiple seed studies produces more stable rankings compared to single seed studies. Finally, we make our implementation and results publicly available at the following URL: https://github.com/ielab/sdr

Recommended citation: Shuai Wang and Harry Scells and Ahmed Mourad and Guido Zuccon. 2022. SDR for Systematic Reviews: A Reproducibility Study. In Proceedings of the 44th European Conference on Information Retrieval (ECIR 2022). https://link.springer.com/chapter/10.1007/978-3-030-99736-6_46

MeSH Term Suggestion for Systematic Review Literature Search Long

Published in Australasian Document Computing Symposium (ADCS 2021), 2021

Abstract

High-quality medical systematic reviews require comprehensive literature searches to ensure the recommendations and outcomes are sufficiently reliable. Indeed, searching for relevant medical literature is a key phase in constructing systematic reviews and often involves domain (medical researchers) and search (information specialists) experts in developing the search queries. Queries in this context are highly complex, based on Boolean logic, include free-text terms and index terms from standardised terminologies (e.g., MeSH), and are difficult and time-consuming to build. The use of MeSH terms, in particular, has been shown to improve the quality of the search results. However, identifying the correct MeSH terms to include in a query is difficult: information experts are often unfamiliar with the MeSH database and unsure about the appropriateness of MeSH terms for a query. Naturally, the full value of the MeSH terminology is often not fully exploited.

Recommended citation: Shuai Wang and Hang Li and Harry Scells and Daniel Locke and Guido Zuccon. 2021. MeSH Term Suggestion for Systematic Review Literature Search. In Australasian Document Computing Symposium (ADCS 2021). https://dl.acm.org/doi/abs/10.1145/3503516.3503530

IELAB at TREC Deep Learning Track 2021 Notebook

Published in TREC 2021 Deep Learning Track, 2021

Abstract

Recommended citation: Shengyao Zhuang and Hang Li and Shuai Wang and Guido Zuccon. 2021. IELAB at TREC Deep Learning Track 2021. In TREC 2021 Deep Learning Track. https://ielab.io/publications/pdfs/arvin2021trecdl.pdf

BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval Short

Published in The Proceedings of the 2021 ACM SIGIR on International Conference on Theory of Information Retrieval (ICTIR 2021), 2021

Abstract

The integration of deep, pre-trained language models, such as BERT, into retrieval and ranking pipelines has shown to provide large effectiveness gains over traditional bag-of-words models in the passage retrieval task. However, the best setup for integrating such deep language models is still unclear.

Recommended citation: Shuai Wang and Shengyao Zhuang and Guido Zuccon. 2021. BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval. In The Proceedings of the 2021 ACM SIGIR on International Conference on Theory of Information Retrieval (ICTIR 2021). https://dl.acm.org/doi/abs/10.1145/3471158.3472233

Shuai Wang (Dylan)

Publications

Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition Reproduce

Abstract

Reassessing Large Language Model Boolean Query Generation for Systematic Reviews Reproduce

Abstract

2D Matryoshka Training for Information Retrieval Reproduce

Abstract

Corpus Subsampling: Estimating the Effectiveness of Neural Retrieval Models on Large Corpora Long

Abstract

Starbucks: Improved Training for 2D Matryoshka Embeddings Long

Abstract

Context Embeddings for Efficient Answer Generation in RAG Long

Abstract

BERGEN: A Benchmarking Library for Retrieval-Augmented Generation Resource

Abstract

Zero-shot Generative Large Language Models for Systematic Review Screening Automation Long

Abstract

Large Language Models for Stemming: Promises, Pitfalls and Failures Short

Abstract

Evaluating Generative Ad Hoc Information Retrieval Long

Abstract

FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation Resource

Abstract

ReSLLM: Large Language Models are Strong Resource Selectors for Federated Search Short

Abstract

Zero-shot Generative Large Language Models for Systematic Review Screening Automation Long

Abstract

Generating Natural Language Queries for More Effective Systematic Review Screening Prioritisation Long

Abstract

Can ChatGPT Write a Good Boolean Query for Systematic Review Literature Search? Long

Abstract

Balanced Topic Aware Sampling for Effective Dense Retriever: A Reproducibility Study Reproduce

Abstract

MeSH Suggester: A Library and System for MeSH Term Suggestion for Systematic Review Boolean Query Construction Short

Abstract

Neural Rankers for Effective Screening Prioritization in Medical Systematic Review Literature Search Long

Abstract

Automated MeSH Term Suggestion for Effective Query Formulation in Systematic Reviews Literature Search Journal

Abstract

To Interpolate or not to Interpolate: PRF, Dense and Sparse Retrievers Short

Abstract

From Little Things Big Things Grow: A Collection with Seed Studies for Medical Systematic Review Literature Search Resource

Abstract

SDR for Systematic Reviews: A Reproducibility Study Reproduce

Abstract

MeSH Term Suggestion for Systematic Review Literature Search Long

Abstract

IELAB at TREC Deep Learning Track 2021 Notebook

Abstract

BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval Short

Abstract