Prompting

If accuracy is important to the task, incoprorate verification helper steps in the prompt or subsequent prompts, so that the LLM is instructed to 'show its working' with direct quotes and perhaps a reflective statement explaining its response. This makes it easier for you to verify.

Along with context length constraints, attending to prompt quality is the other major consideration for LLM output quality. Dr Mark Carrigan, in a 2023 LSE Blog article, made the following observation which applies to many who are new to LLMs:

“I have noticed a tendency for critical scholars to share examples of how their prompts elicited an underwhelming or superficial reaction from ChatGPT. The uniform feature of these examples was that little thought had gone into the prompt: they were extremely brief, failed to define the context of the request, provided no sense of the result they were expecting and certainly did not provide examples.”

Superficial engagement with generative AI masks its potential contribution as an academic interlocuter

There’s an interesting contradiction here in that those who are disappointed with the quality of outputs also by implication assume the model is vastly more powerful than it is, since they expect outstanding answers from a brief prompt. Every single word in a prompt will impact the response. The length, style, tone and diction you choose all have a significant effect. All the versatility and nuance of the English language is at the user's disposal to help improve output quality, and advantage should be taken. As writing quality is an essential skill for academic work, researchers theoretically have an advantage to reap the benefits of generative AI.

Detailed, clear and explicit prompts along with experimentation and refinement bring substantial dividends. A common error is not giving enough contextual detail (even more important since the introduction of 'reasoner' models), so a useful starting approach is to imagine a brand new colleague on their first day. You would need to give them a lot more background info, explain local jargon, pre-empt typical areas for confusion a new starter might have and so on.

Here are the more well-established tips on effective prompting for non-reasoner models:

GAIR_Prompting_Guide

Note on ‘Prompt Engineering Tricks’: Many of the above approaches involve elaborate prompt ‘hacks’ such as pretending to be an expert or adding lines like ‘Think step by step.’ With the emergence of modern 'reasoning' architectures (beginning with OpenAI’s o series), the model already generates simulated chain-of-thought reasoning internally. In most cases, simply providing thorough context, explicit instructions, and clarity of desired output format with examples yields the best results. Overly ornate or ‘hacky’ prompts can confuse them so concise clarity is preferred.

Zero-Shot, One-Shot, and Few-Shot Prompting

Advanced LLMs learned from examples and as a result they are typically more adept at producing quality outputs based on examples of desired outputs alongside instructions, than from instructions alone. A prompt with instructions only, the most common way most use LLMs, is known as 'zero shot', and relies overwhelmingly on its training data correlations. The more it relies on training data alone, the more likely it is to 'hallucinate' (more on that below).

One-shot prompting involves the user providing a single example of the desired output format or content alongside the instructions, e.g.:

"You are a concise summariser of sociological theories for a textbook glossary. I will be giving you examples of theories and you should summarise them faithfully in no more than a single sentence. For example, if I were to ask for a summary of functionalism, you would response with "A theory that views society as a complex system of interconnected parts working together to maintain stability". Now please provide summaries for the following: 1. Conflict Theory; 2. etc."

Few-shot prompting extends this by providing multiple examples, which may be helpful for comparative or nuanced language tasks, for example to help with thematic coding of qualitative data:

"Quote 1: 'I feel anxious when I have to speak in public.' Code: Social anxiety;

Quote 2: 'I prefer working alone because I can focus better.' Code: Introversion

Quote 3: 'I often find myself worrying about things that might go wrong in the future'. Code: Generalised anxiety."

The best advice however is to experiment and note patterns where outputs fail to meet expectations, and take note for future prompts. Saving your personal prompt templates for common tasks is strongly recommended, as is taking advantage of 'custom instructions' (aka the 'system prompt') where available. For example below is a template that incorporates reflective thinking through multiple personas, designed to get a higher quality result each time because the model is given the opportunity to reflect, critique and adapt and thereby 'think before it speaks'.

Example of a personal template prompt designed for higher quality outputs

"Agent 1 Prompt: You are an expert in academic research. Based on the requested output in [prompt_detail], provide an accurate, grounded response based firmly in the text provided in [provided_text], to identify practical research project ideas emerging from the paper. Important: You must provide direct quotes from the paper to support your suggestions, so it's easy to verify.

After Agent 1's response comes in:

Agent 2 Prompt: You are an expert in academic research who is obsessed with accuracy and evidence for any claims made. Your task is to evaluate the accuracy of the output provided by another LLM agent from the provided academic text. Assess how well the output aligns with the paper's content and whether the direct quotes provided accurately support the claims.

Initial Prompt: [prompt1_text]

Paper Text: [provided_text]

First LLM Response: [agent1_text]

Evaluate the accuracy of the research project ideas and the relevance of the direct quotes provided. Highlight any discrepancies or areas where the response could be improved.

After Agent 2's response comes in:

Agent 3 Prompt: You are an expert in academic research. Your task is to provide detailed suggestions on how to improve the accuracy and overall quality of the output from another LLM to a user's prompt. Review the initial prompt, the paper text, the first LLM response, and the accuracy evaluation from the second LLM response. Provide constructive feedback on how to enhance the accuracy, clarity, and value of the response.

Initial Prompt: [prompt1_text]

Paper Text: [provided_text]

First LLM Response: [agent1_text]

Accuracy Evaluation: [agent2_text]

Provide detailed suggestions on improving both the accuracy and overall quality of the output.

After Agent 3's response comes in:

Agent 4 Prompt: You are an expert in academic research. Your task is to re-answer the initial prompt, taking into account all previous responses and evaluations. Use the initial prompt, the provided paper text, the first LLM response, the accuracy evaluation, and the quality improvement suggestions from the other LLM responses, in order to create the best quality and most well-informed answer possible.

Initial Prompt: [prompt1_text]

Paper Text: [provided_text]

First LLM Response: [agent1_text]

Accuracy Evaluation: [agent2_text]

Quality Improvement Suggestions: [agent3_text]

Based on the provided information, re-answer the initial prompt to extract practical research project ideas from the academic paper, ensuring the highest level of accuracy, clarity, and value."

Hallucinations

As of 2025, generative AI tools even when connected to browsers, search engines and databases, are not good enough to rely on as information sources. Everything where accuracy is important needs to be verified independently. In many cases this is fine if you just need a starting point for your own investigation. In others the amount of time needed is such that you’re better off just doing it directly yourself. Searching, identifying and evaluating information is a vast, complex, iterative and rigorous process with which most humans struggle. Current LLMs are not capable of doing so, certainly in any reasonable timeframe or without enormous initial effort for a controlled pre-designed workflow.

Why do LLMS ‘hallucinate’? The fact that people are surprised that LLMs output errors is a combination of not understanding what an LLM actually does, combined with the fact that it is so often correct, which understandably makes novices assume it is somehow ‘looking up information’. Pure LLMs are not machines that produce deterministic outputs based on data. In the simplest terms, they predict words (tokens really) based on probabilities of word associations in the trillions of words in the training data along with their (increasingly well-) developed world model conceptual understanding. By default they mix things up so that it’s not always the most high probability word next. This is the ‘temperature’ setting available in the API which is sometimes referred to as how ‘creative’ outputs are, although predictable might be a better word. Novel artificially generated content cannot exist without hallucinations, indeed there would be no Chat GPT if it wasn't capable of hallucinating. It's a feature, not a bug.

The key thing to remember is that LLMs should not be thought of as fact-finding oracles. They produce plausible looking text because they are excellent simulators of human text. The fact that (mostly) word association probabilities result in correct information so often is astounding in itself, but also dangerous if people assume it’s actually 'finding' information in some sense. This assumption is common among students and novices to LLM generally who initially perceive a tool like Chat GPT as a search engine on steroids.

Another important factor is that the top LLMs have been trained on public internet texts. Consider the landscape of Q&A style websites like Quora, Twitter, Reddit, Facebook, any specific discussion forum etc. How often do you see people putting in the effort to post “I don’t know”? Almost never. So the training data vastly overrepresents text from humans who are confident enough to answer publicly, but neither their confidence nor the content of their answer is a reliable predictor of accuracy. LLMs do extremely well in simulating that kind of confident answer, but you can never assume it’s correct without additional independent verification. A further concern is that, as expected for anything trained on human-produced data, underlying biases may come through any AI-generated ‘information’, so critical engagement is always needed.

Give it the knowledge it needs in the prompt

Pasting the knowledge into the prompt context (known as in-context learning, differentiated from lower quality RAG systems) gets by far the best quality results, with the only downside being context length constraints. Chunking the knowledge and tasks up beforehand to stay comfortably within the context limit is strongly recommended for higher quality outputs.

Making verification simpler

Whenever it’s important to have the right information, it needs to be verified because there's always the possibility of inaccuracy. But you can adapt the prompt such that it makes the verification process simpler, or at least far less painful. Ultimately this can be thought of as instructing the LLM to ‘show its working’ - this does NOT mean it's a legitimate causal representation of its 'working', since it's all ultimately a simulation, the goal is to make your life easier when double checking.

Here's an example where an LLM had to find matches or near matches to set keywords based on any given text. Given the propensity to hallucinate, the instructions require adding 2 additional columns to the output in addition to the matches it identified. Firstly, a direct quote is mandatory, which means the user can more easily do a search using Ctrl-F on the original text to check if the classification was correct. It would be much harder to verify without a direct quote. Secondly, a reflection column is required where the LLM has to rationalise its decision on identifying the match. This isn't to be thought of as a true reflection of its 'thoughts'; it's simply designed to maie it easier to see at a glance and help check the original source and either verify, amend or remove entirely.

“Please read the following text and identify any words, phrases or sentences that match or relate fairly closely to the above core keywords. Please use your intelligent judgement to identify matches - close enough matches are totally fine if you find any examples of text that are more or less similar in meaning to any of the original keywords.

The outputs needed are:

1) a direct quote (sentence or phrase) from the text that surrounds the match or near match,

2) the matching keyword from the original key word list, and

3) Reflective notes – these should always be included, so if you find a perfect match, add a brief note indicating as such.

If you find a borderline or ambiguous ‘close enough’ match, please add a suitable note indicating the uncertainty. If you find nothing at all relating to any of the keywords, please also clarify with the reflective note. It is preferable to identify false positives because I can delete irrelevant ones easily later after seeing your clear direct quote and notes, whereas I don’t want to go through every text to double check in case you missed a borderline match. The goal is identify every possible keyword match even if vaguely related, and show your working with quotes and brief reflective notes, in order to make it as easy as possible for me to look up and verify your outputs."

While these steps mean more effort and using up more context length, if the task is one where accuracy is important and therefore verification is inevitable, it’s worth doing because the effort of verifying without the additional help can often take more time and lead to frustration.

Additional guidance on prompting (includes code/API examples)

Microsoft Copilot Intro to creating effective prompts

Anthropic guide to prompting

Google whitepaper on prompt engineering

Open AI Prompt Engineering guide