GAIR_Intro_Banner

Introduction to Generative AI for researchers

A useful way to think about LLMs is as a well-read, tireless and eager-to-please ‘intern’ who often misunderstands and gets things wrong and so needs explicit guidance, steering and examples.

Generative AI (GAI) in its broadest sense refers to AI systems that create new content, predominantly text but also images, audio and video, based on users' natural language prompts. Before 2022, the most impressive AIs, at least as far as widespread public awareness is concerned, were narrow in both scope and value, e.g. AlphaGo or IBM Watson winning Jeopardy!. Most research or work-related examples were in the domains of prediction, classification and basic natural language processing, such as sentiment analysis or chatbots informed by a specific, curated knowledge base. Narrow but extremely effective AI 'recommendation engines' proliferated in the corporate and social media spheres. The idea of a more generalised artificial intelligence was still mostly theoretical until a major advance that combined neural networks (in particular the transformer architecture) with colossal investment in computational hardware and data. This led to the development of GPT (Generative Pre-trained Transformer), with OpenAI's GPT-3 powering the first chatbot based on a Large Language Model that was capable of convincingly human-like textual interactions, including entirely novel language outputs.

Large Language Models

By far the most prominent form of GAI now, and likely in future given the dominance of language in most human interactions we would associate with any kind of intelligence (and certainly for the majority of scholarly research), is the Large Language Model (LLM). LLMs are fairly simple to grasp: for any given text, what is the likely text that would follow it, given the word associations in the training dataset? This Medium article illustrates the basic concept (though in an extremely simplified way):

GAIR_Token_Prediction

GPT-4 has been trained on almost all the public text on the internet (speculated to be in the trillions of words), and its objective has been to predict the next sequence of words as well as it can. To achieve effective prediction, it has to somehow internalise a model of the human world, which confers an advanced, though still limited and clearly not human, understanding ability, simply from the correlations in the text it has been trained on. In a sense, the text itself can be considered a limited projection of the world and, by implication, of humanity. A fascinating report published by Anthropic in 2024, entitled "Mapping the mind of a large language model", offers promising early insights into how LLMs create their own internal abstractions which 'fire' in relevant sections of text, e.g.:

GAIR_Anthropic_LLM_Abstractions

While these insights represent exciting early steps (and are critically important for long-term human-AI value alignment), advanced LLMs are still much too complex to provide the kind of explainability or audit trail most early AI regulations require. The fundamentally stochastic and unpredictable nature of LLM outputs also means strict research reproducibility cannot be achieved outside of extremely simple classification tasks.
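To make the underlying next-word-prediction idea concrete, here is a deliberately toy Python sketch using a made-up three-sentence 'training corpus': count which word tends to follow which, then pick the most likely continuation. Real LLMs use billions of learned parameters and attention over long contexts rather than a simple frequency table, so this illustrates the concept only.

from collections import Counter, defaultdict

# Toy 'training data' - real models are trained on trillions of words.
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count how often each word follows each preceding word (a bigram table).
next_word_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    next_word_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the most likely next word given the word associations seen in training."""
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))   # 'cat' - it followed 'the' most often in this tiny corpus
print(predict_next("sat"))   # 'on'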

A useful way to think about LLMs is to treat them as a well-read, confident and eager-to-please 'intern' who often misunderstands and gets things wrong. They are not deterministic machines. What they excel at is producing plausible-looking text, which is a remarkable technological breakthrough in itself (and which has necessitated moving the goalposts on the Turing Test), but it doesn't even begin to constitute a valid or reliable source of knowledge. Plausible-looking text outputs are often correct, so it's forgivable if people accept them at face value. But LLMs can and do produce bizarrely incongruous 'hallucinations', where they confidently output entirely incorrect assertions. There's nothing surprising about this, since an LLM is just simulating human text, but because its statistical models produce outputs that are correct most of the time, it can be jarring when it makes such trivial factual errors.
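The non-determinism mentioned above comes from sampling: rather than always picking the single most likely next token, chat models typically sample from a probability distribution, with a 'temperature' setting controlling how adventurous that sampling is. The following is a minimal sketch with made-up token scores (not any vendor's actual API), just to show why two identical prompts can yield different outputs:

import math
import random

# Made-up scores ('logits') for four candidate next tokens.
logits = {"Paris": 4.0, "London": 2.5, "Berlin": 2.0, "banana": 0.1}

def sample_next_token(logits, temperature=1.0):
    """Convert scores to probabilities (softmax) and sample one token."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}  # subtract max for numerical stability
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Low temperature: almost always the top choice. Higher temperature: more variety,
# and occasionally an implausible token - loosely analogous to a 'hallucination'.
print([sample_next_token(logits, temperature=0.2) for _ in range(5)])
print([sample_next_token(logits, temperature=1.5) for _ in range(5)])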

Potential future directions

Current conversations about the future trajectory of LLMs recognise the inherent deficits of training on vast amounts of messy, human-produced text and simulating 'answers' based on that source, as well as the hugely constraining impact of maximum context length windows. While scaling up to training on trillions of words is the number one reason LLMs are as good as they are now, efforts are being made to use GAI to generate synthetic training data, iteratively evaluated, to improve the quality of the underlying data by reducing contamination from organically generated human errors; the hope is increased reliability, accuracy and overall quality of outputs. A paper entitled "Textbooks are all you need" (Gunasekar et al. 2023) showcased impressive results given the tiny training corpus and modest size of the model (1.3bn parameters, compared to GPT-4 which, while not officially confirmed, is said to have over a trillion). 'Garbage in, garbage out' has always been true, and it's certainly the case for LLMs trained on almost the entirety of public human-produced text.

While LLMs are unlikely to be the entire basis for a future superintelligent AGI (artificial general intelligence), there's still a lot more potential that can be gleaned from them. The guidance on prompting mentions the improved performance gained by incorporating reflection ('think step by step' or 'critique the previous response from persona X and suggest improvements') or by grounding interactions in specific knowledge (retrieval-augmented generation or RAG, using search to identify relevant information to answer a query). Currently this requires significant effort and is slow and manual; work is being undertaken to create models that effectively 'think before they speak', which would inevitably be slower than current pure-prediction chatbots but hopefully quicker than manually forcing the reflection and reasoning through repeated prompts. Recent hardware innovations like the Groq chip, which is specifically designed for LLMs, display extraordinary speed improvements (this X/Twitter video shows just how fast it already is), so combining that speed with reflection stages and multiple personas has a lot of potential for improved quality.
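As a rough illustration of what 'forcing the reflection' looks like in practice today, the sketch below chains three prompts: draft, critique from a named persona, then revise. The call_llm function here is a placeholder stub (it just echoes the prompt) standing in for whichever chatbot or API you actually use; the point is the structure of the loop, not any specific product.

def call_llm(prompt: str) -> str:
    """Placeholder for a real chatbot or API call - here it just echoes the prompt."""
    return f"[model response to: {prompt[:60]}...]"

question = "Summarise the main limitations of survey-based measures of wellbeing."

# Step 1: first draft.
draft = call_llm(f"Answer the following, thinking step by step:\n{question}")

# Step 2: ask for a critique from a specific persona.
critique = call_llm(
    "You are a sceptical peer reviewer in quantitative social science. "
    f"Critique this draft and list concrete improvements:\n{draft}"
)

# Step 3: revise the draft in light of the critique.
revised = call_llm(
    "Rewrite the draft below, addressing every point in the critique.\n"
    f"Draft:\n{draft}\n\nCritique:\n{critique}"
)

print(revised)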

There are also early, though very limited and unreliable, manifestations of what some refer to as 'agent models' that are integrated with software systems and data, such that a 'lead' or 'orchestrator' LLM directs other LLMs: one to look something up from a document, the web or a database, another to interpret and report back, another 'validator' LLM to review, and one to store the new information somewhere, after which the lead LLM (or the 'human in the loop') can review and decide on further actions. It's viable in principle to have a collection of agents with an orchestrator run continually to monitor stock prices and company news, review charts and initiate buy/sell orders based on pre-defined rules. But given the inherent problems with LLMs (and the fact that unexpected software and connection failures happen frequently and an LLM won't know how to handle them), this idea should not be attempted by anybody beyond a fun small-scale experiment. As of February 2024, one of the more capable multi-agent systems is Microsoft's AutoGen, though it requires significant coding expertise and guardrails given the potential for LLMs to go off on tangents or get stuck in infinite loops, which means continual human-in-the-loop systems are likely to be the default for some time.

GAIR_LLM_Agent_Architecture
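As a hedged, highly simplified sketch of the orchestrator pattern described above (again with a stubbed call_llm standing in for real models, and no real data connections), an orchestrator can be little more than a loop that routes sub-tasks to specialised 'roles' and keeps a human in the loop before anything consequential happens:

def call_llm(role: str, task: str) -> str:
    """Stub for a role-conditioned model call (orchestrator, retriever, analyst, validator...)."""
    return f"[{role} output for: {task}]"

def orchestrate(goal: str) -> dict:
    results = {}
    # Lead LLM breaks the goal into sub-tasks for worker 'agents'.
    results["plan"] = call_llm("orchestrator", f"Break this goal into sub-tasks: {goal}")
    results["lookup"] = call_llm("retriever", "Find the relevant figures in the document or database")
    results["analysis"] = call_llm("analyst", f"Interpret and report back on: {results['lookup']}")
    results["review"] = call_llm("validator", f"Check this analysis for errors: {results['analysis']}")
    return results

results = orchestrate("Summarise last quarter's survey results")
for step, output in results.items():
    print(step, "->", output)

# Human in the loop: nothing is stored or acted on until a person approves.
if input("Approve and store these results? (y/n) ").lower() == "y":
    print("Stored.")  # in a real system, a storage agent would act only after approval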

The other areas being actively developed relate to increased input context lengths (the latest Claude models (Claude 3 Opus and Claude 3.5 Sonnet) show impressive quality with context lengths of 200,000 tokens, and Google's Gemini 1.5 test results released on 15th Feb 2024 suggest game-changing 'needle in a haystack' retrieval capabilities up to a staggering 1 million tokens), and improved reasoning using the 'mixture of experts' approach, in which multiple narrower, more specialised models work on specific sub-tasks.
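'Needle in a haystack' testing simply means planting a small, specific fact somewhere inside a very long input and checking whether the model can retrieve it when asked. A toy version of such a test harness (with a stubbed ask_llm; a real test would send the full document to a long-context model and check its answer) might look like this:

import random

def ask_llm(document: str, question: str) -> str:
    """Stub: always fails here; a real test sends the whole document to a long-context model."""
    return "[model answer]"

# Build a long 'haystack' of filler text and hide one 'needle' fact inside it.
filler = "Survey responses were collected across several regions over two waves. " * 5000
needle = "The project codename is BLUEBIRD."
position = random.randint(0, len(filler))
document = filler[:position] + needle + filler[position:]

answer = ask_llm(document, "What is the project codename?")
print("Retrieved correctly:", "BLUEBIRD" in answer)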

Other examples of Generative AI

The most capable non-text-based GAI technology available to the general public is for image generation and image interpretation / analysis, with applications like DALL-E 3 (available in Microsoft Copilot, formerly known as Bing Chat, as well as ChatGPT Plus / Team) and Midjourney currently being the best quality models that are widely available and easy to use.

DALL-E is generally the best at following explicit instructions (though still far from perfect) but without especially high image quality; Midjourney v6 is the best at generating extremely high quality images but with a very tenuous link to instructions; and Stable Diffusion is an open-source and more complex application that is capable of producing remarkably high quality images with the right fine-tuning. As far as social science research value is concerned, image generation AI is mostly limited to enhancing blogs, presentations or other knowledge exchange communications. It's worth reminding researchers of the earlier caution around ongoing legal cases regarding intellectual property and how AI models, including image generation models, were trained on public human-created content. Currently, the vast majority of AI image generation is in the realm of art, rather than, say, technical diagrams. That said, for diagrams which are programmatically generated, the more advanced LLMs are capable of producing accurate flowcharts and infographics. Here are some examples using the Claude 3.5 Sonnet 'artifacts' (interactive web output display) feature:

Example flow diagram generated by Claude Sonnet 3.5 based on the above section on history and current state of GAI:

(right click and open in new tab for higher resolution)
GAIR_Mermaid_Diagram

Example infographic generated by Claude Sonnet 3.5 based on the above section on potential future developments: GAIR_Future_GAI

Video generation accessible to the public is still very limited, though in Feb 2024 OpenAI shared results from their new AI video generation model Sora, which has far exceeded expectations. One touted but unverified potential breakthrough with Sora is less about quality video generation and more about its ability to model realistic (though not perfect) physics, based on a general understanding of the physical world gained from its video training data. This has the potential to be revolutionary for conducting scientific experiments via simulation, including, in the longer term, AI-powered robotics. There could also be scope in future for academic papers to serve as natural language prompts that inform AI-generated video 'explainers' for knowledge exchange, but until technical visuals (rather than artistic ones) are possible it would just be a starting point for productivity.

While synthesised speech from text content has been around for a while with limited quality, the ability of generative AI to create original speech with realistic and emotive voices (currently ElevenLabs are the leader in this field, though in March 2024 OpenAI released early results of a highly advanced model including troublingly realistic voice cloning) could be very valuable: for instance, summarising an academic article with a realistic human voice, having a real-time verbal conversation about an academic article, or creating experimental simulations of focus groups to better inform question design.

While there have been substantial breakthroughs in AI music generation in April 2024, with Suno AI and Udio being the current best-in-class platforms, other than helping with knowledge exchange and impact (KEI) video explainers it's not obvious whether music generation would be particularly useful for social science research.

Demos from May 2024 from OpenAI (GPT-4o, the 'o' standing for 'omni') and Google's Project Astra showcase impressive real-time, realistic multimodality, including native voice input and output, live video input (based on repeated timed screenshots) and nuanced interpretation of affect in voice and even facial expressions, which can dramatically change how AI can be used for everyday tasks as well as education. In the social sciences, the potential for having an AI 'see what you can see' and provide commentary and input at the same time may provide research value, but gaining consent from human subjects may prove difficult. Many people will find the idea of AI analysing their emotions from their face and voice in real time uncomfortable. Resistance could even extend to more benign visual analysis, such as human movement in urban settings to inform space planning. There is significant potential for researching revealed preferences through visually analysing actual behaviour in humans, but it's difficult to imagine examples beyond highly controlled settings where participants are fully aware and consent, which of course may 'contaminate' the validity of the data given humans behave differently when they know they're being observed.

High level overview of current vs future value of generative AI for academic research

The table below lists broad categories where LLMs specifically can be useful to support research, along with ratings out of 5 for the value GAI can provide in that category. It distinguishes between current (June 2024) value and potential future value, which may include better integration with data and software.

Important notes:

- Scores assume the 'best result' from high quality prompting, including chain-of-thought reasoning with multiple personas as well as 'few shot' examples and, where relevant, grounded curated data as 'knowledge' for the LLM (see the illustrative prompt sketch below the table).

- Value scores solely conceive of the LLM as a 'copilot' to support research processes and enhance productivity. We're still a way from AI being able to design and conduct substantive academic research independently (though some fascinating limited pilots have already been conducted: AI Coscientist automates scientific discovery).

- Value scores relate solely to the domain of academic research; for mass-appeal blog articles, the 'drafting from scratch' current value score would be much higher, for instance.

- This table makes no assumptions about risk or ethics, nor does it represent formal policy recommendations; this is purely about capabilities.

 GIAR_Current_Future_Value
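As a concrete, simplified illustration of the first note above, the sketch below assembles a single prompt combining a persona, a chain-of-thought instruction, 'few shot' examples and a snippet of curated 'knowledge'. The example items and guidance text are invented placeholders; the point is the prompt structure rather than the content.

persona = "You are a meticulous research assistant in quantitative social science."

few_shot_examples = [
    {"input": "Rate the clarity of: 'How often do you feel stressed?'",
     "output": "Clear (4/5): single construct, but 'often' is ambiguous without a timeframe."},
    {"input": "Rate the clarity of: 'Do you like your job and your manager?'",
     "output": "Unclear (2/5): double-barrelled question; split into two items."},
]

knowledge = "Curated guidance: avoid double-barrelled items; always specify a recall period."

task = "Rate the clarity of: 'Have you recently felt unable to concentrate?'"

prompt = persona + "\nThink step by step before giving a rating.\n\n"
prompt += knowledge + "\n\nExamples:\n"
for ex in few_shot_examples:
    prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
prompt += f"Input: {task}\nOutput:"

print(prompt)  # this assembled prompt would then be sent to the LLM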

UNESCO decision tree on when it's viable to use ChatGPT

UNESCO's Quick Start Guide on ChatGPT and Artificial Intelligence in Higher Education has a useful diagram explaining, at the most abstract level, when it's 'safe' to use ChatGPT, which as of May 2024 is still applicable given the inherent constraints regarding the low reliability of seemingly 'factual' outputs:

 GAIR_UNESCO_Chart

As integration with dedicated tools (e.g. Python and other programming languages, Wolfram Alpha, Zapier, web browsers etc.) and data sources improves, the value of LLMs can be enhanced in ways that mitigate their deficits in accuracy and reliability, such that they are forced to cite specific, verifiable information in their outputs. Until this integration improves to a sufficiently advanced and efficient level and is fully accessible, LLMs are generally best avoided as a standalone tool if accurate information is required. No piece of 'information' in LLM outputs can be trusted without verification. In many cases it won't be worth the effort of using GAI for 'information' at all, since you'll spend so much time verifying its accuracy as well as identifying information it hasn't included. Far more value can be achieved via its use as an assistant for limited language tasks.
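One simple pattern for 'forcing' verifiable citations is to retrieve numbered passages yourself, instruct the model to answer only from those passages and cite their IDs, then check that every cited ID actually exists. The sketch below uses invented passages and a stubbed call_llm; a real pipeline would use a proper search index over your own documents.

import re

def call_llm(prompt: str) -> str:
    """Stub for a real model call; returns a fixed example answer for illustration."""
    return "Response rates fell in wave 2 [2]."

# Curated, numbered source passages (invented for illustration).
sources = {
    1: "Wave 1 of the survey achieved a 62% response rate.",
    2: "Wave 2 response rates dropped to 48%, largely among younger respondents.",
}

context = "\n".join(f"[{sid}] {text}" for sid, text in sources.items())
prompt = (
    "Answer using ONLY the numbered sources below, citing the source ID in square "
    "brackets after every claim. If the sources do not contain the answer, say so.\n\n"
    f"{context}\n\nQuestion: What happened to response rates?"
)

answer = call_llm(prompt)

# Verify that every citation refers to a real source before trusting the output.
cited_ids = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
print("All citations verifiable:", cited_ids <= set(sources))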

Recommended Reading

Bail, C. A. (2023). Can generative AI improve social science? (Pre-print).

Burger, B., Kanbach, D. K., Kraus, S., Breier, M., & Corvello, V. (2023). On the use of AI-based tools like ChatGPT to support management research. European Journal of Innovation Management, 26(7), 233-241.

Dwivedi, Y. K., Kshetri, N., Hughes, L., Slade, E. L., Jeyaraj, A., Kar, A. K., Baabdullah, A. M., Koohang, A., Raghavan, V., Ahuja, M., & Albanna, H. (2023). “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. International Journal of Information Management, 71.

Korinek, A. (2023). Generative AI for economic research: Use cases and implications for economists. Journal of Economic Literature, 61(4), 1281-1317.

Lenhard, W., & Lenhard, A. (2023). Beyond human boundaries: Exploring the proficiency of AI technology and its potential in psychometric test construction. (Pre-print).

Manning, B. S., Zhu, K., & Horton, J. J. (2024). Automated social science: Language models as scientist and subjects. (working paper). Massachusetts Institute of Technology and Harvard University.

Pack, A., & Maloney, J. (2023). Using generative artificial intelligence for language education research: Insights from using OpenAI's ChatGPT. TESOL Journal, 57, 1571-1582.

Rahman, M., Terano, H. J. R., Rahman, N., Salamzadeh, A., & Rahaman, S. (2023). ChatGPT and academic research: A review and recommendations based on practical examples. Journal of Education, Management and Development Studies, 3(1), 1-12. https://doi.org/10.52631/jemds.v3i1.175.

Watkins, R. (2023). Guidance for researchers and peer-reviewers on the ethical use of large language models (LLMs) in scientific research workflows. AI and Ethics, 1-6.