Text and data mining (TDM) guidance

Learn about what you can do within copyright working with large amounts of data.

Text and data mining (TDM) and copyright

Text and data mining (TDM) is the practice of using technology to scan large amounts of data or text in order to identify themes, issues and correlations across different publications. UK copyright law permits this for research conducted for a non-commercial purpose (s.29A, CDPA). As long as the works are accessed lawfully (for example, via a subscription or an open access route), researchers at LSE are permitted to perform TDM.
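As a rough illustration of what TDM can look like in practice, the Python sketch below counts how many documents each term appears in, which is one simple way to surface themes shared across publications. The function names and the sample texts are hypothetical and used only for illustration; real projects typically use larger corpora and more sophisticated methods.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(documents):
    """Count, for each term, how many documents it appears in."""
    counts = Counter()
    for text in documents:
        # A set ensures each term is counted at most once per document.
        counts.update(set(tokenize(text)))
    return counts


# Hypothetical sample texts standing in for lawfully accessed publications.
sample_documents = [
    "Copyright law permits text mining for research.",
    "Text mining identifies themes across publications.",
    "Researchers study copyright and data.",
]

df = document_frequency(sample_documents)

# Terms that appear in more than one document hint at shared themes.
shared_terms = sorted(term for term, n in df.items() if n > 1)
```

Here `shared_terms` holds the terms found in at least two of the sample texts. The output is a set of counts rather than a copy of the source texts.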

Whilst the law enabling TDM cannot be overridden by contract, it is important to be aware that technical restrictions may prevent text and data mining; researchers should not try to circumvent these measures. We recommend that researchers who wish to perform text and data mining on databases provided by LSE Library contact the library in advance. We will liaise with the publisher and arrange for the text and data mining to take place. To make a request for text and data mining, please contact us.

Where copies have been downloaded for TDM purposes, they cannot be shared with or transferred to anyone else, or used for any purpose other than that covered by the exception.

Copyright and outputs

Outputs which consist of facts or data can be freely shared, as these are not covered by copyright. However, copyright will apply when data have been interpreted, selected and arranged and the resulting work demonstrates intellectual effort. Database rights, another type of intellectual property protection, apply to the contents of a database.

For TDM outputs that include third-party copyright material, the quotation exception (s.30, CDPA) allows sharing with individuals not involved in the original research under certain conditions:

  • The original work must have previously been made available to the public.
  • The amount of the work that is shared must be no more than is necessary for the intended purpose. It must also satisfy fair dealing requirements.
  • Wherever possible, the quotation should be accompanied by sufficient acknowledgement.

Publishing and archiving TDM based research

The copyright exception which allows TDM only covers the analysis period. After this point, researchers will need to rely on other copyright exceptions or an appropriate licence from the data owner. If the materials being mined are being used under a licence or subscription agreement, it is important to follow any data retention and re-use clauses in the terms and conditions. If the data cannot legally be stored beyond the analysis stage, it is important to document both the data sources and the methodology used, so that other researchers can reproduce and validate your results.
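One possible way to document data sources and methodology without keeping the mined content is to record a short manifest for each project. The Python sketch below is illustrative only: the field names, the example source and the choice to record a SHA-256 checksum are assumptions, not a required or endorsed format, and any licence terms on what may be retained still apply.

```python
import hashlib
import json


def describe_source(name, access_route, content):
    """Build a provenance entry for one source without storing its content."""
    return {
        "name": name,
        "access_route": access_route,
        # The checksum lets others confirm they obtained the same file.
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
    }


def build_manifest(sources, methodology):
    """Combine the source entries with a plain-language method description."""
    return {"methodology": methodology, "sources": sources}


# Hypothetical example values; replace with the details of your own project.
content = b"placeholder full text"
entry = describe_source("Example Journal article", "library subscription", content)
manifest = build_manifest([entry], "Term frequency counts over lowercased tokens")

# Serialise the manifest so it can be archived alongside the research outputs.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

The resulting `manifest_json` describes where the data came from and how it was analysed, but does not contain the source text itself.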

TDM, Artificial Intelligence and Library resources

Inputting licensed content from library subscriptions into GenAI tools can potentially be interpreted as permissible computational analysis. However, if the tool retains copies of the inputted material, this is likely to be interpreted as infringing copyright because the material would be accessible to others not covered by the TDM exception. When using GenAI with library licensed content, ensure the tool does not store inputs or use them as training data. If carrying out TDM on material licensed under Creative Commons, check the terms of the licence to ensure compliance.

This advice is based on JISC’s Guidance on resisting restrictive AI clauses in licences.