Our tests of summarisation tools show some encouraging results but point to significant challenges if the end goal is a fully automated chain from evergreen text to an audience-facing summary. The extractive approach of AI-driven summarisation tools assigns high relevance to the initial parts of the underlying articles, such as titles, leads and first sentences, depending on the content on which the tools were trained.
It is evident that the tested models are optimised for and trained on standard news articles, which results in a lower tolerance for unusual, creative formats and for other genres such as features and listicles. The inverted pyramid news style is treated as the norm. The ideal text is factual, concise and has an opening that is clear and logically related to the headline, at least to produce a good result with Agolo's model.
Moreover, the tools generally had significant problems summarising extensive articles, no matter the genre. The tested tools are very impressive at summarising the short and straightforward but struggle with the long and creative. It is important to note, however, that the tool-makers do not claim that their products will work on the longer and more innovative formats that often characterise the evergreens used in this study. We can well envisage tools becoming much better at summarising evergreens if they are provided with a relevant training set. But as the tools work best on clear and homogenous text structures, it could well be that existing formats are cemented by integrating a summarisation tool in the newsroom workflow, thus impeding creative experimentation.
Generally, the speakable summaries (of up to 50 words) rated significantly higher in our tests than the bullet points. Agolo makes a convincing case for summarising English-language news for voice platforms. The reason is that the extractive model picks out and includes sentences from the lead of the articles. The bullets, on the other hand, seem to be picked from the whole length of the text, which sometimes muddles the logic between the extracted bullets.
These findings are also relevant when assessing the journalistic text quality of the summaries. Cohesion between text elements inside sentences is not a problem as the extractive models keep the original, single sentences intact. Coherence, on the other hand, proves to be more challenging as the tools glue single sentences together in a way that frequently obscures logical connections and sometimes creates grammatical mistakes.
Having said that, the tools generally performed well on grammar. Again, this is because of the extractive nature of the tools. Journalistic texts are supposed to be grammatically flawless, or at least strong, and that strength is carried through to the summaries. Transcription and quotes, though, were generally detrimental to the grammatical quality. Furthermore, it is only in very rare cases that automated summaries can serve as perfect teasers dropping straight out of the AI-machine. A well-written teaser stimulates the reader's curiosity without being too explicit. This very human, editorial quality gets lost in the AI-summaries.
One of the SR audio tests produced some very interesting results. A complicated chain of steps – 1) transcription, 2) translation, 3) summarisation – can actually work and provide an almost perfect summary. But the usual preconditions apply: the start of the text has to be so factual that it transcribes well, and the summary format should not be too long.
Finer details of summarisation, like coreference resolution, depend on the availability of annotated training data. In our case, such data were not available in German because the corpus was reserved for academic use. In English, the availability of training data is generally better. This makes development and training in non-English languages significantly harder. Hence, it is not surprising that most practical integrations of AI-summaries to date have been done in American newsrooms, where the linguistic, technical and financial environments are most favourable.
5.1 Capturing of facts and logic
Most reporters are taught on their first day of journalism school that the start of a text is crucial for reader engagement. The start of the article is also decisive for how well facts are captured by AI-summaries. This point is well illustrated by the tests Der Spiegel did with summarising background articles on climate change. These long, well-written articles often have scenic introductions, which cause problems for the summarisation tools. On several occasions, the tools missed the main point because they were led astray by the creative introductions, and they produced logical mistakes when combining two sentences that did not fit together (wrong references).
Gauging the results from the Agolo tool underscores these conclusions. The speakable summaries by Agolo of the same articles were generally good and got high scores because in this case the tool exclusively chose sentences from the lead or the introduction of the articles. It is almost as if the tool “capitulates” to the extensive articles and extracts something from the top as the easy way out. For example, this article about different technological solutions tackling climate change causes was summarised by extracting the first sentence of the lead:
“The climate goals can only be achieved if we actively remove CO2 from the atmosphere.”
Through our contacts with Agolo, we know that their model ranks the sentences in the body of the text in relation to the headline, and sentences that logically correspond to the headline are more likely to be found early on. The bullet points, however, are in many cases extracted from the whole text and got very low scores for capturing facts and logic in Der Spiegel’s tests, which is presumably also due to the significant length of the articles.
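To make this ranking mechanism more concrete, the sketch below shows one simple way such headline-based extractive selection could work. It is purely illustrative and assumes nothing about Agolo's actual model: the tokenisation, the bag-of-words cosine similarity and the 50-word budget are all our own assumptions.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text):
    # crude lowercase word tokenisation; real systems use proper NLP preprocessing
    return re.findall(r"[a-z0-9åäöüß]+", text.lower())


def cosine(a, b):
    # cosine similarity between two bags of words
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def speakable_summary(headline, sentences, max_words=50):
    """Rank sentences by similarity to the headline, then greedily keep the
    highest-ranked ones that still fit the word budget, in article order."""
    ranked = sorted(sentences, key=lambda s: cosine(tokens(headline), tokens(s)), reverse=True)
    picked, budget = set(), max_words
    for sentence in ranked:
        length = len(tokens(sentence))
        if length <= budget:
            picked.add(sentence)
            budget -= length
    # return the chosen sentences in their original order for readability
    return " ".join(s for s in sentences if s in picked)
```

Because lead sentences usually echo the headline, even this toy ranker tends to pick from the top of the article, mirroring the behaviour described above.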
The Swedish Radio results point in the same direction. This audio piece, 2:18 minutes long and in Swedish, entitled “Kina kritiserar Sveriges coronahantering” (in English: “China criticizes Sweden's corona handling”) was automatically transcribed using the SR transcription tool, then translated into English through Google Translate. It was then run through the Agolo summariser, which produced this summary in 38 words:
“China criticizes Sweden's handling of the Coronavirus and in a newspaper affiliated with the ruling Communist Party calls on the international community and the EU to condemn Sweden, which is believed to have capitulated to the virus.”
This summary is remarkably close to the manual benchmark summary we did beforehand:
“China criticizes Sweden's handling of the coronavirus. A newspaper close to the ruling Communist Party calls on the international community and the EU to condemn Sweden, which is believed to have "capitulated" before the virus outbreak. Experts see the criticism as serious and a sign of the bad relations between the countries.”
It is important to bear in mind that the text was both automatically transcribed and translated from Swedish, yet the tool still produced an accurate result. The reason is simple, and echoes Der Spiegel’s findings: the first sentence of the story was in itself a very good summary of the whole piece, as well as being straightforward and factual. Thus, both the transcription and the translation carried the original meaning well. But already in the second sentence, the transcription tool committed a fatal error. Faulty punctuation and the interviewee's difficult name made the transcription/translation incorrect, which was then reproduced in the second bullet point provided by Agolo:
"The crime Björn gives it a head of the Olympic program at the Foreign Policy Institute sees the criticism as serious."
As you see, this bullet point is both incomprehensible and flawed in a grammatical sense. The name “Björn Jerdén” is understood as “Björn ger den” (in translation: “Björn gives it”). And an even more curious detail is how this “Olympic program” appeared here as it has nothing to do with the story. The reason is again a misrepresentation by the Swedish transcription tool: “Asienprogrammet” (which means “Asia program”) is transcribed as “OS programmet” (which translates to “Olympic program”).
The Swiss tests by TX Group also point to significant problems with capturing facts and logic when automatic translation is introduced as one of the steps. In the summaries produced from articles that were first translated from German to English, then summarised, then translated back to German, essential facts were often missing. In some cases, substantial factual errors were included in the summaries.
BR tried both its own prototype tool in German and the Agolo tool in English. As most of the selected articles are listicles, the prototype struggled to make a meaningful choice of facts here. Scores for how well the summaries caught the facts and logic were medium to low, with the exception of this scientific article where the facts are straight – though the summary added a bit “too much detail”.
“Doch man weiß nicht genau, mit welcher Geschwindigkeit sich das Weltall ausdehnt. Eine Methode, um die Ausdehnung des Universums und damit die Hubble-Konstante zu messen, ist die Beobachtung von sogenannten Standardkerzen im Weltall. Bei diesen wissen Astrophysiker genau, wie hell sie in absoluten Werten sind, und können damit auch große Entfernungen im Weltall sehr exakt vermessen.”
(In English: “But it is not known exactly how fast the universe is expanding. One method of measuring the expansion of the universe, and thus the Hubble constant, is to observe so-called standard candles in space. Astrophysicists know exactly how bright these are in absolute terms and can therefore measure even large distances in space very precisely.”)
Agolo did an overall good job of capturing the facts of the BR articles. In two cases, the main sources of the articles were pulled into the summary. In one case, it is just a description of the person, without saying what he does or what his connection is to the story: “Bui Thanh Hieu is one of the best-known bloggers from Vietnam.” In the other case, a rather random fact about the person was presented: “For two to three weeks Michael was doubting the existence of the Holocaust.”
As in the tests at both SR and Der Spiegel, the bullet summaries did not capture the facts as well as the speakable summaries and they ended up with medium to low ratings. Agolo completely missed facts that it correctly identified as relevant in the speakable versions. On several occasions, the bullet points are composed of multiple sentences like passages with quotes. Again, in an apparent attempt to present protagonists, strange sentences like this one show up without further context: “When Sven Drewert, in his late 30s, wants to increase his credit card limit for the holidays, he experiences a surprise.”
It has been hard to assess how well extracted quotes caught facts and logic. Catching quotes did not work well for any company because identification was difficult and sources were not always clear. They got the lowest possible ratings because there was no connection to the facts of the story. But to be fair, the quotes are often not included in a text to add crucial facts but rather to add opinion and human flavour. So judging how well the few quotes extracted caught the facts may be less significant.
In conclusion: both the start of the text and the genre are crucial factors in extracting summaries that capture facts and logic well. The ideal text is factual, concise and has an opening that is clear and logically related to the headline, at least to produce a good result in Agolo’s model. ‘Newsy’ texts are more successful than more creative feature pieces or listicles.
Generally, the speakable summaries (of up to 50 words) got significantly higher ratings in our tests than the bullet points. The reason is that Agolo’s extractive model picks out and includes sentences from the lead or the introduction of the articles for the short, speakable summaries. The bullets are, on the other hand, picked from the whole length of the text. This muddles the logic between the extracted bullets and includes more irrelevant facts.
As shown above, one of the SR tests produced some very encouraging results. A complicated chain of steps – 1) transcription, 2) translation, 3) summarisation – can actually work and provide an almost perfect summary, if the start of the text is so matter-of-fact that it transcribes well, and if the summary format is not too long. However, the more complicated the text is – with people's names and specialised terms – the more confused the transcription gets, which is also highly detrimental to the summary results.
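As an illustration of how such a chain might be wired together in a newsroom script, here is a minimal sketch. The three callables are hypothetical placeholders for whatever services are used (in the SR test: the SR transcription tool, Google Translate and the Agolo summariser); this is not their actual APIs.

```python
from typing import Callable


def summarise_foreign_audio(audio_path: str,
                            transcribe: Callable[[str], str],
                            translate: Callable[[str], str],
                            summarise: Callable[[str], str]) -> str:
    """Chain the three steps discussed above. Each step stands in for an
    external service; errors introduced early (e.g. a misheard name during
    transcription) propagate unchanged into the final summary."""
    transcript = transcribe(audio_path)   # 1) speech -> text in the source language
    english = translate(transcript)       # 2) source language -> English
    return summarise(english)             # 3) English text -> short summary
```

In practice a newsroom would also want to log the intermediate outputs, since the tests show that most fatal errors originate in the transcription step rather than in the summariser itself.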
The obvious observation – that short, newsy articles are easier to summarise than long, creative ones – could perhaps lead to the hasty conclusion that such summaries are superfluous. But having an AI-tool provide short and accurate summaries even from a compact news article could serve multiple purposes, like saving time by excluding this step from the reporter’s workflow; surfacing the summaries as teasers on other pages; or converting news text into audio snippets to be consumed on smart speakers. The SR case also points to a potentially very exciting use case: to use automated summaries of translated (and transcribed) news pieces as a way of opening up your content to an audience that does not speak the original language.
5.2 Grammaticality
The tools we experimented with are mainly extractive – which means they select the sentences that the algorithm evaluates as the most representative. As a consequence, the grammar, which is supposed to be correct in the original text, is also correct in the summary. This conclusion holds, whatever tools or formats we analysed, as long as the tools are used in the language they are trained for: English for Agolo and German for BR.
The most frequent errors we found are due to translation. The issue here is not only a matter of translation quality, but can also be related to specific typographic signs, like the German quotation mark, which is not correctly recognised by Agolo. However, besides translation, a few other limitations are worth mentioning.
The first issue happens when quotes are used as part of the summaries. When a quote is used in the text, the grammaticality is often not as good as in the surrounding text written or reported by the author. This is especially true if the text is an audio transcript. In a radio piece, the reporter uses a written script, which makes it easier for the transcription tool to grasp it and to reproduce it in good grammar. But quotes from interviewees are often less cohesive.
This Swedish audio clip (entitled, in English translation, “Home quarantine - this is what you should keep in mind”) is an example. It was transcribed to Swedish text, then translated with Google Translate. The first two Agolo bullets are very good from a grammatical point of view, considering the whole text has been automatically transcribed and translated:
- Many companies are now asking their employees to work from home or put themselves in home quarantine if they have traveled to places with many confirmed cases of the coronavirus, covid-19.
- Gunilla Ockborn is an infection control doctor in the Västra Götaland region and she believes that hygiene is important to avoid the spread of infection.
But as the third bullet is based on a quote, the grammatical structure becomes poor:
- If you have access to several toilets, then maybe you can use that the person who has symptoms has a toilet and this is still required to make sure to take care of their hand hygiene.
One reason for this is that the word-by-word Swedish transcription – which in itself is an accurate representation of what the interviewee says – is not good in the grammatical sense. That affects the translation and hence the third bullet.
In conclusion, grammar is by far the parameter that received the highest score among all the criteria we used to evaluate the tools. The grammaticality is generally good because of the extractive nature of the tools. When whole sentences are extracted and glued together with others, this sometimes creates grammatical problems in the new context. Quotes are generally detrimental to the grammatical quality of the summaries as they have a weaker grammatical composition to start with, a problem made worse if the quote is automatically transcribed.
5.3 Journalistic text quality
Journalists write their stories following specific guidelines, adhering both to genre conventions and their newsroom’s particular style. The classic news article, for example, leads with the most important facts and ranks further information by importance. This creates a “per newsroom” definition of quality and makes criteria hard to compare between the study participants. Different genres of articles also played a huge role in defining quality differences. For example, news-style text was generally closer to the human benchmark with the tools at hand. The essential question is how usable the output of the summarisation tool is. Can it be used as it is, or is further editing required? How close is the result to the human benchmark summary?
As mentioned in our study setup, this category also considers both coherence and cohesion on top of the general usefulness of the summary in journalistic contexts. Cohesion refers to the creation of meaning on the level of single sentences. Cohesion was not a problem at all with the tools tested in our setup. The extractive models keep the original sentences intact. Therefore, meaning is rarely lost from deconstructing or re-writing single sentences. Coherence is much more relevant to the topic of journalistic text quality. We found many cases where meaning was lost because the summarisation tools strung single sentences together in a way that obscured logical connections and the grammatical subjects of the sentences. This happened a lot when pronouns were used instead of the subject.
Agolo – speakable summary
Agolo’s summaries scored well with Der Spiegel in this category. Minor deductions happened for unsuitable sentence transitions. One otherwise good summary shows one of these coherence issues with sources. The source, Ricarda Winkelmann, shows up unintroduced and out of context:
"It starts slowly – and does not stop for centuries: New simulations show how massively the melting of Antarctic ice is changing the planet. Only one measure would help. " Suddenly," says Ricarda Winkelmann, "it became dark outside the cabin window."
For BR, Agolo served up two perfect speakable summaries and three average ones from long texts that had previously generated no output at all. This is impressive because the texts used by BR were very different from the news texts the tool is optimised for. As with Der Spiegel above, Agolo pulled two sentences mentioning sources into the BR summaries that were unrelated to the preceding sentences from the lead and did not introduce the respective sources.
For TX Group, the speakable summaries did not work well. They scored lower on the factual rating and contained some repetitions and unclear formulations. While the first sentences were often close to or matched the human summary, the following sentences deviated from editorial expectations.
SR also achieved the highest score in one case. They took the audio file from this story: Denmark closes borders as a result of coronavirus outbreak. This 36-second clip was then transcribed using the SR transcription tool, which uses an NLP model from Speechmatics. The Agolo tool produced this short summary:
“At twelve o'clock today, Denmark closed its borders to travelers from Sweden in order to prevent further spread of the Corona virus. Swedish citizens planning to fly from Copenhagen's airport, Kastrup will not be allowed in and trains will not be allowed to cross the bridge between the two countries.”
This summary is relevant, it captures the most important facts and it is correct – if not perfect – grammatically. We can see that it is close to the manual summary done for benchmark purposes:
“Today Denmark closed its borders to travellers from Sweden in order to prevent further spread of the coronavirus. Roadblocks have been set up on the bridge to make sure only Danish citizens proceed across the bridge.”
Jagran New Media made an interesting observation on summary text length. They compared Agolo’s recommended 150-word setting for the speakable summary with the 50-word setting of our study. The speakable summary in 50 words did not work properly; they were happier with the 150-word length. The editorial team suggested using the Agolo tool selectively, for example choosing 5 bullet points and a 150-word speakable summary if the content is longer than 500 words, and 3 bullet points and a 50-word speakable summary if the content is shorter than 350 words. This shows again how much it matters that the input is well suited to the summarisation model.
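Expressed as a simple rule, the team's suggestion could look like the sketch below. The thresholds come from their proposal; how to handle inputs between 350 and 500 words was not specified, so the fallback is our own assumption.

```python
def choose_agolo_settings(word_count: int) -> dict:
    """Heuristic based on the Jagran New Media suggestion: longer inputs get
    the longer summary formats."""
    if word_count > 500:
        return {"bullet_points": 5, "speakable_words": 150}
    if word_count < 350:
        return {"bullet_points": 3, "speakable_words": 50}
    # the 350-500 word range was not specified by the team; default to the shorter format
    return {"bullet_points": 3, "speakable_words": 50}
```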
The conclusion from the SR test described above is that AI-summaries from automatically transcribed audio can be excellent when it comes to journalistic text quality. However, there are some very important preconditions: the text has to be similar to what the model was trained on, and the transcription/translation should be of good quality, without substantial errors. In some of the cases tested at SR, it was the automatic transcription and translation that led to significant errors and negatively affected the journalistic text quality. To get the required text quality, it helps greatly if only one person is reporting in the story and no additional voices from experts with complicated names and titles are included.
Agolo – bullet-point summaries
Bullet-point summaries were a little tricky to extract from the longer articles: the bullet points themselves ended up being too long. This was the case at Der Spiegel and BR, where the texts exceeded the recommended length for the tool. TX Group concluded that the bullet summaries contained some repetitions and unclear formulations, but that the human edits required were limited.
For the shorter texts at Jagran New Media, bullet summaries worked well and quickly fetched three lines from the original stories, regardless of their length. All these lines were relevant and in keeping with the sense of the article. When Jagran New Media tried the tool with the option of 5 bullet points instead of 3, the tool worked well and the outcome was still excellent.
For BR, one text on the Ischgl research scored a perfect rating by presenting the research and its result very well in just 3 bullet points:
- Reporters have analysed more than 4,000 Instagram posts that were posted in Ischgl between the end of February and the beginning of March; the timespan when the first holiday-makers were probably infected and after which several Corona cases were confirmed.
- Tourists who might have contracted the virus in Ischgl could have spread it all over Europe, for example to Great Britain, Iceland, Poland or the Czech Republic.
- An especially large number of Instagram users in Ischgl then went to the Netherlands and Belgium, to Switzerland, Scandinavia - and, in particular, to Germany.
Agolo also highlighted an important difference between speakable summaries and bullet points that is relevant to this category: whereas speakable summaries are more focused on coherence, the bullet points are optimised towards covering a broad range of information from the text.
Quotes
With quotes, some parties had issues with extraction in longer texts – or because of the different styles of quotation marks used in different countries and journalistic traditions. The Agolo tool version is designed to pick up the normal quotation marks in English: single (‘English’) or double (“English”) quotation marks, also called inverted commas.
The German quotes are marked with German quotation marks, as illustrated here:
"Professor Gigerenzer puts it this way: „Personally, I think we need a change here. The change must be in that to give consumers the right to understand how their value comes about.“
This could easily be adjusted for paying customers, as opposed to our free trial, Agolo told us. Otherwise, extraction worked well. Assigning a rating for summary quality makes no sense for quotes, because they are not meant to capture the meaning of the whole text. The quote feature was not present in the other tools we tried.
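Short of the adjustment Agolo offers to paying customers, a simple workaround on the newsroom side would be to normalise quotation marks before sending the text to the tool. The sketch below is our own illustration, not part of Agolo's product:

```python
def normalise_quotation_marks(text: str) -> str:
    """Map German-style quotation marks (and curly English ones) to the plain
    double and single quotes that the quote extraction expects.
    A crude pre-processing step; it does not try to re-pair quotes."""
    mapping = {
        "\u201e": '"',  # „ German opening double quote
        "\u201c": '"',  # " curly opening double quote
        "\u201d": '"',  # " curly closing double quote
        "\u201a": "'",  # ‚ German opening single quote
        "\u2018": "'",  # ' curly opening single quote
        "\u2019": "'",  # ' curly closing single quote / apostrophe
        "\u00bb": '"',  # » guillemet, also used for quotes in German
        "\u00ab": '"',  # « guillemet
    }
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    return text
```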
BR tool
Der Spiegel tested the BR prototype with background articles on climate change and saw medium to high ratings for journalistic text quality. Deductions occurred in some cases due to unsuitable sentence transitions or a too-high level of detail. This is one example of too much detail, which makes the summary hard to understand for readers without expert knowledge of climate models, as explained in this article:
"Das Kürzel RCP 8,5 steht für ein Worst-Case-Szenario. Das RCP-8,5-Szenario sei am besten geeignet, um den Klimawandel in den vergangenen und den kommenden Jahrzehnten bis 2050 zu beschreiben, erklären Schwalm und Kollegen."
BR saw logic suffer in some cases from missing coreference resolution when the subject of the sentence is unclear as in this example. “Sie” refers to parents, which is not understandable from the summary:
"Der Kinderbonus in Höhe von 300 Euro pro Kind ist einer der Maßnahmen im Rahmen des Corona-Konjunkturpakets. Sie erhalten nun auch den Kinderbonus - ohne dazu einen gesonderten Antrag stellen zu müssen. Wann wird der Kinderbonus ausgezahlt?"
Coherence problems showed up more often in the listicle text summaries, where the logical connection between individual sentences went missing. Overall, this led to medium to low ratings for the summaries. In the best cases, a text would be usable with minor edits. This is an example from a text about Covid restrictions in Germany and Sweden:
"Die Auflagen, die in Schweden gelten, werden von den Behörden des Landes auf einer Kriseninformations-Seite veröffentlicht. Schweden hat verglichen mit Deutschland relativ viele Todesfälle in der Corona-Pandemie zu beklagen. Besonders in die Kritik geraten ist, dass in Schweden viele alte Menschen in Pflegeheimen gestorben sind."
This summary is, apart from the uninformative first sentence, quite good.
5.4 Usability as a teaser
Our testing showed that summarisation tools can be useful to augment the production of teasers – but only in very rare cases are automated summaries ready to use as they drop out of the machine. There are different reasons for that, but the most important one is that teasers are an example of a relatively sophisticated journalistic skill. Depending on the style of the publication, teasers should ideally summarise the article to a certain extent – but usually not completely, as they should also serve as a cliffhanger to guide readers’ attention to the actual text. This is very difficult to achieve with an automated text, whether it is the result of an extractive or an abstractive model, and even more so if there is automated translation in between.
When we consider possible use cases, our testing results point to augmenting existing workflows rather than replacing them entirely. But automated summaries would still be an effective tool, for example for SR, where audio clips accompanied by teasers are at the centre of the company’s news strategy. These teasers are today compiled manually, so a workflow where an AI-model could suggest three bullet points out of a transcribed audio file could save significant time, even if a human editor had to go over and correct those bullet points. It could also be interesting to summarise audio segments that are cut out of the linear programming.
Going by our testing results, newsrooms still need an editor to shorten the results of Agolo’s bullet-point summaries, as the tool mostly produced bullet points that were too long. This happened mostly when the input exceeded the recommended word count. Often, quotes were included in the bullet points, which made the selection incoherent. Also, in some cases the choice of facts was not excellent compared with a human ranking, so editors would have to check whether all the important facts are included. This last aspect points to one of the biggest issues with using automated summaries: shortening a summary or checking the text for coherence is, in most cases, a task that still makes the workflow more effective than drafting the text yourself. But having to check the original source to see whether everything important is included may make the augmented workflow less efficient than the manual one.
This issue might be eased by using the tools only for factual news stories with a certain length. We found that both BR’s prototype and Agolo work best with short, factual and single-voiced stories, such as SR’s Denmark closes borders as a result of coronavirus outbreak, which was the only article that got a high score in the category “useful as teaser”. Here, Agolo produced three bullet points that would work well as a teaser, with the minor reservation that the last one is a bit vague:
- At twelve o'clock today, Denmark closed its borders to travelers from Sweden in order to prevent further spread of the Corona virus.
- Swedish citizens planning to fly from Copenhagen's airport, Kastrup will not be allowed in and trains will not be allowed to cross the bridge between the two countries.
- Roadblocks have been set up on the bridge, and Danish police are checking the occupants of any vehicles to make sure.
The speakable summaries got medium rankings in most cases. Some of them were just too short to be used as teasers and contained too little information to attract readers, like the one based on this article on technological solutions to tackle climate change causes:
- The climate goals can only be achieved if we actively remove CO2 from the atmosphere.
There were also speakable summaries with a good representation of facts. Their issue was that they worked perfectly as summaries but not as teasers, as they contained no incentive for the users to read on. A good teaser adds a “meta dimension” to the facts, i.e. it hints at a compelling reason to devote time to the piece, and this is, for obvious reasons, not something the extractive AI-summariser does or pretends to do.
The BR tool got a wide range of ratings, from very useful to not very useful summaries, mostly depending on the genre of the text. In most cases, a lower score was caused by a special article format or genre that led the tool astray, so that scenes or side issues were included in the summary. Again, the best ratings occurred with traditional news texts.
5.5 Delta to publication
The use of AI-driven solutions in newsrooms is not a goal in itself. Development and implementation efforts are only a good investment if processes can become more effective as a result. Therefore the delta, or pathway to publication, is key when it comes to assessing tools and use cases. As demonstrated in this evaluation, even established AI systems need intensive training or manual support to meet the requirements of journalistic summaries. This means that we must address the expectations towards AI tools.
The most important conclusion to start with: don’t expect AI-driven systems to autonomously produce and publish summaries. There are several, mainly quality-related and legal reasons not to aim for this. For example, extractive systems could combine relevant sentences in a way that creates problematic contexts, as we have observed in our evaluation. Abstractive AI summarisers could in turn generate new contexts, or use misleading terms because of biased training datasets. Especially with sensitive or controversial topics, both approaches could create legal issues.
Of course, AI tools can still be useful within the journalistic workflow. For processes that aim at publishing summaries, we consider hybrid approaches to be most valuable. Journalists would have to carry out quality control before publication and a certain degree of editing would be necessary. Ideally, this should be combined with feedback loops to further train the AI algorithms.
Besides publication, we consider many more use cases as relevant for AI-generated summaries [see Chapter 7 on potential future use cases]. In cases such as archive management or research support, a less perfect summary might be acceptable and even valuable. Summaries slightly missing the high expectations towards journalistic teasers would work in these use cases, although context-related quality criteria like the capturing of facts would still be crucial in these settings.
So the minimum quality required depends strongly on the application context. The authors of this study represent media companies with high editorial standards, but other outlets with different standards might be prepared to accept lower quality or to publish automated summaries with a disclaimer. Therefore, it seems advisable to first clearly define this setting before looking for and training an AI-driven system for it. Given that varying genres pose huge challenges for these tools, we would also recommend clearly defining the format and focusing on news in the beginning.
In summary: AI-driven summarisation tools can only help us solve problems if we are prepared to define them consciously and to guide the learning process closely. We should not expect them to be autonomous ghost-writing robots, and above all we should not – as serious publishers – hand over publishing to them. Only if we commit ourselves to hybrid processes can we really leverage the potential of AI-driven applications.