LLMs are opaque search engines – change my mind

I am going to start by stating the obvious, and continue in a bit of a roundabout way, so please bear with me here…

When someone enters text into a search engine, their purpose is not normally to see how many or which web pages contain those words, but to find something out, say for example how to clean the filter on their dishwasher, or how to buy a widget. So far, so obvious.

So, in a way, Large Language Models (LLMs) are very much like search engines: they distil the content they have ingested and present it in a way that, hopefully, will lead the user to the answer they seek. The problem is that LLMs are much less transparent about how that happens.

Now, I may be showing my age, but I remember when the first search engines came online. Initially, they were simple systems that used something akin to SQL queries to find pages containing certain keywords and present them to the user. If I wanted to find out how to clean the filter on my dishwasher, for example, I would type “+clean +filter +dishwasher” into Altavista, and it would show me the pages containing those three words in the hope that at least one contained the relevant information.
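
To make that concrete, here is a minimal sketch of that era’s boolean “AND” retrieval: a toy inverted index that returns only the pages containing every query word. The pages, URLs and function names below are illustrative assumptions, not how Altavista was actually implemented.

```python
# Toy boolean keyword search, in the spirit of early search engines.
# All page data below is made up for illustration.
pages = {
    "appliance-tips.example/filters": "how to clean the filter on your dishwasher",
    "widget-shop.example/buy": "buy a widget online today",
    "recipes.example/pasta": "how to cook pasta in a clean pan",
}

# Inverted index: word -> set of pages containing that word.
index: dict[str, set[str]] = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

def search(query: str) -> set[str]:
    """Return the pages containing ALL query words ('+word' style)."""
    words = [w.lstrip("+") for w in query.split()]
    hits = [index.get(w, set()) for w in words]
    return set.intersection(*hits) if hits else set()

print(search("+clean +filter +dishwasher"))
# -> {'appliance-tips.example/filters'}
```

Note that nothing here ranks results or judges intent: the engine simply matches words, which is precisely what made this approach so easy to game later on.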

This system relied on the assumption that the authors of the indexed pages wrote them without considering that they would be indexed by, and accessed through, a search engine. Once more people started using search engines to access information, the authors of web pages realised that their visibility, and therefore their revenue, depended on the results delivered by search engines. This changed how content was written and presented – one stopped writing for readers and started writing for search engines. It was the birth of Search Engine Optimisation. At that point, Internet users started encountering pages created specifically to take advantage of the search engine, with titles like: “Clean the filter on your dishwasher” followed by a deluge of spam and virus links, which the search engine was not equipped to filter out.

Search evolved with “smarter” engines like Google, which devised search technologies to outsmart SEO techniques and keep the results of the search relevant to the users.

As everyone here knows, an online search has three actors with often divergent goals:

  • Users want to find the information they are looking for.
  • Search engines want to keep users coming back but also to direct them to paid advertising.
  • Content strategists want either to “trick” the engines into sending users to their advertising pages as if they were informative, or to accept the advertising model – in other words, to adopt an SEO and/or a PPC strategy, respectively. Even here, it is in the engines’ interest to maximise the amount of money that advertisers pay for each actual lead, which is against the advertisers’ interest.

This divergence of interest, especially between search engines and page optimisers, has created an “arms race” of techniques. SEO tries to create content so that search engines will present it as relevant, while search engines try to filter out that content so that people will either click on paid content or find actual information.

Again, nothing new here.

Back in the day, search engines like Altavista worked on the assumption that the web pages they indexed had not been created specifically to game the system. Today, Google works on the assumption that it can always stay a few steps ahead of the pages that actually are trying to game it (with varying degrees of success). Similar to Altavista, today’s LLMs rely on the fact that the content they process was not designed with them in mind. In other words, they are using a “naïve” dataset. This won’t last long, however.

Right now, only a minority of people use LLMs like ChatGPT to get answers. But as more people turn to LLMs, the commercial potential of nudging those tools will grow. Because LLMs are less transparent about how they distil the content of their training sets, doing so is more complicated than stuffing a page with keywords. If an “LLM optimisation” industry exists, it is still in its infancy, but it is likely that the conflict of interest between the tools that digest the Internet’s content and the makers of that content will create the same problems of relevance in the LLM world as it does for search engines today.

It is also possible that there will be insurmountable technical obstacles to influencing the output of LLMs – more than there are with Google results today. After all, SEO was a lot easier with Altavista; today it is much more difficult, largely because of the relative lack of transparency in how Google produces its results. LLMs are even more opaque about how the knowledge they collect from their datasets is distilled into an answer. It would be unprecedented in the history of internet search for gaming the system to become impossible, but ultimately it may come down to the difficulty of creating sufficiently large “biased” datasets.

As LLMs become mainstream tools, we will see the same scenarios play out that we saw with search engines: conflicts of interest between users, LLM providers and content creators, as long as the latter can shape their content to sway the results of LLMs in their favour. And lurking above all this is the eye of regulators, who will certainly have plenty to say if LLMs start to substantially influence what people think about the world.
