
Who’s scraping your data to train LLMs? Tech vendors rewrite T&Cs, privacy policies to claim new rights over customer data in AI arms race – cue rage


Advertising works. This morning Mi3 watched as a recently alighted passenger at Wynyard station stood transfixed on the platform. She was focused on a Salesforce advertisement playing on the big screen across the train tracks, even as the heaving sea of commuters washed around her.

The ad promised that “Salesforce AI never steals or shares your customer data” and at the end, a cowboy threw some data-thieving hombres off the back of a train. The message was clear enough: Your data is safe with us.

True story. But what feels like clarity in a well-crafted, highly produced piece of creative still leaves plenty of room for ambiguity, as marketing technology vendors like Salesforce and Adobe are now learning.

Salesforce, for instance, earned the ire of its customers last month when reports started circulating widely on social media that it was using customer data from its Slack messaging platform to train AI. Worse, it was automatically opting customers in.

1997 called and wants its workflow back

The backlash started when an angry user posted to Hacker News, a popular developer website, after learning that the only way to opt out of having their data used for model training purposes was to email the company. Animus spread quickly to social media, and soon social media did what social media does best: pushed its users to ever edgier conclusions.

Pretty soon a story was doing the rounds – incorrectly – that all those amusing, pithy, and occasionally snarky Slack messages we all love to share with our colleagues were part of the data set being used for training. Part of the problem may be that the Salesforce mothership takes a different approach to Slack, which it agreed to acquire in December 2020.

Salesforce, which has put AI at the centre of its pitch since the launch of Einstein in 2016, says it will only use customer data to train models if the customer consents. And specifically on generative AI, a spokesperson stressed: “We have a zero data retention policy with our LLM providers, which means that data is never stored or used for model training — or seen by a human.”

Slack also uses customer data to train models, but customers have to opt out … by email.

According to a Slack statement sent to Mi3, “To be clear, Slack is not scanning message content to train AI models. We published a blog post to clarify our practices and policies regarding customer data.”

The Slack spokesperson also clarified the difference between Slack’s generative AI and other machine learning features:

“For Slack AI, which leverages third-party LLMs for generative AI in Slack, no customer data is used to train those third-party LLMs and Slack does not train any sort of generative AI on customer data.

“Separately, Slack has used machine learning for other intelligent features (like search result relevance, ranking, etc.) since 2017, which is powered by de-identified, aggregate user behaviour data. These practices are industry standard, and those machine learning models do not access original message content in DMs, private channels, or public channels to make these suggestions.”

As tech media website TechCrunch noted, the offence felt by Slack customers may be new, but the terms and conditions are not: they have been applicable since September last year.

The story, however, is emblematic of the problems marketing technology and other vendors are getting themselves into as they try to rewrite the usage rules for the ChatGPT era. Problems arise when they fail to enunciate their policies clearly, or when key information is hard to find online.

Catastrophize me

Adobe is another martech leader to have spent some time on the rack. In recent weeks it has been hosing down a wildfire of outrage amongst its customers in the creative community, caused by changes to its terms and conditions and by its failure to clearly articulate what those changes mean.

In this case, the issue was Adobe updating its terms and conditions to reflect how it uses customer data and the circumstances under which it might access that data.

Once again customers took to social media to vent. Some said they were angry that software they had already paid for wouldn’t work unless they clicked “agree” to the policy updates (which, frankly, stretches the definition of consent); others expressed a sense of betrayal. A lawyer complained that Adobe wanted access to privileged information protected by client-attorney confidentiality.

It didn’t help that Adobe seemed to be updating the T&Cs on the fly as the crescendo of voices rose.

As to its actual policies, Adobe does not use customer data to train its gen AI models. Recent updates to its terms and conditions clarified that Adobe’s generative AI models, such as Firefly, are trained on datasets consisting of licensed content from Adobe Stock and public domain content where copyright has expired. Additionally, Adobe maintains that it does not assume ownership of customers’ work and only accesses user content for purposes such as delivering cloud-based features, creating thumbnails, or enforcing its terms against prohibited content. It is that last point which seems to have triggered the flare-up on social media.

The problems Salesforce and Adobe have encountered reflect a growing trend around personal and corporate data.

Do the right thing?

Customers simply don’t trust the tech sector to do the right thing, mainly because they increasingly don’t trust anyone to do the right thing.

John Bevitt, the managing director of Honeycomb Strategy, has a clear sense of where customer scepticism is coming from.

In his recently released Brands beyond Breaches report, he noted: “Customers, now more than ever, seek assurance that their data doesn’t just fuel profits but is respected and protected with the utmost integrity – and with a constant stream of data breaches being announced, it’s no surprise that distrust is on the rise.”

His report found that distrust pervades every single industry, with media companies (65 per cent), search engines (58 per cent), and market research firms (50 per cent) rounding out the top five list. Not far behind are ecommerce companies (49 per cent), industry trade publications (48 per cent), online services (48 per cent), health and fitness companies (48 per cent) and technology brands (45 per cent).

Part of the problem is a tech industry culture that prioritises legal compliance over clarity and transparency.

Mi3 asked a range of vendors: “Do you use customer data to train AI models, and if so, do you allow customers to opt out?”

These are both binary propositions, but we rarely enjoyed the precision of yes or no answers.

Another example is Sitecore, a tier one CRM and martech provider. It says the use of Gen AI within its products is optional and that it has designed its Gen AI features to be “distinct and discernible within our cloud products, giving you the freedom to choose whether to use them or not.”

A spokesperson gave a very precise answer about whether customer data generated by input prompts and the subsequent output was used to train models.

“No, neither Sitecore nor its third-party AI model providers will use these inputs/outputs for their own purposes, meaning we don’t use prompts or results to train models. For additional comfort to you, Sitecore is committed to ensuring that our third-party providers adhere to this same policy.”

Beyond that, its third-party providers “may only use your customer data to provide its services to you and will process the customer data temporarily for such purpose.”

But even this answer allows for ambiguity, as it relates only to the inputs and outputs of the generative AI capabilities.

It doesn’t seem to preclude using other customer data to train models. We have asked for additional clarification.

Clarity is not difficult

It shouldn’t be this hard for vendors to answer simple questions about how data is used. 

Both Pegasystems, a real-time interaction management (and business process management) vendor, and Qualtrics, an experience management platform, have demonstrated and released material improvements in capability fuelled by generative AI at their international user conferences this year.

In both instances, Mi3 asked if customer data was used to train models.

Alan Trefler, Pega’s CEO, said not only does his company not use customer data to train LLMs, it doesn’t train LLMs at all.

Peter van der Putten, the director of Pega’s AI Lab, further explained that the firm made a deliberate choice to avoid the training game: it’s too difficult and costly, he told Mi3. Instead, Pega wanted to see how much juice it could extract without training models. Its recent Socrates and Blueprint initiatives are examples of the transformative power of LLMs, and of what can be achieved without using customer data.

Meanwhile, Qualtrics launched a series of new AI-powered capabilities at its conference to help organisations maximise research investments and deliver insights faster. CEO Zig Serafin was clear that it uses customer data to train models, but said Qualtrics provides a simple and clear way to opt out.
