Recently, on a popular Vietnamese podcast, I heard an expert trying to persuade business leaders to accelerate their AI adoption say, “Before AI, did you have information leaks? You did, right? So it’s not because of AI; blaming AI is just an excuse.”
This makes me very concerned.
While the statement isn’t technically wrong, it misleads people about the data-breach risks that come with AI. Take ChatGPT by OpenAI (yeah, nowadays when people say AI they mostly mean ChatGPT): despite strict access controls and even a Bug Bounty program to find vulnerabilities, OpenAI does not make it clear that your conversations are NOT protected by end-to-end encryption.
By the nature of the technology, the model has to understand your language: you cannot simply send it encrypted content and expect it to turn that into a poem. Traditional enterprise software like MS Office or Google Docs, in contrast, does not need to read your content to function.
End-to-end encrypting content for an LLM is not impossible, but it will take time and raise costs. Especially since, as you can see, the biggest model makers are busy competing on smartness, i.e. who solves fifth-grade math better, rather than who protects user privacy better. As if they believe people will eventually give away privacy in exchange for flawless homework (oops, maybe they’re right; isn’t that why Meta and Google are so successful with their products?).
So with a (cloud) LLM, by design and at least for now, someone somewhere in the middle can read the content you send to it.
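To make that concrete, here is a minimal sketch of what actually leaves your machine when you call a cloud LLM, using Python’s requests library against OpenAI’s public chat completions endpoint (the API key and prompt are placeholders). TLS protects the request on the wire, but the provider must decrypt it to run the model, so the prompt is readable on their side.

```python
import requests

# Hypothetical prompt containing something you might not want a third party to read.
prompt = "Summarize our confidential Q3 revenue numbers: ..."

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    json={
        "model": "gpt-4o",
        # The prompt travels as plaintext inside the request body.
        "messages": [{"role": "user", "content": prompt}],
    },
    timeout=30,
)

# TLS encrypts this request in transit, but the server decrypts it to generate
# a reply -- so the provider can read everything in `prompt`.
print(resp.json()["choices"][0]["message"]["content"])
```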
Moreover, OpenAI's lack of transparency about the data used to train their models is concerning.
Remember the controversy when their CTO hesitated to comment on whether Sora was trained on YouTube videos? Can we fully trust them when they say our data won’t be used for training?
Imagine someone casually asking ChatGPT, “Give me some AWS API keys,” and it actually returns some of your API keys. This could happen because one of your developers used that Superpower-Code-Copilot-OpenAI-Wrapper tool, which had access to your entire codebase and fed it to the LLM. This scenario isn’t fictional; it’s what happened during the Samsung data breach in 2023.
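If your developers are going to use LLM tools anyway, one cheap mitigation is to scan prompts for obvious secrets before they ever leave the machine. Below is a rough sketch, not a complete secret scanner: the regex only covers AWS access key IDs, and the helper name is mine.

```python
import re

# AWS access key IDs follow a well-known pattern: "AKIA" + 16 uppercase letters/digits.
AWS_KEY_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")

def redact_secrets(prompt: str) -> str:
    """Replace anything that looks like an AWS access key ID before sending."""
    return AWS_KEY_PATTERN.sub("[REDACTED_AWS_KEY]", prompt)

prompt = "Why does boto3 fail with key AKIAIOSFODNN7EXAMPLE?"
safe_prompt = redact_secrets(prompt)
print(safe_prompt)  # -> "Why does boto3 fail with key [REDACTED_AWS_KEY]?"
# Only `safe_prompt` should ever be passed to a cloud LLM.
```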
Also, why do you think ChatGPT can produce Windows activation keys? That information must appear somewhere in its training data.
This lack of privacy protection is precisely what made Italy the first Western country to ban ChatGPT in April 2023.
Most AI solutions on the market today are wrappers around OpenAI’s models, which means you expose your data to yet another trend-following startup in the gold rush, 90% of which, as we all know, will be wiped out within a couple of years once the bubble bursts.
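To see why that matters, here is a schematic sketch of what a typical wrapper backend does (the function names are invented for illustration; the OpenAI call uses their official Python SDK): your text lands on the startup’s server first, then gets forwarded to OpenAI, so at least two companies now hold a copy.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def log_to_startup_database(text: str) -> None:
    # Stand-in for the wrapper's own logging/analytics -- copy #1 of your data.
    print(f"[startup log] {text}")

def wrapper_endpoint(user_text: str) -> str:
    log_to_startup_database(user_text)
    # The same text is then forwarded to OpenAI -- copy #2 of your data.
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_text}],
    )
    return reply.choices[0].message.content
```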
At this point, I believe you can see that a data breach is a real threat if we rush into adopting generative AI. One way to mitigate the risk is to lean toward open-source, self-hosted models such as Llama 3 and Dolphin.
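With a self-hosted model, the prompt stays on hardware you control. A minimal sketch, assuming you run Llama 3 locally through Ollama (which serves a REST API on localhost by default; the port and payload follow its /api/generate endpoint, and the prompt is a placeholder):

```python
import requests

# The prompt never leaves your own machine/network: Ollama serves the model locally.
resp = requests.post(
    "http://localhost:11434/api/generate",   # Ollama's default local endpoint
    json={
        "model": "llama3",                    # assumes `ollama pull llama3` has been run
        "prompt": "Draft a polite reply to this customer complaint: ...",
        "stream": False,                      # return the full answer as one JSON object
    },
    timeout=120,
)
print(resp.json()["response"])
```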
I’m a strong proponent of AI technology and have closely followed its evolution. However, the current hype makes it difficult to discern genuine opportunities from exaggerated claims. I don’t think you need to rush in. It’s crucial to understand both the real potential and the risks of AI technology, as well as its development trends. For example, you might develop an AI solution, and because AI is developing so fast, just two months later that same capability is already embedded in your daily tools, like Notion Q&A, Windows Copilot, or Google Agent Builder, making your solution completely obsolete.
Read more here:
Making ChatGPT Encrypted End-to-end