Despite the range of exciting use cases, AI tools can become a liability, putting sensitive and proprietary data at risk when they rely on external components such as computational libraries or when models are trained on cloud infrastructure. At Symphony Labs, we decided to invest in a new field of research, Privacy-Preserving Machine Learning (PPML), because we want to make sure customer data is never exposed to external threats. Our first goal is to create a secure pipeline integrating Amenity with LLMs, in which AI APIs provide the base-layer conversational AI and extend the power of Amenity’s algorithms for financial analytics research with reasoning, input-parameter personalization, and logic capabilities.
But what are the threats that Machine Learning faces?
Securing machine learning against attacks
In a typical machine learning scenario, a central server first collects data from multiple sources, then trains a model on the combined data, and finally sends the model back to the sources for deployment. When data is gathered on a central server without additional protection, the computation host can directly access the incoming data.
Moreover, all three stages of the ML life cycle can be the target of attacks such as model stealing, model inversion, model and data poisoning, or data reconstruction through breached model parameters. According to Fan Mo, Zahra Tarkhani, and Hamed Haddadi, in their publication Machine Learning with Confidential Computing (2022), attacks fall into two types: 1) attacks on confidentiality and 2) attacks on the integrity of the ML process. Attacks on confidentiality put the privacy of data and the intellectual property of models at risk; attackers are mostly interested in extracting unauthorized sensitive information or the model itself. Integrity attacks aim to actively corrupt the model, for example by adding calibrated noise to the data so that it produces wrong predictions.
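To make the integrity threat concrete, here is a minimal, self-contained sketch of a label-flipping poisoning attack on a toy scikit-learn classifier. The dataset, model, and 30% poisoning rate are illustrative assumptions, not Symphony's setup.

```python
# Toy illustration of an integrity (data-poisoning) attack: flipping a
# fraction of training labels degrades the trained model's accuracy.
# Dataset, model, and poisoning rate are arbitrary, illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Clean model: trained on the data as collected.
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Poisoned model: an attacker silently flips 30% of the training labels.
rng = np.random.default_rng(0)
poisoned_labels = y_train.copy()
idx = rng.choice(len(poisoned_labels), size=int(0.3 * len(poisoned_labels)), replace=False)
poisoned_labels[idx] = 1 - poisoned_labels[idx]
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, poisoned_labels)

print(f"clean accuracy:    {clean_model.score(X_test, y_test):.3f}")
print(f"poisoned accuracy: {poisoned_model.score(X_test, y_test):.3f}")
```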
Privacy measures must be reliable in order to guarantee that no breach can occur while sensitive data is handled or models are trained and deployed.
Symphony will develop PPML models that protect the entire machine learning life cycle of Symphony’s AI-based tools, ensuring data and client privacy.
Ensuring data security
Historically, data has been secured at rest and in transit, but has remained at risk while in use. Only two scenarios can protect data in use: we can either keep the data on local hardware (so it is never shared), or centralize it in a confidential, secure space.
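As a minimal sketch of that gap (using the open-source cryptography library and made-up values), the record below is protected on disk and over the wire, yet has to sit in plaintext in the host's memory the moment it is actually processed:

```python
# Minimal sketch: data can be encrypted at rest and in transit, but a
# conventional host must decrypt it in memory in order to compute on it
# ("in use"). Uses the `cryptography` package; values are illustrative.
from cryptography.fernet import Fernet

key = Fernet.generate_key()
fernet = Fernet(key)

record = b"client_id=42;position=LONG;notional=1000000"

# At rest / in transit: only ciphertext is ever stored or sent.
ciphertext = fernet.encrypt(record)

# In use: the plaintext reappears in the host's memory, where the
# operating system, hypervisor, or a co-tenant attacker could read it.
plaintext = fernet.decrypt(ciphertext)
fields = plaintext.decode().split(";")  # any computation needs the raw data
print(fields)
```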
The Trusted Execution Environment (TEE) is the technology PPML uses to close this gap: it lets sensitive data be centralized without being shared with, or exposed to, the host.
A TEE is a hardware-based confidential computing approach that relies on specialized hardware features to create secure, isolated enclaves within a computer. These enclaves are small regions of memory protected from all other software and hardware on the system, including the operating system itself. When a program executes within an enclave, it can only access the data and code explicitly loaded into the enclave, while all other data and code remain inaccessible. Data is kept encrypted in memory while it is processed within the enclave. This means that even if attackers gain access to the operating system, they won’t be able to access the enclave.
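The resulting programming pattern looks roughly like the sketch below. The Enclave class here is a hypothetical placeholder defined inline so the example runs; in a real deployment, the isolation and memory encryption would be enforced by TEE hardware and its SDK, not by Python.

```python
# Conceptual sketch of the enclave pattern described above. The Enclave
# class is a hypothetical stand-in so the example is runnable; in a real
# TEE the isolation is enforced by hardware, not by this code.
class Enclave:
    def __init__(self):
        self._code = None
        self._data = None  # would live in encrypted, isolated memory

    def load(self, code, data):
        # Only what is explicitly loaded is visible inside the enclave.
        self._code = code
        self._data = data

    def run(self):
        # Computation happens on protected memory; only the result is
        # released back to the untrusted host.
        return self._code(self._data)


def train_step(records):
    # Placeholder "training" computation on the sensitive records.
    return sum(records) / len(records)


enclave = Enclave()
enclave.load(code=train_step, data=[3.2, 4.1, 5.0])  # illustrative values
model_update = enclave.run()
print(model_update)  # the host sees the output, never the raw records
```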
The principal cloud providers, Google, Microsoft and Amazon, offer confidential computing as a service, enabling developers to use their technology without needing to set up and manage their own infrastructure.
Ensuring model performance and security
Though the security of TEEs is promising, integrating ML in such an environment raises technical challenges. TEEs can create compatibility problems with the software (programming languages, libraries) and the hardware (notably GPUs) that play an important role in ML. Furthermore, secure enclaves can introduce memory and computation overhead, which further impacts the performance of ML workloads. Nonetheless, cloud providers are continuously optimizing their confidential computing offerings to deliver better performance for the secure processing of sensitive data.
While data and models are protected inside the TEE, it is during the third step of ML deployment (extracting results from the enclave) that model parameters can be traced back and attacked. These so-called ‘inference attacks’ enable attackers to reconstruct clients’ local data. To protect privacy, careful algorithm design and privacy analysis are a necessity. Typically, calibrated noise can be added to the data or the released results to guarantee their privacy, or a specific deployment architecture can be designed.
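One standard way to add such calibrated noise is the Laplace mechanism from differential privacy. The sketch below (with illustrative values, not Symphony’s implementation) perturbs an aggregate before it leaves the enclave so that no single client’s record can be confidently inferred from the released result:

```python
# Sketch of the Laplace mechanism from differential privacy: calibrated
# noise is added to a result before it is released from the enclave, so
# that no individual record can be reliably reconstructed from it.
# epsilon, the clipping bounds, and the data are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def private_mean(values, lower, upper, epsilon):
    """Release the mean of `values` with epsilon-differential privacy."""
    values = np.clip(values, lower, upper)       # bound each contribution
    sensitivity = (upper - lower) / len(values)  # max effect of one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

client_exposures = np.array([0.12, 0.55, 0.33, 0.91, 0.47])  # made-up data
print(private_mean(client_exposures, lower=0.0, upper=1.0, epsilon=0.5))
```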
In a nutshell
Symphony is currently evaluating how we can best guarantee the performance and security of our ML models. Their integrity will need to be thoroughly analyzed and tested against benchmark attacks. We aim to deliver high-performing AI tools without security trade-offs by designing secure architectures and robust models, guaranteeing the accuracy and privacy of results on top of data and model confidentiality.
Securing the entire ML pipeline will allow us to integrate conversational AI into Symphony and build innovative models, pushing the boundaries of existing AI techniques. Our AI tools will run on messages and calls exchanged on Symphony, as well as on public data such as social media and on private files and documents submitted by our clients. Our solutions will drive actionable insights for portfolio managers, research professionals, analysts, and other financial markets participants.
The goal is for Symphony AI tools to help our customers cut through the noise and deliver the business intelligence they need in real time, on a trusted infrastructure.