Understanding Bacalhau 1.0 in One Article: Unleashing the Potential of Private Data

This article is based on a speech by Simon Worthington at the Boston Summit in May 2023.

Bacalhau fundamentally changes how data processing is done by supporting local computation. Instead of moving data to wherever the analysis code runs, Bacalhau sends the code to where the data is located. By keeping data in place and allowing authorized, audited, and controlled computation on it, more data can be used while the risk of misuse falls. This is our answer to data governance problems. Data is growing 45% faster than network bandwidth, and 57% of data is stored outside of cloud or traditional data centers. For any organization operating at scale, moving data is slow and expensive.

There is also a good reason to keep data local: control. Almost 100% of data is governed in some form, whether through regulations such as the Health Insurance Portability and Accountability Act (HIPAA) or the General Data Protection Regulation (GDPR), or through internal protection of sensitive financial data or company secrets. Moving data to the computation takes it out of its usual secure zone and increases the risk of misuse.

Most data is not strictly open or closed but exists within certain boundaries. Within those boundaries, specific access permissions can be granted for specific purposes.

Source: The ODI

Since 2008, fines imposed globally for data governance failures have totaled nearly $250 billion. It is not surprising, then, that most companies fear data sharing, which leaves 68% of enterprise data unexploited and unused. In fact, most controlled data can be shared and used for better decision-making, but only with the right people and for the right purposes.

Data sharing requires technological enforcement

Many organizations attempt to meet this requirement through strict data sharing agreements or contracts. Establishing these agreements is expensive and time-consuming: in organizations such as national governments or financial institutions, it can take months of data governance work to enable data sharing even among internal teams.

Even worse, these agreements often don’t work. Most data sharing agreements are completely unenforceable and provide only a false sense of security. Once data crosses a trust boundary, only soft mechanisms (such as trusting that everyone will comply with the agreement) stand in the way of misuse. What actually happens to shared data is invisible and difficult to regulate.

“It turns out that contracts or agreements between data providers and data users often fail to work.

In the Cambridge Analytica scandal, contract terms were completely ignored, and personal data was abused.

The lack of any compelling technical evidence can prevent courts from obtaining effective information and make it difficult for regulatory bodies, politicians, journalists, and the public to understand what happened.”

– “Putting the trust in data trusts,” Register Dynamics, 2019

Clearly, what we need is a new way of reusing data across trust boundaries: one that gives analysts simple, controlled access to data without exposing data owners to regulatory fines or unwelcome headlines.

Bacalhau makes data sharing visible and auditable

At Bacalhau, we believe that computing on data where it lives is the answer to the data governance challenge. By keeping data in place and allowing authorized, audited, and controlled computation, more data can be used while reducing the risk of misuse.

More importantly, because Bacalhau is a distributed computing platform, there is no need to transfer data to a central store. Data can stay wherever it belongs within an organization, which avoids difficult organizational changes and does not deprive data owners of any control.

We are proud to announce that as part of Bacalhau 1.0, we have added job and data control features. With Bacalhau, data owners can control who computes on their private data, as well as what is computed, where, why, and how.

Bacalhau controls code and output

Bacalhau uses a two-step process for job control. First, data owners have the opportunity to review whether the job complies with their policies. This pre-control phase occurs before the job starts running and lets them approve or reject the computation based on the data to be used, the requester, and the code to be executed.

Although humans always remain in control, not every decision needs to be made manually. The pre-control process is highly flexible and can be automated as needed. Data owners can set policies, perform in-depth checks on the computation to be run, apply different policies to different requesters, and invoke complex algorithms to analyze security and risk. When a job is not suitable for automated control, the final decision can be made manually.
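To make the idea of automated pre-control with a manual fallback concrete, here is a minimal sketch in Python. It is not Bacalhau's API: the `JobRequest` fields, the `TRUSTED_IMAGES` and `DATASET_ACCESS` tables, and the `pre_moderate` function are all hypothetical names chosen for illustration. The sketch assumes a policy that looks at the three things mentioned above (the requester, the data, and the code) and either approves, rejects, or escalates to a human.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    ESCALATE = "escalate"  # hand the decision over to a human


@dataclass(frozen=True)
class JobRequest:
    # Hypothetical fields; this is not Bacalhau's actual job specification.
    requester: str
    image: str                 # the code to be executed, e.g. a container image
    datasets: Tuple[str, ...]  # names of the datasets the job asks to read


# Hypothetical policy data that a data owner might maintain.
TRUSTED_IMAGES = {"example.org/approved-stats:1.0"}
DATASET_ACCESS = {
    "alice": {"sales-2022", "sales-2023"},
    "bob": {"sales-2023"},
}


def pre_moderate(job: JobRequest) -> Decision:
    """Decide on a job before it runs, based on requester, data, and code."""
    allowed = DATASET_ACCESS.get(job.requester)
    if allowed is None:
        return Decision.REJECT    # unknown requester
    if not set(job.datasets) <= allowed:
        return Decision.REJECT    # asks for data this requester may not use
    if job.image in TRUSTED_IMAGES:
        return Decision.APPROVE   # known code on permitted data
    return Decision.ESCALATE      # unfamiliar code: leave it to a human
```

In this sketch, clear-cut cases are settled automatically, and only jobs running unfamiliar code reach a person, which mirrors the division of labor described above.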

Bacalhau provides two gates for a computation: one before it runs and one after.

Once approved, Bacalhau sends the job to the appropriate executor, which can only access the requested data and is securely isolated from the host system. Bacalhau imposes resource restrictions on the job to control processing power and memory usage.

Although pre-control provides a reasonable first line of defense, determining what a computer program will do without running it is, in general, a hard technical problem. The UK Office for National Statistics and similar controlled research environments have been securely allowing controlled access to data for decades, and we have drawn on their experience and practices. Therefore, in addition to pre-execution control, Bacalhau also allows the results to be reviewed after execution, before they are released to the job submitter.

When Bacalhau completes the computation, it saves the results to a private pre-release area. Administrators then check the results in the context of the job to determine whether they are what was expected. If the administrator deems the content suitable for sharing, the job submitter can download the results. Importantly, access to the private storage area is strictly locked down, and users can only stream the results of their own jobs through Bacalhau’s download function.

Just as with pre-control, complex analysis can be performed on the results. With Amplify technology, data owners can automatically detect personally identifiable information (PII), summarize CSV and other tabular data, and analyze the content of images and video clips. The generated metadata can be used to release results automatically and to give human decision-makers valuable information.
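As an illustration of how generated metadata might feed a release decision, the following Python sketch summarizes a CSV result and scans it for a couple of PII-like patterns. It does not use Amplify or any Bacalhau interface: `describe_result`, `release_decision`, and the two regular expressions are assumptions made for this example, and the patterns are deliberately simple (real PII detection needs far more than this).

```python
import csv
import io
import re

# Deliberately simple, illustrative patterns; not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US-style social security number


def describe_result(csv_text: str) -> dict:
    """Produce metadata about a CSV result without exposing its rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = (rows[0], rows[1:]) if rows else ([], [])
    return {
        "columns": header,
        "row_count": len(body),
        "emails_found": len(EMAIL_RE.findall(csv_text)),
        "ssns_found": len(SSN_RE.findall(csv_text)),
    }


def release_decision(metadata: dict) -> str:
    """Release clean results automatically; hold anything suspicious for a human."""
    if metadata["emails_found"] or metadata["ssns_found"]:
        return "hold for manual review"
    return "release"


# Hypothetical job outputs: an aggregate table and a table that leaks an email.
aggregate = "region,mean_spend\nnorth,41.5\nsouth,38.2\n"
leaky = "name,email\nJane,jane@example.com\n"

aggregate_meta = describe_result(aggregate)
leaky_meta = describe_result(leaky)
```

Here the aggregate table would be released automatically, while the table containing an email address would be held back with its metadata attached, so that the reviewer can see why.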

Control opens up new forms of federated learning

Computing on data across trust boundaries would enable a great deal of data sharing, but until now there has been no secure technical solution. If the data held by an organization can generate shared value when used more widely, that organization can now apply Bacalhau job moderation and open up data access without complex data governance.

For example, a university can provide more data to citizen scientists or external researchers, a government department can allow another department to analyze its data, or a team in a highly regulated financial institution can allow another team to analyze its data in depth. In each case, what matters is that raw data is never released to less-trusted users. Bacalhau ensures that users get only their analysis results and nothing more.

The same distributed, controlled computing model also enables collaborative learning between participants from different organizations. With Bacalhau, independent organizations can run in-depth analysis over their combined data without sharing that data. With federated learning, data scientists can now train machine learning or AI models on datasets from many independent or even competing organizations, while each organization keeps control of its data and can see exactly how it is used.

For example, central government agencies responsible for formulating macro policies can leverage the data held by local organizations. Similarly, industry bodies such as insurance regulators can train models by submitting federated learning Bacalhau jobs to all of their member insurance companies.

If the dataset were centralized, the valuable aggregated data would be at risk of being sold or misused; if the data is kept locally, each insurance company can be sure that its data is used only for mutually agreed and beneficial purposes.
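The federated pattern can be sketched with federated averaging, one common technique for this kind of training; the article does not say which algorithm a given Bacalhau job would use, so treat the choice as an assumption. In the Python sketch below, each organization fits a one-parameter model on its own data, and only the updated weight and a sample count leave the organization. All names (`local_update`, `federated_round`, `silos`) and the toy data are hypothetical.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs held privately by one organization


def local_update(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps for the model y = w * x on local data only."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w: float, silos: List[Dataset]) -> float:
    """One round of federated averaging.

    Each silo trains locally; only its weight and sample count are combined.
    """
    updates = [(local_update(w, data), len(data)) for data in silos]
    total = sum(n for _, n in updates)
    return sum(w_i * n for w_i, n in updates) / total


# Hypothetical private datasets, each roughly following y = 3x.
silos = [
    [(1.0, 3.1), (2.0, 5.9)],
    [(1.5, 4.4), (3.0, 9.2), (0.5, 1.6)],
    [(2.5, 7.4)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, silos)
```

After these rounds the shared weight `w` settles close to 3, even though no organization ever saw another's `(x, y)` pairs. In a controlled-computing setting, each `local_update` call would correspond to a moderated job running next to one organization's data.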

Analytical computing islands for specific topics

Finally, the fine-grained control over job execution that Bacalhau provides allows moderators to act as gatekeepers to computing islands. In this structure, independent computing providers and data owners who want to offer resources for a specific purpose can delegate job authorization to trusted moderators.

For example, scientists who collaborate to collect medical data that helps treat cancer can provide data and computing through external moderators they trust. Moderators only accept jobs that comply with agreed policies – in this case, only jobs that contribute to new cancer treatments are allowed.

With this approach, scientists can delegate external access requests to moderators, leaving themselves free to focus on the larger public-benefit goal. With Bacalhau’s powerful audit logs, scientists can later verify that the moderators acted in accordance with the agreed policies.
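One way to picture that after-the-fact verification is to replay the logged decisions against the agreed policy. The Python sketch below assumes a very simple record shape and policy; `AuditEntry`, `agreed_policy`, and `find_violations` are hypothetical and do not reflect the format of Bacalhau's audit logs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class AuditEntry:
    # Hypothetical record shape; not Bacalhau's actual audit log format.
    job_id: str
    purpose: str
    approved: bool


def agreed_policy(purpose: str) -> bool:
    """The policy the data owners agreed on: cancer-treatment research only."""
    return purpose == "cancer-treatment-research"


def find_violations(log: List[AuditEntry], policy: Callable[[str], bool]) -> List[str]:
    """Return the ids of jobs whose recorded decision contradicts the policy."""
    return [entry.job_id for entry in log if entry.approved != policy(entry.purpose)]


# Hypothetical log: the moderator wrongly approved a marketing job.
log = [
    AuditEntry("job-1", "cancer-treatment-research", True),
    AuditEntry("job-2", "marketing-analysis", False),
    AuditEntry("job-3", "marketing-analysis", True),
]

violations = find_violations(log, agreed_policy)
```

In this example only `job-3` is flagged. A real policy would rarely reduce to a single string comparison, but the principle is the same: because every decision is recorded, delegation does not have to mean blind trust.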

Bacalhau is the future of data sharing

We are excited to release job and data control features in Bacalhau 1.0! We believe that bringing computation to the data represents a new approach to data sharing: in short, keeping data secure by not sharing it!

Some companies and government institutions have already recognized the potential of controlled computation across trust boundaries, and we are working with them. If you would like to learn more about how these features can work for you, please join the Bacalhau Slack or contact us directly.
