Nov 29, 2023

Code Provenance Definition (Series Introduction)

Author: Matthew Wise

This blog post introduces Code Provenance, a foundational requirement for increasing software authenticity, integrity, security, and compliance.

What is Code Provenance?

Code Provenance is defined as all actors, artifacts, metadata, and events that contribute to the creation, development, and release of software. This includes information on when, where, how, and by whom the source code was produced, as well as the build system and steps used to create the software. Code Provenance captures all of this information to verify the authenticity, integrity, and trustworthiness of all actors, systems, and software artifacts used across the software development lifecycle (SDLC).
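To make the definition concrete, the sketch below models actors, artifacts, and events as plain Python data classes. It is only an illustration of the definition above: the class names, field names, and example values are assumptions made for this post, not a schema used by Archipelo, SLSA, or in-toto.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class Actor:
    """Who or what acted: a developer, a build system, an AI tool."""
    identifier: str
    kind: str  # illustrative values: "developer", "build-system", "ai-tool"


@dataclass(frozen=True)
class Artifact:
    """A source file or build output, identified by its content digest."""
    name: str
    sha256: str


@dataclass
class Event:
    """One lifecycle step: who did what, to which artifact, when, and where."""
    actor: Actor
    action: str        # illustrative values: "commit", "build", "release"
    artifact: Artifact
    timestamp: datetime
    location: str      # e.g. a repository URL or a build host name
    metadata: dict = field(default_factory=dict)


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest used to identify an artifact."""
    return hashlib.sha256(content).hexdigest()


def actors_for(events, artifact_name):
    """Return every actor that touched the named artifact."""
    return {e.actor for e in events if e.artifact.name == artifact_name}


# Hypothetical example data: one developer commit followed by one CI build.
source = Artifact("app.py", digest(b"print('hello')\n"))
dev = Actor("alice@example.com", "developer")
ci = Actor("ci-runner-1", "build-system")
events = [
    Event(dev, "commit", source, datetime.now(timezone.utc),
          "https://example.com/repo.git"),
    Event(ci, "build", source, datetime.now(timezone.utc), "build-host-01"),
]

print(sorted(a.identifier for a in actors_for(events, "app.py")))
# prints ['alice@example.com', 'ci-runner-1']
```

Even this small model answers the "who, what, when, where" questions for a single artifact; a real system of record would also need to protect these records against tampering.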

The optimal state of a code provenance ecosystem can be summarized as follows:

To achieve UNIFIED code provenance across the OSS ecosystem and commercial industry, there MUST be an underlying system of record for code provenance. This system of record will enable the capture of code and metadata comprising the actors, events, and source code artifacts that contribute to the creation, development, and release of software, including the development of proprietary code and all external and internal components and tools used to deliver software.

 


Figure 1. Code Provenance Ecosystem.

What is the importance of Code Provenance?

Without Code Provenance, it is impossible to proactively observe, trace, monitor, and remediate underlying security and compliance risks in the source code at the earliest stages of the software development lifecycle.

As articulated in the NATIONAL CYBERSECURITY STRATEGY 2023 [1]:

“Markets impose inadequate costs on—and often reward—those entities that introduce vulnerable products or services into our digital ecosystem. Too many vendors ignore best practices for secure development, ship products with insecure default configurations or known vulnerabilities, and integrate third-party software of unvetted or unknown provenance.”

Why do most open-source software (OSS) projects and commercial software vendors NOT include Code Provenance information?

The answer is simple. Traditionally, software development has focused more on creating and delivering software quickly than on tracking and documenting its origins and attesting to its provenance. This historical emphasis on speed, efficiency, and functionality has overshadowed the need for comprehensive code provenance, resulting in a lack of observability into the origins of delivered software.

Why should the industry adopt Code Provenance best practices now?

The industry must prioritize Code Provenance as a natural response to the evolving cybersecurity threat landscape and the impact of artificial intelligence (AI) on software development. As software creators adopt AI, security breaches and attacks are likely to increase as a function of unregulated and “Shadow IT” activity (e.g. AI software development), which is a new and growing attack vector. (See our Emerging Cybersecurity Risks of AI and LLMs white paper.)

Current Market Situation

The community's work on the SLSA (Supply Chain Levels for Software Artifacts) [2] and in-toto [3] frameworks deserves acknowledgment as a catalyst for relevant discussions around the integrity of software supply chains. However, these frameworks have some limitations when assessed against the broader context and importance of unified Code Provenance.

These frameworks primarily focus on a specific aspect of the Code Provenance landscape—the build process. For example, the current constraint of SLSA lies in its emphasis on the Build Track. While effective in safeguarding against tampering during and after the build process, it falls short in comprehensively addressing concerns related to creator trust, integrity, and authenticity. To overcome this limitation, we need to broaden our perspective and acknowledge that it is equally important to consider the actions and actors involved in code creation before the build.

Archipelo Solution

Archipelo has created the underlying system of record for Code Provenance, which can be used by OSS maintainers and commercial companies. Archipelo empowers individual developers and organizations to proactively observe, trace, and monitor the origins of their code, and to continually observe software creation, maintenance, and distribution. This improves software security and compliance for organizations shipping software to end users and customers.

The Archipelo Code Provenance Engine and developer tools seamlessly capture source code and contextual metadata throughout the SDLC, without disrupting the development workflow. All metadata, artifacts, and source code are correlated with specific actors, both developers and tools (including AI tools), streamlining the process of Code Provenance attestation and verification.
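As a rough illustration of what attestation and verification mean in general, the sketch below binds an actor and an action to an artifact digest, signs the statement, and later checks both the signature and the digest. This is not the Archipelo engine or the SLSA or in-toto attestation format: the function names and statement fields are hypothetical, and an HMAC with a shared key stands in for the asymmetric signatures a real system would more likely use.

```python
import hashlib
import hmac
import json


def make_attestation(artifact: bytes, actor: str, action: str, key: bytes) -> dict:
    """Create a signed statement tying an actor and action to an artifact digest."""
    statement = {
        "actor": actor,
        "action": action,
        "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
    }
    # Serialize deterministically so signer and verifier hash identical bytes.
    payload = json.dumps(statement, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"statement": statement, "signature": signature}


def verify_attestation(attestation: dict, artifact: bytes, key: bytes) -> bool:
    """Check that the statement is untampered and matches the given artifact."""
    statement = attestation["statement"]
    payload = json.dumps(statement, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, attestation["signature"])
    digest_ok = hmac.compare_digest(
        statement["artifact_sha256"], hashlib.sha256(artifact).hexdigest()
    )
    return signature_ok and digest_ok


# Hypothetical usage: a developer attests to a commit, a consumer verifies it.
key = b"demo-key-not-for-production"
attestation = make_attestation(b"print('hello')\n", "alice@example.com", "commit", key)

print(verify_attestation(attestation, b"print('hello')\n", key))  # prints True
print(verify_attestation(attestation, b"print('evil')\n", key))   # prints False
```

The second check fails because the artifact no longer matches the digest recorded in the statement, which is the property that lets a consumer detect changes made after the attested step.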

The mission is to make these capabilities accessible to all OSS developers and commercial organizations, encouraging voluntary contribution via automated Provenance capture and verification.

Conclusion

The fundamental issue affecting the entire software ecosystem is the lack of Code Provenance, which makes it impossible to proactively observe the end-to-end code creation and development process. Addressing this limitation is crucial to establishing a trustworthy and secure software ecosystem. Code Provenance extends visibility beyond the commit and build level, enabling unified attestation of authenticity and integrity in released and deployed software.

The proposed Code Provenance thesis provides a holistic approach to addressing the core goal of increasing software security and compliance in OSS and commercial software development across the SDLC. Moreover, it provides a well-defined capability to validate the authenticity and integrity of software across the OSS community and commercial software industry.

Co-author: Kacper Skawinski

Resources

  1. “NATIONAL CYBERSECURITY STRATEGY”, The White House, Washington, 2023 [https://www.whitehouse.gov/wp-content/uploads/2023/03/National-Cybersecurity-Strategy-2023.pdf]
  2. “Supply-chain Levels for Software Artifacts, or SLSA ("salsa").” [https://slsa.dev/]
  3. “in-toto - A framework to secure the integrity of software supply chains.” [https://in-toto.io/]
