Shaping the Future of Generative AI with Open Source and Open Science

Matt White
18 min read · Jul 8, 2023

How open source and open science will solve some of the most pressing issues with generative AI and act as a counter-balance against closed black-box models developed by big tech.

The Open Source Movement

Although open source as we know it today did not begin until the formation of the Open Source Initiative in 1998, its predecessor, the Free Software movement, embraced many of the same values (not to be confused with “freeware,” which was free in the monetary sense but not necessarily open.) I won’t dive into the benefits of one model over the other, but during my career there have been AI projects released under what we might consider open terms, like Open Lisp in the 1980s. It wasn’t until the 2010s, however, that we really saw a surge of open-source projects in AI. TensorFlow, Keras, PyTorch, scikit-learn, even Python and R, along with an endless list of libraries, helped advance AI research and commercialization by making these frameworks and tools generally available to everyone under permissive licensing. The 2010s also saw the release of larger open data sets like ImageNet, COCO, MNIST and CIFAR-10/100, which researchers and learners have relied on extensively to train, test and validate their deep learning models.

There are a lot of great AI-related open source projects out there. Some are managed by independent and trusted third parties, like PyTorch at the PyTorch Foundation (Linux Foundation), Kubeflow at the CNCF (Linux Foundation), Apache MXNet at the Apache Software Foundation and Project Jupyter at NumFOCUS. These projects are usually maintained by a number of companies alongside a long list of individual contributors. Alternatively, there are open source projects managed by their original vendor: Databricks maintains MLflow, Google continues to maintain TensorFlow and Hugging Face maintains its Transformers library.

The Open Source Model Movement

The open source model movement really began in 2018, when Google released BERT as open source; it followed up with T5 in 2019, the same year OpenAI released GPT-2 to the public. In 2021 EleutherAI released its first LLM, GPT-J, as open source with permissive licensing (Apache 2.0). However, it was only in 2022 that we saw a substantial increase in deep learning models being released to the public as open source, with a major surge in 2023 led by large language models like Meta’s LLaMA, Salesforce’s XGen-7B, Databricks’ Dolly 2.0, MosaicML’s MPT-7B and TII’s Falcon-40B, along with a long list of derivative fine-tuned models.

Now, two of those large language models were not what we would consider open source. First, Meta released LLaMA as open source but attached a research-only license to the weights, which meant no commercial usage. To meet the definition of open source, a model has to be released with its weights and biases without any form of restrictive licensing. TII, for its part, first released its Falcon LLM under a modified version of the Apache 2.0 license that included a clause requiring anyone whose commercial usage generated fees above a certain amount to pay royalties back to TII. This was met with considerable backlash, and TII re-released the model under the permissive Apache 2.0 license. Since then, new generative models have consistently been released under the permissive Apache 2.0 or MIT licenses, which is great for startups, businesses and consumers (I’ll cover more of this further on.)

There are additional popular generative models I did not mention, including BLOOM and Stable Diffusion, because they use some version of the RAIL license, a restrictive license that limits downstream usage. These projects do not meet the definition of open source, and their licenses are not accepted by the OSI, which maintains the open-source definition and a list of licenses that qualify as open source. Unfortunately, the term open source has been misused to describe any project that supplies the model source code and state dict. Apache 2.0 and MIT are the two licenses attached to most open source AI projects and models at the moment. However, there has been some discussion about creating a new licensing framework that covers model weights, since they are neither code nor a data file in the conventional software development sense, and training data has legal nuances that merit its own licensing consideration (many data sets are now being released under CC BY 4.0.)

Companies like Hugging Face have built their business model around open source; they include open source models in their Transformers, Diffusers and other libraries, making access to deep learning models extremely easy. This effort, I believe, has helped advance the open source movement for artificial intelligence. And the open source AI movement appears to be gaining momentum, with new projects being incubated or released on a near-daily basis. Many in the public see the open source model movement as a counter-force against the large, opaque black-box models being pay-walled by big tech companies. The movement shows no signs of slowing down, especially in the LLM space, where innovations like LoRA are lowering the resources needed to fine-tune and serve models.

The Open Science Movement

However, there is another movement that has seen far less adoption: the open science movement. Open science goes further than open source. It includes publishing the research and findings behind the process of developing, training and evaluating the deep learning model (the research paper), and it requires releasing the data sources (with a data card), the model (with a model card), the training and inference code, and the weights and biases along with the optimizer state. EleutherAI is a pioneer in this space and has published its research papers and data sets and made its models, code and weights available to anyone who wishes to download them. LAION is also a good steward in this space, releasing data sets and Open Assistant under Apache 2.0.

Understanding Open Science and Open Source

Open Source refers to the approach where the source code of a project or product is publicly available. This availability allows anyone to inspect, modify, or distribute the software, spurring collaborative development and innovation. It is released under permissive licensing that does not restrict its downstream use so that anyone can take the code and build products and services around it or use it in their organization for any purpose (depending on the license it may require all changes to be contributed back to the project and/or attribution.)

Open Science, on the other hand, refers to a scientific process that values transparency, accessibility, and collaboration. It encourages researchers to share their methodologies, data, and findings openly, fostering greater scrutiny, replication, and enhancement of research. As I mentioned previously, for a deep learning model to be considered open science, the project must release all of the following:

  • A publicly published research paper that documents the work done to develop and train the model, explains how it was evaluated and whether it carries any biases or risks, and describes the data sets used to train, test and validate the model(s).
  • The model itself.
  • The weights and biases and optimizer state (in a generally accepted format.)
  • The data set (prepared training, testing and validation sets. In some cases raw data.)
  • Companion materials, including documents, scripts and configuration files (training, evaluation and inference, data preprocessing scripts, feature engineering, hyper-parameter tuning scripts, requirements file and any documentation.)
  • A model card.
  • A data card.
  • An open source permissive license (in a LICENSE file.)

The Research Paper

AI research papers, like other scientific papers, usually follow a structure that helps researchers communicate their findings in a clear and organized manner. There is no industry standard, so the exact format can vary depending on the specific guidelines of the journal or conference where the paper is being submitted, but most AI research papers contain the following sections:

Title: The title is a concise statement of the main topic of the paper. It should be informative and accurately reflect the content of the paper.

Abstract: The abstract is a brief summary of the paper. It should provide the context or background for the study, the research objectives, the methods used, the main findings, and the conclusions.

Introduction: The introduction provides background information and sets the context for the research. It explains the problem the paper addresses, the motivation for solving this problem, the research objectives, and the contributions of the paper.

Related Work/Literature Review: This section surveys the most relevant previous work. It helps to position the paper within the broader academic discourse, demonstrating how the current work is different from or builds upon past studies.

Methods/Approach: This section describes the algorithms, techniques, and methodologies used in the research. It often includes information about the data used for training and testing models, the experimental setup, and the specific algorithms or models developed or used.

Results/Experiments: This section presents the findings of the research. It includes details about the experiments conducted, the data analyzed, and the results of these analyses. Results are often illustrated with figures, tables, and charts.

Discussion: The discussion interprets the results in detail. It explains the implications of the findings, discusses the limitations of the study, and may compare the results with those of other studies.

Conclusion: The conclusion summarizes the paper, restates the main findings, and may suggest areas for further research or potential applications of the work.

References/Bibliography: This section lists the details of the articles, books, and other sources cited in the paper.

In addition to these, some AI research papers may include sections such as:

Theoretical Analysis: If the paper introduces a new algorithm or method, there may be a theoretical analysis to prove certain properties about it, such as convergence or complexity.

Appendix: The appendix can contain supplementary material that doesn’t fit into the main body of the paper, such as detailed algorithms, mathematical proofs, or extensive data tables.

Each section of the paper should be written clearly and concisely, providing enough detail that the study could be replicated by other researchers. Writing a good AI research paper often involves a balance between providing enough detail for an expert audience while still making the content accessible for readers who may not be specialists in the specific topic of the paper.

The Model

The model is the code that represents the untrained model’s architecture. The model architecture specifies the structural design of the model — the way its components are organized and connected but doesn’t include any learned information. In a neural network, for instance, the architecture would define the number and types of layers in the network, how those layers are connected, the type of activation function used in each layer, and other structural details. Frameworks handle this differently: PyTorch, for example, typically keeps the architecture as Python code and saves only the learned parameters, although the full model object can also be serialized together with its weights and biases.
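
To make this concrete, here is a minimal, hypothetical sketch of what a model architecture looks like as code in PyTorch. The class name and layer sizes are invented for illustration; the point is that it defines structure only, with no learned knowledge until it is trained.

```python
import torch.nn as nn

class SmallClassifier(nn.Module):
    """Hypothetical architecture: layers and connections only, nothing learned yet."""
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # token embedding layer
        self.hidden = nn.Linear(embed_dim, 64)                # fully connected layer
        self.activation = nn.ReLU()                           # activation function
        self.output = nn.Linear(64, num_classes)              # classification head

    def forward(self, token_ids):
        x = self.embedding(token_ids).mean(dim=1)             # average pooling over tokens
        return self.output(self.activation(self.hidden(x)))
```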

The Weights and Biases and Optimizer State

For deep learning models, weights, biases, and the optimizer state are critical complementary components and are the result of training the neural network. Without weights and biases the model cannot make predictions, and the optimizer state, although not necessary for inference, allows a user to resume training a model and reveals details like the learning rate, momentum and other hyper-parameters used. Let’s go into more detail:

Weights: Weights are parameters that the model learns during the training process. In a neural network, each connection between neurons has an associated weight. These weights determine the importance or influence of inputs on the outputs. Initially, they are typically set randomly or using some specific initialization strategy. During the training process, these weights are adjusted based on the error the model produces in its predictions.

Biases: Like weights, biases are also parameters that the model learns. A bias is an additional parameter that allows the model to adjust its output independently of its input. It’s a kind of offset that allows the model to fit the data better. The bias term in a neural network is a unique neuron that has no connections to the previous layer and always outputs a constant value, typically 1, which is then multiplied by its own weight, effectively functioning as an adjustable constant value in the computation of the layer’s output.

Optimizer State: During the training process, the optimizer updates the weights and biases in order to minimize the loss function. The way it does this update depends on the optimization algorithm used (like Stochastic Gradient Descent, Adam, RMSprop, etc.). The optimizer state typically includes accumulated information from previous iterations of the training process. For instance, in the case of the Adam optimizer, the state includes estimates of the first and second moments of the gradients (basically, moving averages and moving variances of the gradients). This information is used to adjust the learning rate during training.

These elements are usually stored within the trained model file. For example, in TensorFlow and Keras, the model’s weights, biases, and optimizer state can be saved together in an HDF5 file (.h5 or .keras). In PyTorch, the model's state_dict (a Python dictionary object that maps each layer in the model to its trainable parameters (weights and biases)) and the optimizer's state_dict (which contains information about the optimizer’s state and hyperparameters) can be saved together in a pickle file (.pt or .pth).

Each of these files contains numeric values. For weights and biases, these values represent the learned parameters of the model. For the optimizer state, these values represent the state of the optimizer at a specific point in the training process.
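
As a minimal sketch of the PyTorch case, a training checkpoint that captures the weights, biases and optimizer state might look like the following; the tiny model and the file name are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a tiny model and an Adam optimizer to checkpoint.
model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

checkpoint = {
    "model_state_dict": model.state_dict(),          # learned weights and biases
    "optimizer_state_dict": optimizer.state_dict(),  # moment estimates, step counts, hyper-parameters
}
torch.save(checkpoint, "checkpoint.pt")

# To resume training later: rebuild the same architecture, then restore both states.
restored_model = nn.Linear(10, 2)
restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=1e-3)
saved = torch.load("checkpoint.pt")
restored_model.load_state_dict(saved["model_state_dict"])
restored_optimizer.load_state_dict(saved["optimizer_state_dict"])
```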

Companion Scripts and Configuration Files

The model cannot be trained without the code and configuration files necessary to prepare the data, extract features (when needed), train the model, evaluate it, and perform inference, among other tasks that depend on the architecture, application and modality of the model. These elements allow users to understand, train and evaluate the model for themselves.

Preprocessing Scripts: These are the scripts used to clean and preprocess your raw data. This might include tasks such as handling missing values, normalization, encoding categorical variables, tokenizing text, serializing images, etc.
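
As a hypothetical illustration of the kind of logic such a script contains, the sketch below lowercases raw text, strips punctuation and tokenizes on whitespace before the data would be split into training, validation and test sets.

```python
import re

def preprocess(text):
    """Toy preprocessing: lowercase, strip punctuation, whitespace-tokenize."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation and symbols with spaces
    return text.split()

print(preprocess("Open science, like open source, values transparency!"))
# ['open', 'science', 'like', 'open', 'source', 'values', 'transparency']
```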

Feature Engineering Scripts: These scripts are used to generate new features from your preprocessed data that can improve your model’s performance.

Model Training Scripts: These scripts define your model architecture (for example, the layers in a neural network), the loss function, the optimizer, and the training process (for example, the number of epochs). They are used to train your model on your training data.
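
A minimal training loop, sketched here in PyTorch with randomly generated data standing in for a real, preprocessed dataset, shows how the architecture, loss function, optimizer and epoch count come together.

```python
import torch
import torch.nn as nn

# Hypothetical data: random tensors standing in for a real, preprocessed dataset.
inputs = torch.randn(256, 10)
labels = torch.randint(0, 2, (256,))

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))  # architecture
loss_fn = nn.CrossEntropyLoss()                                        # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)              # optimizer

for epoch in range(5):                        # number of epochs
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)     # forward pass and loss computation
    loss.backward()                           # backpropagation
    optimizer.step()                          # parameter update
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```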

Model Evaluation Scripts: These scripts are used to evaluate your trained model’s performance on your validation and test data. They might calculate metrics like accuracy, precision, recall, F1 score, AUC-ROC, BLEU, ROUGE, etc. This can also be a reference to a generally accepted benchmark suite like Stanford’s HELM or EleutherAI’s Language Model Evaluation Harness.
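
As a simple sketch of what such a script computes, the example below compares hypothetical held-out labels against a model’s predictions using scikit-learn.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical held-out labels and the model's predictions for them.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```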

Hyper-parameter Tuning Scripts: These scripts are used to tune the hyper-parameters of your model (like learning rate, number of layers, number of units per layer, etc.) to improve its performance.
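
A hypothetical sketch of a simple grid search; the train_and_score function here is an invented stand-in for whatever train-then-evaluate routine a real project would expose.

```python
from itertools import product

def train_and_score(lr, hidden_units):
    """Invented stand-in: a real script would train a model and return a validation score."""
    return 1.0 - abs(lr - 1e-3) - hidden_units * 1e-4  # toy score, for illustration only

# Exhaustive grid search over two hyper-parameters, keeping the best combination.
grid = {"lr": [1e-2, 1e-3, 1e-4], "hidden_units": [32, 64, 128]}
best_lr, best_units = max(product(grid["lr"], grid["hidden_units"]),
                          key=lambda combo: train_and_score(*combo))
print(f"best learning rate: {best_lr}, best hidden units: {best_units}")
```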

Inference Scripts: These scripts load the trained model and use it to make predictions on new, unseen data.
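
A minimal hypothetical PyTorch sketch: rebuild the architecture, load previously saved weights (the file name is a stand-in), switch to evaluation mode and predict on new data.

```python
import torch
import torch.nn as nn

# Rebuild the same architecture used during training (hypothetical sizes).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
model.load_state_dict(torch.load("model_weights.pt"))  # hypothetical weights file
model.eval()                                            # disable training-only behaviour

with torch.no_grad():                                   # no gradients needed for inference
    new_sample = torch.randn(1, 10)                     # stand-in for new, unseen data
    prediction = model(new_sample).argmax(dim=1)
    print(prediction.item())
```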

Requirements File: If you’re using Python, a requirements.txt file or a Pipfile is used to list the dependencies and their versions that are needed to run your code. If you're using R, you might have a similar file for install.packages() calls.

Documentation: Good projects will always include documentation. This could be README.md files, comments in the code, or separate documents. Documentation explains what your code does, why certain decisions were made, and how a new user can get your code running.

The Data Set

Ideally, all data sets are included with the release of the open science project: the raw data sets as well as the processed training, test and validation sets. For some models these data sets can be substantial in size; EleutherAI’s “The Pile,” for instance, is 825GB.

In the companion materials, there should be documentation that describes how the data was prepared. This can sometimes be found in the data card or the research paper as well. All of the scripts used to prepare the data should also be included in the companion code; this allows another researcher or developer to try new approaches to preparing the data in order to improve the model’s performance or re-align the model for different purposes.

The Model Card

A model card is a document that provides a detailed overview of a deep learning model’s features, performance, intended uses, and limitations. It’s designed to help potential users understand whether the model is suitable for their needs and how to use it responsibly. Introduced by researchers at Google, model cards are now increasingly used as part of efforts to improve transparency and accountability in AI.

A model card typically contains the following:

Model Details: This includes the model name, the date it was released, its version, the developer’s information, and a brief description of the model.

Intended Use: This section outlines the tasks the model was trained for and describes the situations in which the model should be used.

Factors: This section describes the characteristics considered in the model’s development and testing. This can include demographic information such as age, gender, ethnicity, or other relevant factors depending on the model’s purpose.

Metrics: This section provides performance metrics of the model. This can include precision, recall, accuracy, F1 score, AUC-ROC, etc. It might also highlight how these metrics vary across different factors or conditions.

Evaluation Data: This section describes the datasets used to evaluate the model’s performance. It might also provide information about the data collection process, the demographic distribution of the data, and any potential biases in the data.

Training Data: This section provides details about the data used to train the model. Like the Evaluation Data section, it might also include information about the data collection process, demographic distribution, and potential biases.

Ethical Considerations: This section discusses the ethical aspects considered during the model’s development, any potential misuse of the model, and steps taken to mitigate such risks.

Caveats and Recommendations: This section provides limitations of the model, potential risks, and recommendations for using the model responsibly.

The specific contents of a model card can vary depending on the model and the organization developing it. However, the ultimate goal is to provide a clear, accessible summary of the model’s characteristics and performance to help users make informed decisions.

The Data Card

A “Data Card” or “Dataset Nutrition Label” is a concept proposed to increase transparency and accountability around datasets used for training deep learning models. The idea behind a data card is similar to the concept of the model card used for deep learning models. It provides essential details about a dataset in a standardized and easily understandable way to help developers, researchers, and other stakeholders understand its content, source, and characteristics, as well as any potential biases that might exist.

Here’s what a data card typically contains:

Dataset Details: This includes the dataset name, the date it was created or released, its version, the organization or person who created it, and a brief description of the dataset.

Collection Method: This section describes how the data was collected. This could include details about who collected the data, when and where it was collected, what tools were used to collect it, and the data collection protocol.

Preprocessing/Cleaning/Labeling: This section outlines the steps taken to preprocess, clean, and label the data. It can also include information about who carried out these tasks and any tools or software used.

Dataset Composition: This includes information about what the data consists of, such as the number of instances, the number of features or labels, the types of features (numerical, categorical, text, etc.), and any other defining characteristics. It may also discuss the balance or imbalance of classes in the dataset.

Uses and Use Cases: This section describes intended uses of the dataset, the tasks it is relevant for, and any known use cases.

Distribution and Access: This section provides information about how to access the dataset, any usage restrictions or requirements, and details about the dataset’s distribution, such as whether it is publicly available, requires permission to access, etc.

Data Maintenance and Governance: This provides details about updates to the dataset, any version control in place, and the responsible parties for the dataset’s maintenance and governance.

Ethical Considerations: This section discusses the ethical aspects considered during the dataset’s creation, any potential misuse of the dataset, and steps taken to mitigate such risks.

Known Limitations and Biases: This section outlines known limitations and potential biases in the dataset. It can include biases in data collection, labeling, or representation, and how these may impact the models trained on the dataset.

Like the model card, the goal of a data card is to provide transparency about the dataset and encourage responsible usage. The specifics can vary based on the dataset and the organization creating the data card as no industry standard exists.

The License File

Including the license file is necessary for any open science or open source project, whether it be tools, frameworks, products or any class of machine learning model. The license file is conventionally named “LICENSE” and placed in the root directory of the project’s source code. These license files can be downloaded directly from the maintainer of the license.

An open-source license file typically contains the following:

License Name: This is the type of license under which the project is released. It could be any of the widely recognized open-source licenses, such as MIT, Apache 2.0, or GNU General Public License (GPL), among others, with a preference in the AI community for MIT and Apache 2.0 as permissive licensing frameworks.

Terms and Conditions: The file contains the terms and conditions for using, distributing, and modifying the software. This includes what users are permitted to do, such as distribute, modify, or use the software for commercial purposes. It also mentions what users are required to do, such as include the original license in any copies or substantial uses of the work.

Disclaimers: It may also contain disclaimers of warranty and liability. This is to protect the creators from being held liable for any damages that might arise from the use of the software.

Copyright Notice: It includes the copyright notice, specifying who holds the copyright to the software.

Contributor License Agreement (CLA): If the project accepts contributions from the public, the license may also mention the terms of the CLA. This is an agreement contributors need to accept, stating that they agree to the terms of the license when contributing to the project.

The Benefits of Open Science and Open Source

Fostering Innovation

Open science and open source are pivotal drivers of innovation. By providing access to research and code, they stimulate a diverse community of researchers, developers, and enthusiasts to contribute, thereby driving the evolution of generative AI. Open source frameworks like TensorFlow and PyTorch have already catalyzed AI research and development, powering AI advancements around the world. Generative AI will be advanced similarly through the adoption of open science and open source.

Open source generative AI models with permissive licensing allow startups and innovators to build upon these foundational elements to create compelling and valuable new products and services. Open source also creates a more competitive commercial landscape where developers and consumers have choice and alternative solutions that meet their needs.

Enhancing Transparency and Trust

Open Science increases transparency in research processes and findings, allowing other scientists and the public to scrutinize, validate, or challenge them. Similarly, Open Source affords transparency into the functioning of an AI system, reducing the risk of hidden biases and malfunctions. This enhanced transparency builds trust in AI systems and aids in creating responsible and ethical AI.

The transparency fostered by open science not only benefits the scientific community but also serves as a positive beacon for legislators seeking to responsibly regulate the development and utilization of AI.

Democratizing AI

Open science and open source also democratize access to AI. They make advanced technologies and findings available to individuals and institutions that otherwise might not have the resources to develop or access them. In this way, they reduce technological inequity and empower a more diverse group of innovators.

Leveraging Open Source for Efficient AI Models

One significant trend in AI is the development of smaller, more efficient models. Open source plays a critical role in this process. Developers across the globe can work collaboratively, optimizing algorithms, compressing models, and exploring novel architectures. By crowdsourcing this process, the community can accelerate the development of lean and efficient AI models, making them more accessible to resource-constrained environments and devices.

The Risks of Open Science and Open Source

Despite their benefits, open science and open source also come with risks. The open nature of these systems can expose sensitive data or proprietary information, making them susceptible to misuse. For example, advanced generative AI models could fall into the wrong hands, leading to the generation of deepfakes or automated disinformation campaigns.

Moreover, while open source can lead to more robust and efficient AI models, it can also result in “too many cooks in the kitchen”, leading to version control issues, inconsistent coding practices, or dilution of project focus. Balancing the benefits of open collaboration with the need for strategic direction and control can be challenging.

The issue of bias is pervasive, since generative models are trained on data produced by humans. However, open data sets and documentation of the processes used to preprocess the data create transparency and allow evaluators to assess the model for bias.

The issue of copyright is currently being litigated in the courts, and I have spoken at length about this in the past. We will have to see how it plays out, but I suspect that the output of generative models will be treated as inspired works, meaning that just as you or I could view paintings from different artists and then create our own piece without infringing on copyright unless it was a near-exact replica, so too will generative models. The US Copyright Office has already ruled that AI-generated works cannot be copyrighted; although that case concerned AI-generated art, I suspect the ruling will be extended to language works, code, videos and other modalities.

Conclusion

As generative AI continues to improve and evolve, the principles of open science and open source will be integral to the advancement of the field. They offer the promise of a more innovative, transparent, and equitable AI landscape. However, as we tread this path, we must be mindful of the risks and work towards robust strategies and regulations that can minimize potential misuses and unintended consequences. The future of generative AI, thus, lies in a balanced, ethical, and strategic application of these open principles.

Matt White

AI Researcher | Educator | Strategist | Author | Consultant | Founder | Linux Foundation, PyTorch Foundation, Generative AI Commons, UC Berkeley