Module Assets | Round 22

Project Name

Module Assets


Project Category

Unleash Data


Proposal Earmark

New Entrants


Proposal Description

1 Introduction

The lack of interoperability in machine learning has allowed platforms to fill the void (ZenML, Lightning, Fast.ai, Hugging Face). These companies attempt to connect different tools into one cohesive toolset, striving to be the one platform to rule them all. Despite many platforms proclaiming their support for open ML, many do not fully open source their codebases, and some monetize their community’s intellectual property for corporate clients. Platforms can also lock in developers, as they build tools that are fully compatible with their own ecosystem but not with competitors’. Paradoxically, these platforms create walled gardens that prevent developers from collaborating effectively. Asset Modules aims to be a platform-agnostic standard for storing machine learning assets. Asset modules integrate easily with decentralized storage protocols including IPFS, Filecoin, and Arweave. We believe that such a standard is necessary for interoperability across ML assets and reduces the risk of vendor lock-in. We also hope that forming such a standard encourages developers to store, share, and connect assets, and gives them the chance to monetize their creative endeavors directly with consumers.

1.1 Asset Module

An asset module is an abstraction over any machine learning asset stored in a folder. This can be a dataset, a model, or a processing function (e.g. a tokenizer). An asset consists of several core components that allow it to generalize over any machine learning module. By default, each asset consists of a main script (Python, Java, Rust), a configuration file (YAML, JSON), and state data file(s). In certain cases, a module also includes files for installing dependencies, such as additional data files or environment configuration files (Dockerfile, conda YAML, pip requirements).
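
As a rough illustration (the file names below are assumptions for this sketch, not a fixed layout), an asset module folder and a minimal validity check could look like this:

    # Hypothetical layout of an asset module folder:
    #
    #   my_asset/
    #     module.py         # main script
    #     config.yaml       # configuration file
    #     state/            # state data file(s)
    #     Dockerfile        # optional environment file
    #     requirements.txt  # optional pip dependencies
    #
    # Minimal sketch for checking that a folder contains the core components.
    from pathlib import Path

    CORE_FILES = ["module.py", "config.yaml"]  # assumed names for the main script and config

    def is_asset_module(folder: str) -> bool:
        """Return True if the folder contains the core asset components."""
        root = Path(folder)
        return all((root / name).exists() for name in CORE_FILES)

    print(is_asset_module("my_asset"))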

1.2 Asset Registration and Sub-Licensing with ERC721

Assets can be represented by on-chain ERC721 smart contracts. This allows developers to register their assets on a blockchain. This is currently restricted to EVM smart contracts, but will hopefully generalize to other blockchains in future work. The smart contract stores the asset’s DDO and additional information (see Figure 1).

Figure 1: MetaHub Assets DDO and additional info
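
As a hedged sketch of what registration could look like with web3.py, assuming a hypothetical ERC721 registry contract with a registerAsset() function (the address, ABI, and function name below are placeholders, not the project's actual interface):

    # Minimal sketch (web3.py v6 style): registering an asset on an EVM chain through a
    # hypothetical ERC721 registry contract. All names below are illustrative placeholders.
    from web3 import Web3

    RPC_URL = "http://localhost:8545"                                # placeholder RPC endpoint
    REGISTRY_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder contract
    REGISTRY_ABI = [{
        "name": "registerAsset", "type": "function",
        "inputs": [{"name": "did", "type": "string"}, {"name": "ddoUri", "type": "string"}],
        "outputs": [],
        "stateMutability": "nonpayable",
    }]

    w3 = Web3(Web3.HTTPProvider(RPC_URL))
    registry = w3.eth.contract(address=REGISTRY_ADDRESS, abi=REGISTRY_ABI)

    def register_asset(did: str, ddo_uri: str, owner: str, key: str) -> bytes:
        """Build, sign, and send a registration transaction; returns the tx hash."""
        tx = registry.functions.registerAsset(did, ddo_uri).build_transaction({
            "from": owner,
            "nonce": w3.eth.get_transaction_count(owner),
        })
        signed = w3.eth.account.sign_transaction(tx, private_key=key)
        return w3.eth.send_raw_transaction(signed.rawTransaction)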

1.3 Decentralized IDs and Decentralized Objects

The asset’s decentralized identifier (DID) is an identifier representing the asset’s decentralized digital identity. A DID Document (DDO) is a JSON document describing the DID’s information. It allows providers to include information relevant to the asset, and offers an additional field (additionalInformation) to accommodate any asset that needs extra fields in its description. Asset Modules use this additional field to remain inclusive of, and connectable with, any provider and platform.
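
For illustration, a DDO with an additionalInformation block might be constructed as below; the specific field names beyond "id" and "additionalInformation" are assumptions based on this description rather than a fixed schema:

    # Sketch of a DDO as a plain Python dict; most fields here are illustrative assumptions.
    import json

    ddo = {
        "id": "did:op:1234",                       # the asset's DID (placeholder value)
        "created": "2022-10-01T00:00:00Z",
        "service": [{"type": "metadata", "attributes": {"name": "my-dataset"}}],
        "additionalInformation": {                 # free-form fields for provider/platform specifics
            "assetType": "dataset",
            "path2hash": {"train/data.csv": "<cid placeholder>"},
        },
    }

    print(json.dumps(ddo, indent=2))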

1.4 multiURI: URI-agnostic file pointers

Assets can be stored under multiURI objects, which reference the location of a file regardless of its storage type. This was inspired by libp2p’s multiaddress format for handling multiple identity types. This allows files to be compatible with multiple decentralized storage protocols, including IPFS, Ceramic, Arweave, and Filecoin.

Figure 2: multiURI template (left) and example (right)
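
The actual multiURI template is given in Figure 2; as a hypothetical illustration of the idea (the "<protocol>://<identifier>" form below is an assumption, not the real format), a small parser could dispatch on the storage protocol encoded in the URI:

    # Hypothetical multiURI handling: split a URI-agnostic pointer into its protocol and
    # protocol-specific identifier. The prefix scheme is an assumed illustration only.
    from typing import Tuple

    SUPPORTED = {"ipfs", "arweave", "ceramic", "filecoin", "http", "https"}

    def parse_multiuri(uri: str) -> Tuple[str, str]:
        """Return (protocol, identifier) for a multiURI-style pointer."""
        protocol, _, identifier = uri.partition("://")
        if protocol not in SUPPORTED:
            raise ValueError(f"unsupported storage protocol: {protocol}")
        return protocol, identifier

    print(parse_multiuri("ipfs://examplecid"))  # ('ipfs', 'examplecid')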

Figure 3: Path2Hash example of a dataset being saved and mapped to its content address

1.5 Path2Hash Maps: Encoding and Decoding Any Folder

When uploading an asset folder, each file includes its own URI address; in the case of IPFS, this is its content identifier (CID). Assuming the file is stored locally (pinned) and the hosting peer is connected to the IPFS network, anyone can fetch the file knowing its CID. Each CID is paired with the file’s path relative to the folder root. This allows any recursive file structure to be represented as a flat map between relative paths and CIDs. This map is saved inside the asset’s metadata so the user can reconstruct the folder in its original structure.
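
A minimal sketch of the encode/decode idea, assuming hypothetical ipfs_add and ipfs_get helpers (thin wrappers around whichever IPFS client is used) rather than any specific client API:

    # Sketch of a Path2Hash map: relative paths -> content identifiers (CIDs).
    # ipfs_add / ipfs_get are assumed helpers, passed in by the caller.
    from pathlib import Path
    from typing import Callable, Dict

    def encode_folder(folder: str, ipfs_add: Callable[[Path], str]) -> Dict[str, str]:
        """Upload every file in a folder and return {relative_path: cid}."""
        root = Path(folder)
        return {
            str(path.relative_to(root)): ipfs_add(path)
            for path in root.rglob("*") if path.is_file()
        }

    def decode_folder(path2hash: Dict[str, str], target: str,
                      ipfs_get: Callable[[str], bytes]) -> None:
        """Rebuild the original folder structure from the Path2Hash map."""
        root = Path(target)
        for rel_path, cid in path2hash.items():
            out = root / rel_path
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_bytes(ipfs_get(cid))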

1.6 Dataset Assets

A dataset module consists of a configuration YAML file, a main file, and additional files (sample files). A dataset can consist of one or multiple splits (train, test, validation), and each split is saved as a separate asset folder. Datasets can be fragmented into several dataset shards, where each shard is a module folder nested inside the split folder. Sharding is useful for partially loading datasets during fast development. In addition, users can bundle several dataset shards into a mosaic of samples from different datasets. This mosaic can also be published as an asset, exhibiting the recursive and reusable nature of asset modules, and it lets users asynchronously load a batch of samples from multiple datasets with compatible schemas. The dataset’s DDO includes information such as tasks, categories, split information, and feature information (type, name, shape).
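
As an illustrative sketch (the split/shard folder names and the samples.jsonl file are assumptions based on the description above), shards could be loaded per split, or a few shards from several datasets could be combined into a mosaic:

    # Sketch of shard-wise loading for a dataset asset; the train/shard_* layout and the
    # one-JSON-lines-file-per-shard convention are assumed examples.
    import json
    from pathlib import Path
    from typing import Iterator, List

    def iter_shards(dataset_root: str, split: str = "train") -> Iterator[Path]:
        """Yield each shard folder nested inside the requested split."""
        for shard in sorted(Path(dataset_root, split).glob("shard_*")):
            yield shard

    def load_samples(shard: Path) -> List[dict]:
        """Load samples from one shard folder."""
        with open(shard / "samples.jsonl") as f:
            return [json.loads(line) for line in f]

    def load_mosaic(dataset_roots: List[str], shards_per_dataset: int = 1) -> List[dict]:
        """Combine a few shards from several datasets with compatible schemas."""
        samples: List[dict] = []
        for root in dataset_roots:
            for i, shard in enumerate(iter_shards(root)):
                if i >= shards_per_dataset:
                    break
                samples.extend(load_samples(shard))
        return samples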

1.7 Model Assets

The model asset is structured similarly to the dataset asset (config, module script, model parameters). If a model is too large, it can be fragmented into several files, each representing the parameters of a sub-module. In PyTorch, we can partition the nn.Module’s state dictionary into several fragments based on the module keys. This can lead to a future of dynamic inference, where modules are loaded from disk during forward/backward computation. The ability to break a model’s parameters into sub-modules lets developers swap model parameters in and out, enabling modular interoperability. It also theoretically allows large models to run inference on local computers, limited only by storage size (though inference speed may not be ideal). The model can then be loaded by the module script, which contains a decoder that reconstructs the parameters from the saved content hashes.

Figure 4: Examples of model module assets
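
A minimal PyTorch sketch of the fragmentation idea described above, grouping the state dictionary by top-level module key and saving one fragment file per sub-module (the file naming is illustrative):

    # Sketch: partition an nn.Module's state_dict by top-level module key, save one
    # fragment per sub-module, and reassemble the fragments before loading.
    from collections import defaultdict
    from pathlib import Path
    import torch
    import torch.nn as nn

    def save_fragments(model: nn.Module, out_dir: str) -> None:
        fragments = defaultdict(dict)
        for key, tensor in model.state_dict().items():
            top_level = key.split(".")[0]           # e.g. "encoder" from "encoder.0.weight"
            fragments[top_level][key] = tensor
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for name, frag in fragments.items():
            torch.save(frag, out / f"{name}.pt")    # one file per sub-module

    def load_fragments(model: nn.Module, in_dir: str) -> nn.Module:
        state_dict = {}
        for path in Path(in_dir).glob("*.pt"):
            state_dict.update(torch.load(path, map_location="cpu"))
        model.load_state_dict(state_dict)
        return model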

1.8 Asset Interoperability

Assets are designed for interoperability, allowing machine learning assets to be connected across different ecosystems. A simple example involves building an ML pipeline with a Hugging Face dataset (datasets library) and an off-the-shelf PyTorch model that isn’t compatible with the transformers library. Developers can also bundle assets, which is useful for representing full pipelines and ensemble models. Another fun example is having an asset represent an API token: developers can bundle access tokens for different APIs and sell the asset as an API bundle, letting developers use multiple APIs without having to pay for all of them.
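
As a hedged sketch of the first example, pairing a Hugging Face dataset with a plain PyTorch model that has nothing to do with the transformers library (the "imdb" dataset and the toy classifier are placeholders for illustration):

    # Sketch: a Hugging Face dataset feeding an off-the-shelf PyTorch model.
    import torch
    import torch.nn as nn
    from datasets import load_dataset
    from torch.utils.data import DataLoader

    dataset = load_dataset("imdb", split="train")           # Hugging Face datasets library

    class BagOfCharsClassifier(nn.Module):
        """Toy model: classifies a text from a fixed-size character-count vector."""
        def __init__(self, vocab_size: int = 128, num_classes: int = 2):
            super().__init__()
            self.fc = nn.Linear(vocab_size, num_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.fc(x)

    def collate(batch):
        vectors = torch.zeros(len(batch), 128)
        for i, sample in enumerate(batch):
            for ch in sample["text"][:1000]:
                vectors[i, min(ord(ch), 127)] += 1
        labels = torch.tensor([sample["label"] for sample in batch])
        return vectors, labels

    model = BagOfCharsClassifier()
    loader = DataLoader(dataset, batch_size=8, collate_fn=collate)
    features, labels = next(iter(loader))
    print(model(features).shape)                            # torch.Size([8, 2])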

1.9 Asset Dependencies

An asset module should be portable across multiple compute environments. This involves specifying additional dependencies, and may include a Dockerfile, a pip requirements file, a Helm chart, or other files. We believe the user should be able to choose which parts of the environment to synchronize in order to shorten configuration times. For instance, a user can skip building the asset’s Docker container if they only need a few Python packages. This lazy configuration promotes faster module builds, but increases the risk of inconsistent runtimes.
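
A hedged sketch of lazy configuration, assuming the asset folder may contain a requirements.txt and a Dockerfile and that the user chooses which layers to synchronize (the file names and flags are assumptions for illustration):

    # Sketch of lazy environment configuration: only synchronize the layers the user asks for.
    import subprocess
    from pathlib import Path

    def configure(asset_dir: str, install_pip: bool = True, build_docker: bool = False) -> None:
        root = Path(asset_dir)
        requirements = root / "requirements.txt"
        dockerfile = root / "Dockerfile"

        if install_pip and requirements.exists():
            # Fast path: just the Python packages.
            subprocess.run(["pip", "install", "-r", str(requirements)], check=True)

        if build_docker and dockerfile.exists():
            # Full path: reproducible but slower container build.
            subprocess.run(["docker", "build", "-t", root.name, str(root)], check=True)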

1.10 Asset Accessibility

Assets can be free to the public, private to a set of users, or sold for a fee. Asset modules are used to store machine learning tools in MetaHub, a decentralized machine learning tool aggregator built on Ocean Protocol. Free assets are stored on a testnet to avoid transaction fees, while paid assets are issued on the mainnet to secure value transactions.

1.11 Replicating Asset Registration Across Chains

Because free assets are stored on a testnet, they may not be permanently stored. To resolve this, off-chain workers can replicate asset DDOs across multiple testnets. Because this is expensive, we can allow users to back up their own assets using open-source off-chain synchronization modules, which listen for and mimic transactions across chains. This is possible using cross-chain interfacing modules that rely on the smart contract’s Application Binary Interface (ABI). This replication can also be done for mainnets, although it may be less useful since mainnets are designed to be stable on their own.
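
A rough sketch of such a synchronization worker with web3.py, under the assumption of a hypothetical AssetRegistered event on the source chain and a mirror registry contract on the target chain (all contract, event, and function names are placeholders, not the project's actual ABI):

    # Sketch (web3.py v6 style) of an off-chain worker that mirrors registration events
    # from a source testnet to a target testnet. Event and function names are assumed.
    import time
    from web3 import Web3

    def sync_registrations(source_contract, target_contract, w3_target, owner, key, poll_s=15):
        """Poll the source registry for new registrations and replay them on the target chain."""
        event_filter = source_contract.events.AssetRegistered.create_filter(fromBlock="latest")
        while True:
            for event in event_filter.get_new_entries():
                args = event["args"]
                tx = target_contract.functions.registerAsset(args["did"], args["ddoUri"]) \
                    .build_transaction({
                        "from": owner,
                        "nonce": w3_target.eth.get_transaction_count(owner),
                    })
                signed = w3_target.eth.account.sign_transaction(tx, private_key=key)
                w3_target.eth.send_raw_transaction(signed.rawTransaction)
            time.sleep(poll_s)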

1.12 Token Agnostic

The rise of digital tokens has also produced many tokenomic structures that are unfair to consumers. Subscribing to a specific token for a service locks you into that ecosystem and prevents interoperability with other, similar token assets. Asset modules are designed to be token agnostic, allowing providers to accept any token as payment. Providers can also restrict payment to their own token.


Grant Deliverables

  1. Asset Accessibility
  2. Asset Dependencies
  3. Asset Interoperability
  4. Model Assets
  5. Dataset Assets

Project Description

Asset Modules aims to be a standard for interconnecting machine learning assets across centralized and decentralized communities. This involves forming an interoperable, platform-agnostic schema such that any ML asset can easily integrate with decentralized storage protocols like IPFS and Arweave. We believe that such a standard is necessary for interoperability across ML assets and reduces the risk of vendor lock-in. We also hope that forming such a standard encourages developers to store, share, and connect assets using peer-to-peer networks. This provides developers with full autonomy over their assets, granting them the opportunity to share and monetize directly with consumers, without intermediaries.


Final Product

A more shareable machine learning ecosystem that provides developers with full autonomy over their creations, with the ultimate hope that developers can monetize their tools without intermediary platforms gating them from the consumer.


Value Add Criteria

Asset Modules strives toward a more shareable machine learning ecosystem that provides developers with full autonomy over their creations. We hope that developers can monetize their tools without intermediary platforms gating them from the consumer.


Core Team

Salvatore Vivona

Role: Software Engineer, AI engineer

Relevant Credentials:

LinkedIn: https://www.linkedin.com/in/salvivona

GitHub: salvivona, commune ai

Role: Software Engineer, ML Engineer, Web3 Engineer, Frontend Developer


Advisors


Funding Requested
3000


Minimum Funding Requested
3000


Wallet Address
0x9C73542592BB3534Fa0C847580043b3563D1161b