Posthuman Codex

yetti4 · October 5, 2021, 9:38pm

Project Name: Posthuman Codex

GitHub: PosthumanMarket/Posthuman.py at v2

Twitter : https://twitter.com/PosthumanNetwo1

In one sentence: Posthuman publishes trained AI models on Ocean Market, accessible only using Compute to Data. For this proposal, PH will train and publish an open replication of OpenAI Codex, which we are calling PH-Codex.

Proposal Wallet Address:

0x21e06646433954aabace8e3d93d502e423249299

Grant Amount: 50,000 USD

Project Summary

Posthuman makes AI models available as a service using the web3 technologies of Ocean Protocol. Model Curators pool funds into the datatokens of useful models, and Model Consumers purchase inference and evaluation on the models they find most useful.

Posthuman’s decentralised architecture achieves many goals that are impossible with centralised AI providers:

Decentralised Model Ownership: Each model is owned by the community of datatoken holders, which allows anyone to invest in and profit from useful AI models.
Permissionless Development: Fine-tuning advanced AI models is permissioned on Web2 APIs like OpenAI’s, and the fine-tuned models are owned by OpenAI and can be unilaterally deleted. In contrast, anyone can fine-tune one of the Posthuman Models on their own data, and the resulting model will also be community-owned.
Censorship-Resistant Access: Access to AI is fast becoming a basic necessity for a productive life, yet such access can easily be censored by centralised providers. With a decentralised alternative, any holder of crypto is guaranteed to be treated equally by the protocol.

Additional benefits include:

Verifiable Training and Inference: The end user can know for sure which model served a particular inference request
Zero-Knowledge fine-tuning: The marketplace controls the models, ensuring each person who contributed to the datatoken Pool is rewarded fairly, as all value created by these models remains on-chain and cannot be ‘leaked’.

In v2 [OCT 2021], we’ve trained two custom, commercially useful NLP models, and are now introducing them on Ocean Market (Polygon) Mainnet -

Model 1: AI Assistant as a service - A custom GPT-2 model trained on conversational data. It can be used to build and run conversational AI chatbots across fields, AI-based games such as adventure games, etc. Published on Ocean Market. [Model: https://market.oceanprotocol.com/asset/did:op:4Ff0c8049458C19E08e125D6536af8716be5Ffa8
Algorithm: https://market.oceanprotocol.com/asset/did:op:16915E68A8b427321c2117Cd4B4b80d280962027 ]

Model 2: Wikipedia QA as a service - A custom RoBERTa model + retriever pipeline trained on open-domain Wikipedia question answering. This model can answer questions drawing on the entirety of Wikipedia text. It can be used for research across fields, such as medical, historical, academic & scientific research. [publishing this week]

For further information on work completed, see our Documentation

Proposal - Posthuman Codex

With increased funding limits for veteran teams, we’re excited to undertake a much more ambitious and powerful project - replicating an open and permissionless version of OpenAI Codex, the AI coding assistant. Codex lets users write entire functions, scripts, interfaces and even games using only a few lines of natural language input, and no code. It is an incredibly useful tool in the hands of coders, however access to the models is currently walled-garden and heavily permissioned, keeping it from truly taking off as an everyday coding tool.

The reason it is impossible to compete with OpenAI Codex as a web2 API startup can be summed up in a word: trust. Nobody would trust an unknown provider’s API to serve the right model reliably. Posthuman AI on Ocean solves that problem using the classic web3 ‘don’t trust, verify’ paradigm. Users need not trust PH; the model DID etc. is stored on the blockchain and cannot be tampered with. If users run an evaluation script just once, they can be certain that the model will always give that performance.

Training Spec

We plan on using a large portion of the grant ($15-25k) towards GPU/TPU costs of training our AI model. The models will exclusively be made available on Ocean/PH Marketplace, channelling all usage revenue to Ocean as Data-Token Consumption value.

We will train starting at two model checkpoints:

PH-Codex-Medium - 345M parameters - this size is very useful for autocomplete and writing small functions. It is small enough to run on standard hardware and will be made available on Ocean Market.

PH-Codex-Jumbo - 6B parameters - this model size has shown a great ability to write code, comparable to the much larger OpenAI GPT-3. This model will be made available on custom hardware (Our fork of Market) as soon as it is ready. Meanwhile, training loss, metadata, and examples of inference will be shared.

PHC-M will be light enough to run on standard hardware and we will publish it right away on Ocean Market once trained. PHC-J will be published on our fork of the marketplace with GPU-based compute-to-data as soon as it is ready (~1mo). The models will be trained on the CodeNet dataset (see “Kickstarting AI for Code: Introducing IBM’s Project CodeNet”, IBM Research Blog).

We’ve picked the pretrained GPT-J checkpoint in particular because it has displayed great performance on writing code - very similar to the much larger GPT-3 model on which OpenAI Codex is based. For code samples written by untrained GPT-J, see “Fun and Dystopia With AI-Based Code Generation Using GPT-J-6B” on Max Woolf’s Blog.

ROI Calculation

This is a proposal-specific ROI calculation. Our general (1.2) ROI calculation for Posthuman AI models as a whole still applies, and can be found here.

Codex-like software is poised to shake up the multi-trillion-dollar software developer market. Even conservative estimates of Codex/Copilot revenues, based on initial interest, range from hundreds of millions to over a billion dollars.

A bottom-up calculation goes as follows - OpenAI charges roughly $1 per 10 lines for their most advanced AI model (GPT-3 Davinci), on which Codex is built. Even if they don’t increase the rate any further, this adds up really fast.

If 1000 coders each use PH-Codex to write 100k lines of code a year (~300 lines of code/day), that’s 100 million lines, or $10 million in revenues at $1 per 10 lines. The models will exclusively be made available on Ocean/PH Marketplace, channelling all usage revenue to Ocean as Data-Token Consumption value.

Since training complex AI models is a black box task with a fair bit of uncertainty, we will conservatively estimate our chance of success at 50% with present funding.

Bang = $10m * 50% = $5m

Buck = $50,000

ROI = 100
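
The arithmetic above can be reproduced with a short Python sketch. The function and parameter names are hypothetical and exist only for illustration; the input figures (1000 coders, 100k lines each per year, $1 per 10 lines, 50% chance of success, $50,000 grant) are the ones stated in this proposal.

```python
def proposal_roi(coders, lines_per_coder_per_year, usd_per_10_lines,
                 success_probability, grant_usd):
    """Return (revenue, bang, roi) for the bottom-up estimate.

    Hypothetical helper: it only restates the proposal's arithmetic.
    """
    # Total lines of code written by all coders in one year.
    total_lines = coders * lines_per_coder_per_year
    # Yearly revenue at the assumed price per 10 lines.
    revenue = total_lines / 10 * usd_per_10_lines
    # "Bang" is the revenue weighted by the estimated chance of success.
    bang = revenue * success_probability
    # ROI is bang divided by "buck" (the grant amount).
    roi = bang / grant_usd
    return revenue, bang, roi


revenue, bang, roi = proposal_roi(
    coders=1_000,
    lines_per_coder_per_year=100_000,
    usd_per_10_lines=1.0,
    success_probability=0.5,
    grant_usd=50_000,
)
print(f"Revenue: ${revenue:,.0f}")
print(f"Bang:    ${bang:,.0f}")
print(f"ROI:     {roi:.0f}")
```

Changing any single input (for example a lower number of coders or a lower price per line) shows how sensitive the ROI figure is to these assumptions.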

Deliverables

[] Train two sizes of PH Codex - PH-Codex-M and PH-Codex-J, based on the spec above.

[] Publish PH-Codex-M on Ocean (Polygon) Marketplace

[] Continue building own marketplace with hardware for PH-Codex-J (⅔)

[] Complete documentation of the models published so far.

Team members

Dhruv Mehrotra

Role: Core developer - Python, Solidity

Relevant Credentials:

GitHub: dhruvluci · GitHub

LinkedIn: https://www.linkedin.com/in/dhruv-mehrotra-luci/

Gitcoin: @dhruvluci | Gitcoin

Background/Experience:

Co-founder/CEO, LUCI [AI information retrieval for enterprise]
Patented first Legal AI to clear the Bar Exam [2019].
Invented Bayesian Answer Encoding, state-of-the-art in Open Domain QA in 2019.
Multiple hackathon winner and leading weekly earner, Gitcoin.

Hetal Kenaudekar

Role : Core developer - Solidity, JS, Frontend

GitHub : Aranyani01 · GitHub

LinkedIn : https://www.linkedin.com/in/hetal-kenaudekar-796715178/

Background/Experience:

Co-founder/COO, LUCI [AI information retrieval for enterprise]
Interface design, community engagement for various DeFi teams.
Solidity/JS/Frontend dev since early 2020, winner of multiple hackathons and grants.


realdatawhale · October 6, 2021, 9:14am

As with other proposals here, your application surely aims to add a tremendous value to the quality of Ocean Market Data / Compute. Whilst I’m not qualified to judge your work fully, I would’ve loved to see at least 1 deliverable tailored towards achieving the objective outlined in the ROI.

What are your thoughts?

Cheers,
DW

yetti4 · October 7, 2021, 6:20am

Hi realdatawhale,

Thanks for your feedback. We’re totally open to what you’re suggesting; however, I’m unsure how the ROI can be directly tied to a deliverable. We have 2 deliverables covering the publication of the 2 models on their respective hardware, and we see the usage and resultant ROI as the natural consequence of publishing very effective models.

The technical risk here is whether our models will be trained well enough to be heavily used. Our team is very experienced working with GPT-like models (right from GPT-1), and we’ve done extensive research to choose the best dataset, pretrained checkpoints, and hyperparams to achieve optimal performance, comparable to OpenAI Codex. We’ve still very conservatively estimated our success probability at 50%, though we’re much more confident of succeeding.

Having said all this, feel free to suggest a deliverable that you think better captures our attempt to train and publish such models on Ocean.

yetti4 · October 7, 2021, 6:23am

This paper examines the Codex training dataset & process in more detail - we will be using this data to fine-tune our model training process.

yetti4 · October 7, 2021, 6:30am

@realdatawhale Based on the context of your Discord chats, I suppose you were referring to a Deliverable focused directly on Onboarding customers. That makes sense. We are therefore adding the following 5th deliverable in addition to the 4 mentioned in the post:

[] Hackathon for devs using PH-Codex published on Ocean Market

Robin · October 7, 2021, 7:38am

Regarding your ROI calculation:
You are requesting 50.000$, which implies that you have already had 3+ previous grants funded, but your BUCK is 50.000$. In my opinion it should reflect all the funding and grants that you have received already.
I think that it is ambitious to say that 1000 coders will spend 10.000$ each in one year to use a tool but potentially the time savings they gain will outweigh the cost.

Regarding using Ocean marketplaces compute-to-data infrastructure for your product:
I remember that we had this discussion in a previous round already where you also wanted to run your NLP models exclusively on the Ocean marketplace provided by Ocean Protocol. From my understanding this infrastructure is meant for demonstration purposes and not to be used as free infrastructure for the whole ecosystem to sell algorithms. To use your own infrastructure you have to create your own Ocean Provider instance that is connecting the compute-to-data offer on the marketplace to your servers. But maybe that is what you mean by “Continue building own marketplace with hardware for PH-Codex-J (⅔)”.

yetti4 · October 8, 2021, 8:46am

Yes Robin, that is what we mean by “Continue building own marketplace with hardware for PH-Codex-J (⅔)”.

I see that we’re in agreement on the potential of the technology once deployed on our own marketplace, and perhaps integrated with an ongoing data growth process like DataUnion. That remains our goal and we’re about 2 months away from it at most. Hope to count on your support.

yetti4 · October 8, 2021, 4:34pm

Also, wrt ROI calculation, this is specific to the models we’re publishing this time. This is just the additional value to the general ROI calculated in our last proposal, which also still applies.

A generalised/full ROI calculation would therefore be as follows:

Buck: $100,000
Bang: $2.5m + $5m
Probability: 50%
ROI: $3.75m/100k = 37.5

Let me know if you have any other questions.

Trezor · October 10, 2021, 5:20pm

I like the overall direction but a request for 50,000 OCEAN after getting 4 previous grants averaging 14,000 OCEAN feels excessive. For this reason, I will vote no for this round and I hope that you re-submit next month for a more modest grant.