There is a saying often used to explain the madness of kings: "Absolute power corrupts absolutely." The cost of maintaining control pushes one to reason that one has a duty to extract every last bit of profit or utility from it. Rather than growing the kingdom and profiting from the established means of reward, a king may settle into the local minimum of exploiting those who give him power in order to secure his reward. Modern systems put absolute power in the hands of the person who owns a website, partly because the author of a website must also provide the computers that host it. This is both a blessing and a curse.
With power comes the ability to find novel ways to profit from the system outside the site's social contract. Even if the system only needs to save a few bits of information about each user to provide service, it may need to see much more about the user in order to create the inserted data or to render an interface. The user has no idea how much of their data is saved. If they consent to tracking, it tends to be a single opt-in that grants a blanket pardon for collecting all of this data.
On the negative side, power brings the ability to harm without retribution. Through misrepresentation or denial of service, the server can decide which users should be ostracized. Major social networks stand accused of selectively displaying certain kinds of information in order to engineer a worldview in their users. Users point at the identity representation the application holds and say "this is me," while having no final say over what that representation contains. This is much like a king's authority to edit society's stories to be more flattering.
All of this power comes at a cost. On the internet you may
be a king, but someone with a warrant can compel you to use this power in their
interests. The more popular your service becomes, the more likely it is that
attackers will find any vulnerability you have. The legacy of kings is one of
anxiety and bloodshed for a reason: nature abhors a power vacuum.
Golem: Trustless, Stateful P2P Application Backends
The problem of power centralization is caused by placing the
duty of data storage and manipulation in the hands of the maintainer-owned
server. The Internet at large has been trying to fix this through peer-to-peer
systems since the days of Napster. We’ve never quite gotten it right though.
The core issue is one of abuse and data integrity. If anybody can write to your database, it becomes impossible to ensure that a bad actor doesn't remove all of your records and destroy your system. An easy answer is to use encryption to ensure that an updater "owns" the data to be inserted. This works for messaging and file-sharing systems, but it is not sufficient for most others, because the data typically carries semantics that constrain its values. It is the application code's sequencing of updates that constrains data transitions.
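To make that concrete, here is a hedged sketch (the Account shape and names are illustrative, not part of Golem): a signature proves who wrote a row, but only application code can decide whether a transition is legal.

```typescript
// Illustrative only: ownership alone cannot enforce semantics. A signed
// write proves the author owns the row; it says nothing about whether the
// new value is a legal transition. That invariant lives in application code.
type Account = { owner: string; balance: number };

function applyWithdrawal(account: Account, amount: number): Account {
  // The constraint a raw signed write could silently violate:
  if (amount <= 0 || amount > account.balance) {
    throw new Error("invalid transition: balance must remain non-negative");
  }
  return { ...account, balance: account.balance - amount };
}
```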
If the data-manipulating application code is allowed to run on the client machines, it becomes increasingly difficult to ensure that the requests made by the client were actually formed by the application. Furthermore, clients hold a copy of the code to reverse-engineer and copy. This is the real motivation behind client-server web apps. But if the data-manipulating application code is run by the maintainer of the website, then the maintainer is in a position to be forced to eavesdrop on it or manipulate it.
What if somebody else ran the code responsible for
validating and altering the database? It’s not intuitive why this would improve
anything. This person would see the client’s intermediate sensitive data, could
use it to generate an invalid data alteration, can reverse-engineer the code,
and has less investment in correctness than a good user and a good maintainer.
The Golem project arose from a desire to counteract each of
these problems through knowledge limitation, intimate accounting of state
changes, pervasive rate-limiting, and cryptographically-validated
identities.
Architecture:
Standard Backend Dataflow:
1. Users query DNS to find a request-responding server behind an endpoint (usually through a load-balancer) and send the API request with its arguments to the work server.
2. This work server can contact other work servers in the cluster, as well as database servers. They can perform IO at any point during execution, though database manipulation typically happens within a transaction. These servers validate the input and use it to craft database updates.
3. These servers all manipulate a database, relying on indexes for performance.
4. The user has no access to the database.
5. The user gets back a response from their endpoint request.
6. A system periodically examines the database and builds structures from which the maintainer can query "business information."
Golem turns the MVC app inside out by making the client the
intermediary between the application servers and the database.
In Golem:
1. The user queries a cluster for the latest version of a document from the maintainer describing the endpoint structure.
2. The user interacts with the database before contacting the work server. They already know which data the endpoint needs to view, and they pass that data to the endpoint after fetching it. Subsequent queries can replay this fetched data.
3. The user looks at the endpoint structure to see which primitive values (Strings, Integers, etc.) must be seen at the same time, whether for string comparison, concatenation, predicates, or the like. If the data needs to be seen in multiple places, each place counts as a separate observation point. The user counts the observation points and chooses a server for each; no server appears twice, and the client maximizes the geographic distribution of requests. The user then uses the public key in the published document to sign a document containing a decryption key and an encryption key, and uses the server's decryption key to encrypt that server's observation point. This document frees the work server from tracking keys and removes the need for an unbounded number of negotiation requests.
4. The user sends the request document and database information to a server not yet chosen. This server can see data structures tracking encrypted mini-blobs of data.
5. Work servers validate that the fetched data was inserted before the last database synchronization point by validating the location of its hash in the global Merkle tree (a sketch of this check follows the list).
6. Work servers send fragments of the input to the servers chosen to service each function, and get a response from them. From the responses, the return value is crafted. Database updates are threaded through in a single transaction.
7. The client examines the database updates to see whether the work servers stored information in them that the client is not comfortable with. If the client consents to the data storage, they push the updates to the keys they control, along with the verification that the data followed a valid path through the program.
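To make step 5 concrete, below is a minimal sketch of a Merkle inclusion check, assuming SHA-256 and a proof shaped as side-tagged sibling hashes; both the hash choice and the proof format are assumptions for illustration.

```typescript
import { createHash } from "crypto";

const sha256 = (data: Buffer): Buffer =>
  createHash("sha256").update(data).digest();

// A proof is the list of sibling hashes from the leaf up to the root,
// each tagged with whether the sibling sits on the left.
interface MerkleProof {
  siblings: { hash: Buffer; left: boolean }[];
}

// Returns true iff the leaf hashes up to the published root, i.e. the
// fetched data was committed before the last synchronization point.
function verifyInclusion(leaf: Buffer, proof: MerkleProof, root: Buffer): boolean {
  let node = sha256(leaf);
  for (const { hash, left } of proof.siblings) {
    node = left
      ? sha256(Buffer.concat([hash, node]))
      : sha256(Buffer.concat([node, hash]));
  }
  return node.equals(root);
}
```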
The above summary sped through a lot of implementation
details, but gives the reader an idea of how the data flows through the system
and some of the logic behind the architecture. We will turn to the data
structures of the system next, as they dictate the operations of the system.
Data Structures:
The database model that we present is based on the Kappa
Architecture (http://milinda.pathirage.org/kappa-architecture.com/).
The Kappa Architecture is a view of databases which arose
from the rise of the distributed log. In distributed systems such as databases,
we find that a sane cache eviction policy and cluster consistency are difficult
to extract from an existing database. As more and more clusters turn to the use
of a distributed log for synchronization, its application to databases becomes
more attractive. The idea behind the Kappa Architecture is to insert all
modifications into a shared log and to create a database by traversing the log
from start to finish and executing updates. In this way, the database is a
temporary cache of your real data store. Updates can be dropped and the index
can be rebuilt, giving your database a lot of flexibility.
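A minimal sketch of the idea, assuming a toy last-write-wins key-value index; the Update shape is an assumption for illustration.

```typescript
// The index is a disposable cache: fold over the append-only log from
// start to finish and the database falls out. Drop it, replay, rebuild.
type Update = { key: string; value: string };

function rebuildIndex(log: Update[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const update of log) {
    index.set(update.key, update.value); // last write wins
  }
  return index;
}
```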
Right now, let’s focus on the actual structure of the
database. As we said above, the database is a log. Queries are made from the “index”
structures which are created by traversing the update log. This means that all
we have to do is to offer a distributed log, right? Not so fast. We don’t want
everybody appending to the same log, as this allows clients to interfere with
one another. Each client has one log per piece of relevant data, and the
program maintainer is responsible for providing snapshots and for creating the
indices that the rest of the cluster uses.
Our append-only logs are quite simple. We chose to use the Bitcoin blockchain and the BitTorrent swarm as the database. A user can create a transaction including an OP_RETURN output, which allows them to embed information in the blockchain. This gives updates a nice global linear ordering. The embedded information includes the hash of a torrent containing the insertion commands to enter into the log.
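As a sketch of the embedding: an OP_RETURN output script pairs the opcode (0x6a) with a direct push of the 20-byte BitTorrent infohash. Building and broadcasting the enclosing transaction is left to a wallet library; only the output script is shown, as an illustration.

```typescript
const OP_RETURN = 0x6a;

// Pushes of 1-75 bytes use the byte count itself as the push opcode,
// so the script is: OP_RETURN, 0x14, <20-byte infohash>.
function opReturnScript(infohash: Buffer): Buffer {
  if (infohash.length !== 20) {
    throw new Error("expected a 20-byte BitTorrent infohash");
  }
  return Buffer.concat([Buffer.from([OP_RETURN, infohash.length]), infohash]);
}
```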
Finding the newest version of a value at a key then becomes
finding the last transaction from a given bitcoin address, and fetching the
torrent. This has latency, but clients should never have to query their stream
directly. Instead, the Supervisor-made indices exist for primary database
querying.
The question remains: how do we know that the data in our database is consistent? Can we afford to trust anyone? By using what I've termed a "prime stamp," the supervisor of the system can ensure that transactions only occur along the paths that the codebase itself would take.
Prime stamps work because modulo is cheap but prime factorization is hard. Every transaction-verification window, all updates in the database are given a large prime as a tag. When a server adds an update to the set the client will insert into the database, it multiplies a "state" number by the prime it is responsible for, and signs the number with its private key.
This state is threaded through all requests. The client picks a random number and sends it to the endpoint. From there on out, the servers thread this random number through and multiply it by large primes. The client puts this accumulated stamp into a transaction block in a torrent that is pushed to each bitcoin "stream" whose key is modified by the transaction.
If a transaction stamp divides the accumulation, then the request was crafted by the servers with very high probability. Crucially, these stamps are only published after a time-period snapshot is taken. That means an attacker would need to do statistical profiling across many updates made by clients hitting the same endpoint, and find the GCD of those primes. Moreover, servers performing the same update will not necessarily carry the same prime tag; duplication and versioning allow for a lot of code-path obfuscation. The attacker then needs to try the permutations of the primes found.
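A minimal sketch of the mechanism, using toy primes far smaller than production-size ones; the function names are illustrative.

```typescript
// Each server multiplies the threaded state by the prime assigned to the
// mutation site it executed, so the state accumulates the code path taken.
function stampState(state: bigint, sitePrime: bigint): bigint {
  return state * sitePrime;
}

// A request is plausible iff the expected path's stamp divides the
// accumulated state; the client's random seed survives as the quotient.
function stampDivides(accumulated: bigint, pathStamp: bigint): boolean {
  return accumulated % pathStamp === 0n;
}

// Example: client seed 12345n threaded through sites tagged 101n and 467n.
const accumulated = stampState(stampState(12345n, 101n), 467n);
console.log(stampDivides(accumulated, 101n * 467n)); // true
console.log(stampDivides(accumulated, 103n * 467n)); // false: wrong path
```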
How many times can an attacker get feedback on whether a prime stamp is right? They can only get it wrong once. And when they do, they lose the proof-of-work necessary for a server to get an identity and to be trusted with code. It's also worth noting that this factoring and guessing must be done within a single time cycle. Even at a very small scale, the problem is infeasible when the cycle time is under half an hour or so.
Protocol
Terminology
Blockchain – The Bitcoin blockchain
Supervisor – The agent running the service; foots the servers' costs
Server – The agent hosting code for one or more Applications
Client – The agent being serviced by the Supervisor's application code, which runs on numerous Servers
Prime stamp – A product of primes representing a database transaction
Supervisor Bootstraps Application
1. Supervisor generates a master public/private keypair and a secondary public/private keypair. They use the master keypair to sign a certificate chain for the secondary keypair. This allows the supervisor to place the secondary keypair on a quasi-untrusted server, and to use the master keypair to revoke or grant this and future secondary keys.
2. Supervisor takes the program description (described below) and posts a job offer document to the blockchain containing a type tag and a torrent hash of the offer. The offer contains specifics like the pay rate per request, the minimum number of servers needed to start serving, the BitTorrent tracker, and the expected client and server proof-of-work difficulties (to prevent denial of service). An illustrative offer document follows this list.
3. Supervisor waits until a sufficient number of servers have agreed to participate in the same half-hour time window.
4. Supervisor repeats the "Tick" action until the number of servers drops too low.
5. If the Supervisor wishes to update the code, they post a new job bid which references the previous one and wait until the next "Tick" for things to propagate.
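An illustrative job offer document, covering the fields named in step 2; every field name here is an assumption rather than a fixed wire format.

```typescript
// A hypothetical offer document, posted as a torrent whose hash goes on
// the blockchain next to a type tag. Field names are illustrative.
const jobOffer = {
  type: "golem/job-offer",
  payRatePerRequest: 0.00001,      // BTC paid per serviced request
  minServers: 12,                  // minimum cluster size before serving
  tracker: "udp://tracker.example.org:6969",
  clientPowDifficulty: 20,         // leading zero bits demanded of clients
  serverPowDifficulty: 24,         // leading zero bits demanded of servers
  secondaryPublicKey: "<base64-encoded key>",
  certificateChain: ["<master-signed certificate for the secondary key>"],
};
```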
Server Joins Application
1. Server sees the job offer document type on the blockchain and fetches the torrent.
2. Server decides to accept the job.
3. Server does proof-of-work to create a document accepting the work. This is their identity certificate. It includes a public key and an available hostname and port on which to reach them. (A hashcash-style sketch of this step follows the list.)
4. Server posts the document to a torrent and puts the hash in the blockchain. Server sends the Supervisor the smallest possible transaction.
5. Server and Supervisor begin a micro-transaction.
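A hashcash-style sketch of step 3, assuming difficulty is expressed as leading zero bits of a SHA-256 digest; the encoding is an assumption for illustration.

```typescript
import { createHash } from "crypto";

// Count the leading zero bits of a digest.
function leadingZeroBits(hash: Buffer): number {
  let bits = 0;
  for (const byte of hash) {
    if (byte === 0) {
      bits += 8;
      continue;
    }
    bits += Math.clz32(byte) - 24; // zero bits within the first non-zero byte
    break;
  }
  return bits;
}

// Grind a nonce until the acceptance document meets the offer's difficulty.
// The winning nonce is embedded in the identity certificate.
function mineAcceptance(acceptanceDoc: string, difficulty: number): number {
  for (let nonce = 0; ; nonce++) {
    const digest = createHash("sha256").update(acceptanceDoc + nonce).digest();
    if (leadingZeroBits(digest) >= difficulty) return nonce;
  }
}
```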
Client Reserves Database Key
1. Client picks a new bitcoin address and pushes a commit with an initial value for the data held there.
2. The client then refers to this new bitcoin address in a transaction from the address used to register the Client. This includes the type tag of the data stored here. Datatypes must be types listed in the endpoint specification.
3. The client's public key is the only key that can write to it.
Client Makes Endpoint Request
1. Client uses the Tick document, examining the endpoint request to find out the arguments to pass to the endpoint and how many different servers are necessary to service it.
2. If the endpoint needs a database query to work on, Client uses the Supervisor's index to find the data necessary to service it.
3. Client uses the Tick document to find a server for each entry-point hash in the document. For each server, the client generates two symmetric keys and encrypts them both with the public key of the server. They then encrypt the data that only this server can see. This removes the need for key negotiation in the middle of servicing a request. (A sketch of this key wrapping follows the list.)
4. Client picks a "top-level" server with the endpoint's hash, one which won't be able to see any primitives, and encrypts the entire message with the public key of this server.
5. Client sends the entire payload to the "top-level" server and waits for a response.
6. Client vets the server updates, checking for information leakage.
7. Client inserts updates into keys by posting a torrent of the diff of updates and inserting its hash into the blockchain.
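A minimal sketch of step 3's key wrapping, assuming RSA server keys and AES-256-GCM for the observation-point payload; both algorithm choices and the helper name are assumptions.

```typescript
import { randomBytes, publicEncrypt, createCipheriv } from "crypto";

// Wrap two fresh symmetric keys under the server's public key, then
// encrypt the observation point under the first of them. No online key
// negotiation is needed: everything rides along with the request.
function wrapForServer(serverPublicKeyPem: string, observationPoint: Buffer) {
  const dataKey = randomBytes(32);  // encrypts this server's fragment
  const replyKey = randomBytes(32); // protects the server's response
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", dataKey, iv);
  const ciphertext = Buffer.concat([
    cipher.update(observationPoint),
    cipher.final(),
  ]);
  return {
    wrappedKeys: publicEncrypt(serverPublicKeyPem, Buffer.concat([dataKey, replyKey])),
    iv,
    authTag: cipher.getAuthTag(),
    ciphertext,
  };
}
```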
Supervisor Tick
The tick contains data for the last cycle as well as for the next 3 "tick" cycles. We keep the 3-cycle delay because it gives the servers time to get the next cycle loaded, and it's cheap to offer the buffer here. It's worth noting that since this is a torrent file, servers and clients can download individual files out of it. For this reason, many of the fields below are separate files in the torrent.
Summary Portion:
1. Examine the blockchain for update hashes pushed by tracked client addresses.
2. Concatenate these blocks together and embed them into the document.
3. Rebuild indexes from data changes and embed them into the document. This is important because while the logs would be O(N) to traverse on every read, we can create many cached index structures to amortize update costs.
4. Post the collection of transaction stamps valid during this time period. It's vital to post this while hashing the current values, or clients could retroactively use the stamps.
5. Receive server logs, and pay servers via micro-transaction for work done.
Next Portion:
1. Check that the minimum number of servers posted in the contract have been offered. If not, sleep for a time cycle.
2. Check if clients have sent any transactions signed by servers that are not accepted transactions. Use the known primes for that cycle to check the factorization of the transaction, O(N) in the number of mutations; this can be done with SIMD. Use the signature order to find the first server that didn't add the prime it should have, and eject it from the cluster by placing a "tombstone" in the blockchain for its previous server proof-of-work. (A sketch of this audit follows below.)
3. Check the gossip documents sent, and see if servers have been below their promised latency for a fraction of the last few cycles. If so, cancel the contract. Require a number of cycles before giving them another chance, with exponential backoff on chances.
4. Find and shuffle primes for each database mutation call site.
5. For each acceptable transaction control flow, compute the prime stamp with the new primes.
6. Shuffle and probabilistically duplicate and version endpoint functions, and distribute them to nodes.
7. Encrypt each for its server and tag the code with the "tick" time at which it runs (3 ticks in the future). Each endpoint's code is a separate file in the torrent.
8. Append the schema with documentation for each endpoint message, including the primitive encryption that has to be done for each endpoint. In this way, the API is self-documenting.
Lastly, sign this document, push it to BitTorrent, and put the hash on the blockchain.
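A minimal sketch of step 2's audit, assuming each signature in the chain names the prime its server was responsible for; the shapes are illustrative.

```typescript
interface SignedStep {
  serverId: string;
  prime: bigint; // the prime this server should have multiplied in
}

// Divide out each claimed prime in signature order. The first server whose
// prime fails to divide the remaining state is the one to tombstone.
function auditStamp(accumulated: bigint, steps: SignedStep[]): string | null {
  let state = accumulated;
  for (const step of steps) {
    if (state % step.prime !== 0n) return step.serverId; // eject this server
    state /= step.prime;
  }
  return null; // all primes accounted for; the remainder is the client seed
}
```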
Server Response to Tick
1. Replace the last set of code and prime keys with the new set.
2. Fetch the next 3-offset set of code and keys and prepare it.
3. Encrypt the list of client interaction receipts with the Supervisor's public key and place it in a torrent. Send the hash of the torrent to the Supervisor in exchange for a micro-transaction update to the new owed balance.
Client Response to Tick
1. Validate that no commits made were poisoned by servers. If any were, notify the Supervisor. This can be done lazily, on the next query.
Attacks:
There are not many attack options available to any actors
here. All would amount to a punishable violation of trust with no lasting
impact beyond loss of peers.
If the client tries to insert data into the database without using the application to create the transactions, then the servers will not have signed the transactions and created the stamp. This kind of bad behavior is easy to spot, and the client's identity within the system is destroyed. Proof-of-work makes this attack infeasible; the attacker will do much more work than anybody else.
If a server returns an incorrect set of transactions, we
know. If a server wants to eavesdrop on users, they don’t have access to enough
information to learn more than a small set of primitives. Furthermore, the fact
that the client chooses the servers means that an attacker would have to
control most of the network to see more than one chunk of primitives.
Denial of service is countered by the pervasive rate limiting that proof-of-work and the blockchain bring.
Lastly, what if a server knew the transaction stamps and forged a bad transaction? What would that entail? They can't use the already-released transaction stamps, as those are out of date. They'd need to control enough servers to understand all of the control flow and see all of the tags. The ability of clients to enforce geographic distribution means that seizing enough computers quickly enough is infeasible.
Code Specification:
In order for our system to be usable, a specific interface
has to be presented to the application developer. If we allowed for arbitrary
looping, it would be possible for functions to route to each other such that
the cluster would loop indefinitely. Rather than take on the halting problem
through heuristics, a more principled approach is to ban unrestricted looping
entirely.
Wait, you say, how can I do useful work without looping? While we are reducing the number of total applications that can be written in theory, we are not reducing the number of useful applications in practice. The application servers shouldn't be making IO requests to the outside world; that would be a design that makes subverting the Golem system trivial. Instead, the IO is carried out by the clients, and the database values and request data are sent to the servers.
The servers are a purely functional mapping of finite, pre-parsed data structures into finite data structures. There is no need to introduce arbitrarily diverging computation into the system. Instead, we expose the primitives of map, filter, and fold while banning iteration keywords and both direct and mutual recursion. The functional programming idea of a zipper extends these to trees, and with trees we gain key-value stores. The primitives are therefore sufficiently flexible for reasonable programs, and they allow us to guarantee that every endpoint terminates.
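A hedged sketch of what endpoint logic looks like under these restrictions; the invoice example is an illustration, not part of Golem. Because map, filter, and fold each consume a finite structure exactly once, termination is structural rather than something to prove.

```typescript
type OrderLine = { item: string; quantity: number; unitPrice: number };

// Only structured recursion: filter, map, and a fold (reduce). No loops,
// no recursion, so this endpoint provably terminates on any finite input.
const invoiceTotal = (lines: OrderLine[]): number =>
  lines
    .filter((line) => line.quantity > 0)
    .map((line) => line.quantity * line.unitPrice)
    .reduce((sum, amount) => sum + amount, 0);
```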
Rather than writing a new language myself, we will lay out a design for an ORM-like library which can build these programs up. By using opaque types for "Golem values," functions can build up programs by working with values as if they were database objects. The program written by the developer is actually run once, on the developer's machine. There is nothing hiding in unexecuted conditionals: all Golem conditional statements are executed, because they are essentially code-generation functions. This program outputs a JSON document specifying the Golem program created. The time daemon will accept this program and will carry out the supervisor's roles.
Consider a function that receives endpoint requests. The argument to this request is a JSON object which contains primitive (String, Integer, etc.) values and nested JSON dictionaries. The program that the developer writes registers these endpoint functions. In this way, the request object is "injected" into the function. We exploit this inversion of control to pass a JSON object which has the same structure as the expected format, but carries "opaque types" in place of the primitives.
In order to see the value in an opaque type, you must pass it a callback. This is where the program jumps to another server. As Javascript is well accustomed to callback-driven workflows, there are many libraries offering syntactic sugar here. These callbacks build a value that is itself of an opaque type. The return value will look like the value returned to the API consumer, but something far weirder is actually happening. Each opaque type is really a reference to the end of a linked list of updates to the initial value it holds. As the opaque types combine, the lists simply interleave. The return value of the endpoint function is thus a dataflow log of how the output is derived from the input.
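A minimal sketch of the trick, with every name an illustrative assumption: each value is a handle onto a growing operation log, so running the developer's program once records the dataflow instead of computing a result.

```typescript
type Op = { op: string; args: string[]; out: string };

class Opaque {
  constructor(public readonly id: string, public readonly log: Op[]) {}

  // Combining two opaque values interleaves their histories and appends
  // the new step; no real string work happens at "run" time.
  concat(other: Opaque): Opaque {
    const out = `v${this.log.length + other.log.length}`;
    const log = [
      ...this.log,
      ...other.log,
      { op: "concat", args: [this.id, other.id], out },
    ];
    return new Opaque(out, log);
  }
}

// "Running" the endpoint once on opaque arguments yields the dataflow log,
// which is what actually gets shipped to the servers.
const greeting = new Opaque("arg0", []).concat(new Opaque("arg1", []));
console.log(greeting.log); // [{ op: "concat", args: ["arg0", "arg1"], out: "v0" }]
```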
At points, the functions will make database queries. These queries are executed at "compile time," if you recall. We can therefore track messages sent to the database library and build up an ordering of queries to insert into every update log. This visibility allows us to "hoist" all queries to the top of an entry-point block. Updates append to the update log, which is returned to the client.
In the case where queries require generated values, we will notify the developer at compile time that we had to split the function. This does mean more state visibility, but that is only a problem when the application wants to hide its database access, a privilege that is an anti-pattern we've explicitly designed out of Golem.
After arriving at a dataflow representation of the program, the wire format becomes a series of calls and statements to execute in order to yield the arguments for the eventual data constructor returned to the person hitting the endpoint, along with the update log.
Previous Work
The core of the idea comes from onion routing: nobody can see enough of the global state to do harm, and by the time the information is obtainable, it is too late to be useful. The fact that obfuscation can truly work when you have safety in numbers led to the architecture of the application server cluster.
I was also inspired by https://github.com/elendirx/web2web. Its method of using the blockchain as a carrier for an individual stream of updates, and using that stream to serve BitTorrent hashes, proved instrumental to the ability to host coordination documents in a verifiable way.
Implementation Difficulty
I hope to see an implementation of this architecture done by
the end of this semester. Thankfully, not much really has to be written.
The infrastructure for Server execution is really limited to an interpreter of map, filter, and fold structured recursion, as well as pattern matching on data constructors and creating them. This would be an undertaking for a new language, but since this is a high-level library we manage to elide most of the issues. Something that would be fun to play with one day is using TensorFlow as the execution framework. TensorFlow is a dataflow execution framework and offers both the needed primitives and the ability to dynamically optimize around actual control flow. A JIT which specializes on executed Golem functions would definitely help with the costs of interpretation, as endpoints have an enforced structure.
The Client work is really just a combination of off-the-shelf parts. Web2web shows that webtorrent use and blockchain scraping are fairly trivial to implement with the existing libraries. The only thing that remains is to create a wrapper around the database fetching and the server selection for the query. This logic is fairly uniform, so there shouldn't be much cyclomatic complexity here.
The Supervisor daemon will be the largest amount of labor, but it is limited to summary document creation, the prime-stamp logic, seeding of the torrents necessary to the application, and code transformation. The code transformation should be fairly easy to work with, since we are really just creating a logging library that looks like an ORM.
Conclusion
I believe that Golem represents a fairly significant contribution
to the problem of trust in website hosting and administration. By treating an
application as a conversation between the client and the code author, we are able
to use the servers as an untrusted third party. The code author defines the
legal application states, and the swarm of clients and servers work to ensure
that the database only contains data generated by valid transactions. The
history of database transactions is used to create a performant, indexed data
structure for clients to read from.
It's worthwhile to consider the multitude of recent crimes and coercions that could not have occurred in this system. Hackers have very little to hack. If they hack a server, all they can do is get the server ejected from the cluster. If they hack a client, all they can do is impersonate the client, an attack that is outside of our scope. If they hack a supervisor, the supervisor can revoke the secondary keypair and use the master key to re-assert a valid index structure and cluster code distribution. The only program state that really exists is the append-only log of commits from the clients. Everything else is a constructed view of the data that can be re-created when necessary. State can't be lost, and there is no state to really steal, since the non-persistent data is distributed throughout the entire cluster.
By trusting nothing beyond the difficulty of prime
factorization, we have created a system in which trust cannot be stolen or
abused to do lasting harm. That is an Internet that I’d like to be a citizen
of.