Gossipsub Tests

Hi everybody!

We’ve been heads down redesigning our methodology, topology, and test client for our gossipsub testing efforts in collaboration with Adam Hanna from Protocol Labs.

The specific utilities and details of these tests are provided within this repository. As we continue to finalize the test definitions and get ready to run the tests, I’m starting this thread to solicit any feedback from the community. If you have the time and interest, please check out the repo and provide us with any relevant feedback in this thread so we can start an open discussion!

Thanks!


Great work on the gossipsub testing repo. In the tree topology configuration, you say that a node will be connected to a maximum of 4 nodes. Will the number of peers be selected randomly? The number of peers might have a big effect on the message redundancy. I’d suggest testing multiple tree topologies with different numbers of peers per node, and maybe a totally random one.

Looks good. Perhaps it would be useful to test with and without message signing. In Eth2 we will not be using message signing.

Also, is there an easy way to test the Rust implementation? What specifically would I need to build to fit it into these tests, or is only the Go implementation compatible at the moment?


We, Raul K (@raulvk, Protocol Labs, libp2p) and I (@protolambda, EF Eth2 research), look forward to the gossipsub testing and appreciate the work so far, but we have some concerns about the test setup, primarily the network topology used for testing. Here is our feedback:


Whiteblock gossipsub tests feedback

Since gossipsub is being tested not just for small, organized networks but for a random, large-scale, decentralized network like Eth2, the topology of the test networks is a big concern.

We feel that the topology properties where gossip makes a difference, such as graph components with cycles and routes with different constraints, are largely ignored.

Some libp2p performance and gossipsub properties may still be derived from basic topologies, but we simply cannot rely on these alone for the full range of Eth2 gossipsub use cases.

Linear topology: The value of this test setup is not clear. It is unrealistic that nodes in a real p2p network will daisy-chain with one another, so this test does not mimic a setup we are likely to encounter in production. Unfortunately, all test series except Series 10 use this topology, so they will produce mostly meaningless results.

It’s also unclear how measurements would be made, since the test executor selects message producers randomly: the observations will differ entirely with each message as it propagates toward the edges.

Tree structure: Similar to the older Whiteblock tree topology referenced here, it limits messages to a single propagation path, completely diminishing the effects of gossip and placing unrealistic stress on particular nodes.

Fully connected: Again, the effects of gossip are diminished because every node is only one hop away from any information. Although the difference between mesh and fanout in gossipsub should affect this somewhat, it is mostly unrealistic.

We would like to start a discussion on how to improve the topology and testing before resources are wasted on running these existing topologies.

A simple start would be a network with each peer randomly connected to N other peers (counting connections in both directions, i.e. without over-connecting the initial nodes during generation). We are aware there are concerns with this too (randomness being the primary one), and we would like to work toward a solution rather than resort to oversimplified topologies. Gossipsub is a large effort, and there are definitely ways to formulate a good testing topology with joint network expertise.
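For concreteness, a minimal sketch of the kind of generator we mean (hypothetical Go; the degree cap and seed are placeholders, not anything from the Whiteblock framework): shuffle all candidate pairs, accept an edge only while both endpoints are below the cap, and check connectivity afterwards.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomTopology returns an adjacency list for `nodes` peers, each with
// at most `degree` undirected links. Drawing edges from a shuffled list
// of all candidate pairs avoids over-connecting the earliest nodes.
func randomTopology(nodes, degree int, seed int64) [][]int {
	rng := rand.New(rand.NewSource(seed)) // fixed seed => reproducible topology

	type edge struct{ u, v int }
	var candidates []edge
	for u := 0; u < nodes; u++ {
		for v := u + 1; v < nodes; v++ {
			candidates = append(candidates, edge{u, v})
		}
	}
	rng.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})

	adj := make([][]int, nodes)
	deg := make([]int, nodes)
	for _, e := range candidates {
		if deg[e.u] < degree && deg[e.v] < degree {
			adj[e.u] = append(adj[e.u], e.v) // record the link in
			adj[e.v] = append(adj[e.v], e.u) // both directions
			deg[e.u]++
			deg[e.v]++
		}
	}
	return adj
}

// connected reports whether the graph forms a single component,
// via an iterative depth-first search from node 0.
func connected(adj [][]int) bool {
	seen := make([]bool, len(adj))
	stack := []int{0}
	seen[0] = true
	visited := 1
	for len(stack) > 0 {
		u := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, v := range adj[u] {
			if !seen[v] {
				seen[v] = true
				visited++
				stack = append(stack, v)
			}
		}
	}
	return visited == len(adj)
}

func main() {
	adj := randomTopology(100, 6, 42) // 100 peers, N=6, fixed seed
	fmt.Println("connected:", connected(adj))
}
```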

  • Proto, Raul

Where will the nodes be hosted?

On one box? Multiple regions? I think that would be an interesting metric, as we already see that a lot of Eth1 nodes live in the cloud.

I don’t understand the value of introducing randomness in the topology. What would be the point? What would be the metric of measurement or criterion of evaluation? The effects of peer count on message propagation? Maybe I’m misunderstanding, but can you please elaborate?

In my opinion, any randomness in the topology would result in a non-deterministic dataset, rendering the results invalid and the tests themselves non-repeatable. It would make much more sense to make these observations in a topology where a node is connected to a fixed number of peers. By incrementally increasing the peer count within each test series, you can then cross-reference the resulting datasets to identify any correlations between performance and behavior. But again, what are we trying to observe here?

We’re running the tests on the Genesis platform.

Thank you everyone for providing feedback on our methodology. We appreciate the time you put in.

@cemozer The nodes selected during generation of the tree topology will be chosen randomly, but each node will be limited to 4 peers. We will fix the seed and post the exact topology for each test. We also intend to add a topology with varying degrees per node to represent a more realistic network. Please see our response to @protolambda below.

@AgeManning We’ll investigate whether we can enable signing with the current host implementation. But if Eth2 will not be using signing, would it be useful to test it? Right now, our test framework is specific to the Go host implementation. There has been some interest in comparing the Go and Rust implementations, and this could potentially be added to the next phase of this research effort.

@raulk @protolambda

Linear topology: We didn’t make it clear in the readme, but the value of this topology is to test the correctness of libp2p and our testing framework. For example, we found high CPU usage in the initial tests, and this topology is simpler and more deterministic, making such CPU usage issues easier to diagnose. You can think of it as an easy toy case to check whether our test setup (and gossipsub) is correct. Additionally, a linear topology tests a high-diameter network under high network traffic. However, you make a good point that it’s unnecessary to use this topology in all tests. Instead, we can limit the linear topology to Series 1: Control, Series 8: Message Size, and Series 9: Messages Sent (these can be our initial tests as well), and switch the other test series to another topology (see the next discussion points).

Tree Topology: A tree is indeed an oversimplification without cycles, but its value is that it is more deterministic. A series of tests on a tree topology allows for reproducible results and makes it easier to analyze potential weaknesses of the gossipsub protocol. We believe this topology, while unrealistic, is still a valuable test.

Random Topology (New): The largest concern with generating a random topology is that running a series of tests on only a few random topologies may not produce statistically significant results. One run could behave like a fully connected network while another random topology produces results similar to a tree topology, and we would have nothing to compare each set of results against. Capturing gossipsub’s performance across random topologies would require testing a very large set of them, which brings resource challenges: running enough tests requires a much longer time frame as well as significantly higher cloud costs. We could, however, generate a pseudorandom topology using the Barabási-Albert model (known for its approximation of the Internet) with an arbitrary parameter until it exhibits cycles and connectedness (not all random topologies will be connected, and dropping generated topologies can skew statistical results), and use that for testing.
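For illustration, a rough sketch of the Barabási-Albert generation step (hypothetical Go; m, the number of edges each new node adds, is the arbitrary parameter mentioned above). Note that when grown from a connected seed clique, this construction is connected by design, since every new node attaches to the existing component.

```go
package main

import (
	"fmt"
	"math/rand"
)

// barabasiAlbert grows a preferential-attachment graph: each new node
// links to m existing nodes with probability proportional to degree.
func barabasiAlbert(nodes, m int, seed int64) [][]int {
	rng := rand.New(rand.NewSource(seed))
	adj := make([][]int, nodes)

	// `targets` holds every node once per incident edge, so sampling
	// uniformly from it implements preferential attachment.
	var targets []int

	// Seed the graph with a clique of m+1 nodes so the first
	// arrivals have enough distinct attachment points.
	for u := 0; u <= m; u++ {
		for v := u + 1; v <= m; v++ {
			adj[u] = append(adj[u], v)
			adj[v] = append(adj[v], u)
			targets = append(targets, u, v)
		}
	}

	for u := m + 1; u < nodes; u++ {
		chosen := map[int]bool{}
		for len(chosen) < m {
			chosen[targets[rng.Intn(len(targets))]] = true
		}
		for v := range chosen {
			adj[u] = append(adj[u], v)
			adj[v] = append(adj[v], u)
			targets = append(targets, u, v)
		}
	}
	return adj
}

func main() {
	adj := barabasiAlbert(1000, 3, 7)
	fmt.Println("node 0 degree (likely a hub):", len(adj[0]))
}
```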

Fully Connected: This topology is intended to be used for benchmarking. Our current plans only include one test using a fully connected network, which will not be too difficult for us in terms of resources. We also believe it can stress-test the caching (queueing) mechanism in gossipsub, since all nodes will have a peer set larger than the gossipsub degree parameter.

Hey Jason, it sounds like what you are designing are libp2p stack benchmark tests, but the name of the effort is misleading. I believe that many of the objections are due to the fact that a lot of us assumed this effort was gossipsub-specific. In that case, we suggest splitting the conversation into its respective topics: testing the different layers of the libp2p stack vs. testing gossipsub.


@jonny-rhea our setup tests an implementation running the gossipsub protocol on top of a raw libp2p host implementation. The host-based implementation has a lot of configurable options, but we’re focusing on gossipsub.

In order for gossipsub tests to be useful, the test runs should produce enough data to answer a variety of questions with respect to eth2 networking and message propagation time assumptions (e.g. throughput requirements implied by slot times and message sizes).

Specifically, we’re looking to understand how messages sent on the following types of topics propagate:

  • One to Many Topics: e.g. proposing a block
  • Many to Many Topics: e.g. attesting to blocks

Analysis of Variance

It’s important to ensure that the measured effects of a particular trial run aren’t due to chance.

  • How many trial runs are there going to be in order to estimate variance? To ensure statistically significant results, we can calculate the number of runs needed using a power analysis (see the sketch below).
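As a sketch of what that power analysis could look like: for a two-sample comparison of mean propagation times, a standard sample-size formula is n = 2 * ((z_(1-alpha/2) + z_(1-beta)) * sigma / delta)^2, where delta is the smallest difference worth detecting. The numbers below are placeholders, not measured values.

```go
package main

import (
	"fmt"
	"math"
)

// zQuantile returns the standard-normal quantile via the inverse
// error function from the standard library.
func zQuantile(p float64) float64 {
	return math.Sqrt2 * math.Erfinv(2*p-1)
}

// runsNeeded estimates trials per configuration to detect a difference
// of `delta` in mean propagation time given std-dev `sigma`,
// significance level `alpha`, and statistical power `power`.
func runsNeeded(sigma, delta, alpha, power float64) int {
	z := zQuantile(1-alpha/2) + zQuantile(power)
	n := 2 * math.Pow(z*sigma/delta, 2)
	return int(math.Ceil(n))
}

func main() {
	// Placeholder numbers: 40 ms spread, detect a 25 ms shift,
	// alpha = 0.05, power = 0.8.
	fmt.Println("runs per configuration:", runsNeeded(40, 25, 0.05, 0.8))
}
```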

Relevant Data to Capture (Design Vars)

Each design variable should have a small, medium, and large value and should be tested using a full factorial design matrix (a generation sketch follows the list).

  • Network Latency
  • Packet Loss
  • Bandwidth
  • Total Nodes (i.e. network size)
  • Message Size
  • Number of Connections per Node
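At three levels per variable, a full factorial over these six factors is 3^6 = 729 configurations. A sketch of enumerating that design matrix (the variable names and level values here are placeholders):

```go
package main

import "fmt"

// Each design variable takes a small/medium/large level.
type factor struct {
	name   string
	levels []string
}

func main() {
	factors := []factor{
		{"latency_ms", []string{"10", "50", "200"}},
		{"packet_loss_pct", []string{"0", "0.5", "2"}},
		{"bandwidth_mbps", []string{"10", "100", "1000"}},
		{"total_nodes", []string{"20", "100", "500"}},
		{"message_size_kb", []string{"1", "32", "512"}},
		{"conns_per_node", []string{"4", "8", "16"}},
	}

	// Enumerate the full factorial (3^6 = 729 rows) by counting in
	// mixed radix over the factor levels.
	total := 1
	for _, f := range factors {
		total *= len(f.levels)
	}
	for row := 0; row < total; row++ {
		idx := row
		cfg := make([]string, len(factors))
		for i, f := range factors {
			cfg[i] = f.name + "=" + f.levels[idx%len(f.levels)]
			idx /= len(f.levels)
		}
		fmt.Println(cfg)
	}
}
```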

Additional Questions

  • How do you intend to test large network sizes?
  • How do you intend to estimate network size (Total Nodes)?
    • For example, there won’t necessarily be a 1 to 1 ratio of validators to physical nodes on the network.

I understand, but your replies above (specifically with respect to the linear topology) imply that you are trying to isolate another part of the libp2p stack that has nothing to do with gossipsub.


The performance of any pubsub protocol is intrinsically related to many other factors, for example peering topology, security strategy, etc. Certain metrics are more affected than others; off the top of my head, last delivery hop is especially affected by peering topology, for example.

Agreed, but in this case I would isolate the issues in other parts of the stack before testing a large and complex piece such as gossipsub (as @jonny-rhea previously suggested).

Topology is of course of utter importance, but gossipsub builds its own overlay and has its own routing logic that doesn’t map directly to the underlying physical topology, which in this case complicates testing of gossipsub itself.

I agree with Raul and Proto’s assessment: the proposed topologies add little value to the intended tests at this stage, and randomizing connections seems to be a good compromise to get started and build a good baseline. More complicated topologies can and should be tested, but as a separate and more focused effort.

I’m also extremely interested in the metrics and approach suggested by @jonny-rhea in his previous post.


One way to isolate high CPU usage in gossipsub itself is to run it in isolation without any other components, mocking anything that doesn’t belong. This is very easy to set up as a simple unit test.
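For illustration, a rough sketch of such an isolated unit test against go-libp2p-pubsub (import paths and constructor signatures vary across go-libp2p versions, so treat this as an approximation rather than the exact harness):

```go
package gossip_test

import (
	"context"
	"testing"
	"time"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

// TestGossipIsolated wires two in-process hosts together and pushes one
// message through gossipsub alone, with no test harness and no extra
// stack components, so CPU profiles are easy to attribute.
func TestGossipIsolated(t *testing.T) {
	ctx := context.Background()

	h1, err := libp2p.New()
	if err != nil {
		t.Fatal(err)
	}
	h2, err := libp2p.New()
	if err != nil {
		t.Fatal(err)
	}

	ps1, _ := pubsub.NewGossipSub(ctx, h1)
	ps2, _ := pubsub.NewGossipSub(ctx, h2)

	// Dial h2 from h1 so the routers can form a mesh.
	if err := h1.Connect(ctx, peer.AddrInfo{ID: h2.ID(), Addrs: h2.Addrs()}); err != nil {
		t.Fatal(err)
	}

	top1, _ := ps1.Join("test-topic")
	top2, _ := ps2.Join("test-topic")
	sub, _ := top2.Subscribe()

	// Crude wait for subscription propagation and grafting; a real
	// test would poll the router state instead of sleeping.
	time.Sleep(time.Second)

	if err := top1.Publish(ctx, []byte("hello")); err != nil {
		t.Fatal(err)
	}
	msg, err := sub.Next(ctx)
	if err != nil {
		t.Fatal(err)
	}
	if string(msg.Data) != "hello" {
		t.Fatalf("unexpected payload: %q", msg.Data)
	}
}
```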

@protolambda profiled the gossip protocol and found that by far the largest hotspots were all related to crypto signing and verifying.

Good find! Thanks! I feel like if the individual parts of the stack haven’t been properly profiled and benchmarked, then this testing will unfortunately end up being pointless. I’m sure @raulk has more input from the Go implementation perspective in terms of profiling and known hotspots/bottlenecks. Regardless, this is not what this effort is about, am I right?


By random, I assumed they meant pseudorandom and seeded, so that repeated runs are deterministic. Admittedly, the first time I looked at the tree topology, the picture made me think there was a choke point in the middle. After reading it again, it actually seems okay.

With respect to @raulk’s comments about two-way connections… I feel like that is implied, since connections are implicitly two-way. Am I off base in that assumption?

Even if truly random, with enough runs some sort of distribution should form. For lack of a better phrase, that distribution should be deterministic: if you run 1000 (or some other large number of) random peering topologies and I run 1000, we should both get roughly the same distribution.


@jasonatran Answering the broader themes here rather than responding to the details of your remarks on the topologies; I can do so if you insist on retaining these topologies despite the opinions expressed by collaborators in this thread. But I’m hoping Whiteblock will reconsider the setup for more accurate, useful, and conclusive tests.

What people in this thread are trying to convey is that there is no point in forming structured topologies to then initiate messages from random peers in the topology, because each message will perform differently. A structured topology is biased in some manner by definition, and it’ll always perform better when your behaviour is aligned with that bias. For example, in a tree topology, where A is the root and X, Y, Z are leaves, initiating a broadcast from Z will have very different (worse) characteristics than initiating it from A, and AFAIK you’re not controlling for this bias. As a result, messages published by Z will be unfairly and artificially penalised in your measurements.

Gossipsub has been built for unstructured p2p networks. The ETH2 network will also be unstructured. By creating artificial topologies here, you are both: (a) testing something under a setting it was not optimised for, and (b) testing something in a way that it will never be used in real life (in ETH2). By deduction, the outcome of such testing will not be valuable to the libp2p community, nor to the ETH2 researchers.

Additionally, by forming predefined topologies ahead of time, you’re also leaving the dynamic grafting, pruning and gossiping behaviour out of the picture. I’d really like those aspects to be part of the benchmark.

Personally, I’d like to see a single randomised topology construction where the number of connections is a variable in [4,16] with steps of 2, assuming D=6 and Dhi=12. And I’d like to see 100 runs at each step, measuring variance, in order to normalise for the randomisation.
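A sketch of that sweep (hypothetical Go; runTrial stands in for one randomised-topology gossipsub run and returns a placeholder measurement):

```go
package main

import (
	"fmt"
	"math"
)

// runTrial is a placeholder for one randomised-topology gossipsub run:
// build a random topology with `conns` links per peer, publish from a
// random node, and return e.g. the full-propagation time in ms.
func runTrial(conns int, seed int64) float64 {
	_ = conns
	_ = seed
	return 0 // placeholder measurement
}

func main() {
	const runs = 100 // per step, to normalise for the randomisation
	for conns := 4; conns <= 16; conns += 2 {
		var sum, sumSq float64
		for i := 0; i < runs; i++ {
			x := runTrial(conns, int64(conns*1000+i)) // distinct seeds
			sum += x
			sumSq += x * x
		}
		mean := sum / runs
		variance := sumSq/runs - mean*mean
		fmt.Printf("conns=%2d mean=%.2f stddev=%.2f\n",
			conns, mean, math.Sqrt(variance))
	}
}
```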
