In a recent update, I reworked the crawler of CVE Crowd to list more posts, especially from platforms other than Mastodon.

This blog post goes into detail about which conditions accounts and posts have to fulfill to be listed. By making the mechanics more transparent, I am trying to make it easier for you to adjust your own privacy settings on the Fediverse to either include or exclude your posts.

Federated Timeline of infosec.exchange

The Fediverse is an ensemble of social networks that can communicate with each other. The most popular platform in the Fediverse is Mastodon, which is similar to Twitter/𝕏. Other platforms are Lemmy (similar to Reddit), Pixelfed (similar to Instagram) and PeerTube (similar to YouTube).

I chose the Mastodon instance infosec.exchange as the starting point for the CVE Crowd crawler because it is the largest Fediverse instance for IT security topics.

In the initial release, my crawler only used Mastodon’s search API. However, in some cases it did not seem to find all the posts it should. In particular, posts from other Fediverse platforms seemed to stay undiscovered.

So, I have now added the federated timeline of infosec.exchange as a data source. It contains all posts from accounts (of various platforms and instances) that are followed by an account on infosec.exchange.
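As a rough illustration of this data source: Mastodon exposes the federated timeline through its public timeline endpoint, GET /api/v1/timelines/public. The sketch below only builds the request URL. The function name and the default page size are assumptions for illustration, not the crawler's actual code, and some instances require authentication for this endpoint.

```python
from urllib.parse import urlencode

# Assumed base URL of the instance used as the starting point.
BASE_URL = "https://infosec.exchange"


def federated_timeline_url(limit=40, max_id=None):
    """Build the URL for one page of the federated (public) timeline.

    Hypothetical helper: `limit` is the page size and `max_id` asks for
    posts older than the given status ID, which allows paging backwards.
    """
    params = {"limit": limit}
    if max_id is not None:
        params["max_id"] = max_id
    return f"{BASE_URL}/api/v1/timelines/public?{urlencode(params)}"


if __name__ == "__main__":
    print(federated_timeline_url())
    print(federated_timeline_url(max_id="111301546671018547"))
```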

Conditions

Discovered posts must meet a number of conditions to be taken into account, which I'll go through in the following sections.

1. CVE Number in Post

This one is pretty obvious: All posts are searched for CVE numbers. Posts that do not contain a CVE number will be discarded.

CVE numbers can be given either as hashtags or as plain text. I put effort into covering all common and uncommon ways of writing CVE numbers.

Also, the CVE numbers must be valid, which means they must be registered by MITRE.
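To make this step more concrete, here is a minimal sketch of how CVE numbers could be extracted and normalized. The pattern, the accepted separators and the known_ids parameter (a stand-in for a lookup against the list of CVEs registered by MITRE) are all assumptions for illustration. The real crawler may recognize more spellings.

```python
import re

# Assumed pattern: "CVE", an optional separator (-, _ or space),
# a four-digit year, an optional separator, and at least four more digits.
# This covers plain text (CVE-2023-4863) and hashtag-style spellings
# (#CVE_2023_4863, #CVE20234863), case-insensitively.
CVE_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])CVE[-_ ]?(\d{4})[-_ ]?(\d{4,})(?!\d)",
    re.IGNORECASE,
)


def extract_cve_ids(text, known_ids=None):
    """Return normalized CVE IDs found in text, without duplicates.

    known_ids is a hypothetical set of registered CVE IDs. If it is given,
    IDs that are not in the set are dropped.
    """
    found = []
    for year, sequence in CVE_PATTERN.findall(text):
        cve_id = f"CVE-{year}-{sequence}"
        if known_ids is not None and cve_id not in known_ids:
            continue
        if cve_id not in found:
            found.append(cve_id)
    return found


if __name__ == "__main__":
    post = "Patch now: CVE-2023-4863, also tagged #cve_2023_4863 and #CVE202344487"
    print(extract_cve_ids(post))
```

A post for which the function returns an empty list would be discarded at this stage.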

2. Account Attributes

Next, certain attributes[1] of the account that wrote the post are evaluated. Currently, these are the attributes bot, discoverable and noindex.

The following snippet shows the format of a Mastodon post, including information about the originating account.

// GET /api/v1/statuses/111301546671018547
// (excerpt: selected fields of the post's "account" object)
{
	"id": "109551708143302638",
	"username": "kpwn",
	"display_name": "Konstantin :C_H:",
	"bot": false,
	"discoverable": true,
	"noindex": false
}

The values shown in this snippet are also the only configuration for which posts are processed further. If any of the three Booleans is flipped, your posts will be discarded.
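Expressed as code, the check on these three attributes could look like the following sketch. The function name is hypothetical, and treating missing or null values as "not eligible" is an assumption on my part.

```python
def account_is_eligible(account: dict) -> bool:
    """Return True only for bot=false, discoverable=true, noindex=false.

    Assumption: a missing or null attribute counts as not eligible,
    because the strict identity checks below only accept real Booleans.
    """
    return (
        account.get("bot") is False
        and account.get("discoverable") is True
        and account.get("noindex") is False
    )


if __name__ == "__main__":
    account = {
        "id": "109551708143302638",
        "username": "kpwn",
        "bot": False,
        "discoverable": True,
        "noindex": False,
    }
    print(account_is_eligible(account))
```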

Here is a screenshot of the account settings in the web interface that represents the bot attribute:

Screenshot of the "Public profile" settings, sub-item "Edit profile": "This is an automated account" represents the bot attribute.

The bot attribute in the web interface

And here is a screenshot of the respective settings for the discoverable and noindex attributes:

Screenshot of the "Public profile" settings, sub-item "Privacy and reach": "Feature profile and posts in discovery algorithms" represents the discoverable attribute. "Include profile page in search engines" represents the (negated) noindex attribute.

The discoverable and noindex attributes in the web interface

Bot

This is an automated account
Signal to others that the account mainly performs automated actions and might not be monitored

As mentioned above, posts from bots will be discarded. I decided to do this because I want CVE Crowd to contain people’s opinions on CVEs, not recycled posts from bots, which often mirror news sites, RSS feeds or the like.

Discoverable

Feature profile and posts in discovery algorithms
Your public posts and profile may be featured or recommended in various areas of Mastodon and your profile may be suggested to other users.

This is my attempt to respect people's privacy. I am aware that not everyone who posts to the Fediverse wants their posts to be publicly replicated on the Internet. Although this is not explicit consent, I assume that people who have opted in to discovery features will be less concerned about having their posts listed on CVE Crowd.

Noindex

Include profile page in search engines
Your profile page may appear in search results on Google, Bing, and others.

This is another privacy setting. After all, CVE Crowd is indexable by search engines, and I want it to be. Since posts are on CVE Crowd for a limited time of 24 hours, I think it is unlikely that individual posts will actually be indexed. But you never know. So I figured: better safe than sorry.

3. Honored Hashtags

The hashtag #NoBot is widely used in the Fediverse to mark content that should not be processed by bots. My crawler honors the hashtag as well. If the account bio contains it, all posts from that account will be discarded. The same applies if it is used in the post itself.

To allow people to explicitly prevent CVE Crowd from crawling, I have also introduced the custom hashtag #CveCrowdDeny, which behaves identically.
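A hashtag check along these lines could be sketched as follows. Mastodon delivers both the account bio and the post content as HTML, so the sketch strips tags first. The helper names, the simple tag-stripping regex and the case-insensitive comparison are assumptions for illustration.

```python
import re

# Hashtags that cause posts to be discarded, compared in lower case.
DENY_TAGS = {"nobot", "cvecrowddeny"}

HASHTAG_RE = re.compile(r"#(\w+)")
HTML_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(html: str) -> str:
    """Very rough HTML-to-text conversion (an assumption, not a real parser)."""
    return HTML_TAG_RE.sub("", html)


def has_deny_hashtag(*html_fragments: str) -> bool:
    """Return True if any fragment (bio or post content) has a deny hashtag."""
    for fragment in html_fragments:
        for tag in HASHTAG_RE.findall(strip_html(fragment)):
            if tag.lower() in DENY_TAGS:
                return True
    return False


if __name__ == "__main__":
    bio = '<p>Researcher <a href="https://example.org/tags/NoBot">#<span>NoBot</span></a></p>'
    content = "<p>Thoughts on CVE-2023-4863</p>"
    print(has_deny_hashtag(bio, content))
```

Passing the bio and the post content together mirrors the rule above: a match in either one is enough to discard the post.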

4. Blocked Accounts

Of course, I reserve the right to exclude certain posts or accounts from CVE Crowd. Posts from accounts that have been blocked by my crawler will not be listed. I try to block as few accounts as possible. However, spammers or bots that are not marked as such will be blocked.

5. User Exclusive CVE Limit

Finally, I limit the number of User Exclusive CVEs. These are CVE columns that contain posts from only one account. Currently, the limit is set to five. This means that if you post about six different CVEs no one else has posted about, only five of those CVEs will be listed.

This is to prevent simple spamming attempts, and also to ensure that power users do not take up the entire page.
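The limit can be illustrated with a small sketch. It assumes posts arrive as (CVE ID, account) pairs and that the first five User Exclusive CVEs per account, in processing order, are the ones kept. Which five are actually kept is not specified above, so that part is purely an assumption, as are the function and parameter names.

```python
from collections import defaultdict


def apply_exclusive_limit(posts, limit=5):
    """Return the CVE IDs to list, capping User Exclusive CVEs per account.

    posts: iterable of (cve_id, account) tuples in processing order.
    A CVE is "User Exclusive" if all of its posts come from one account.
    """
    accounts_by_cve = {}
    for cve_id, account in posts:
        accounts_by_cve.setdefault(cve_id, set()).add(account)

    exclusive_count = defaultdict(int)
    listed = []
    for cve_id, accounts in accounts_by_cve.items():
        if len(accounts) == 1:
            (owner,) = accounts
            if exclusive_count[owner] >= limit:
                continue  # this account already has `limit` exclusive CVEs
            exclusive_count[owner] += 1
        listed.append(cve_id)
    return listed


if __name__ == "__main__":
    posts = [(f"CVE-2024-{1000 + i}", "alice") for i in range(6)]
    posts += [("CVE-2024-2000", "alice"), ("CVE-2024-2000", "bob")]
    print(apply_exclusive_limit(posts))
```

In this example, alice's sixth exclusive CVE is dropped, while the CVE that bob also posted about is listed regardless of the limit.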

Wrapping Up

Thanks for your interest, and a special thanks to everyone who contributes to CVE Crowd by posting interesting news about recent CVEs.

Feel free to share this blog post with your colleagues and follow me on Mastodon!


[1] You can look up Mastodon's account attributes here: https://docs.joinmastodon.org/entities/Account/