Categorizing the current state of the internet is one thing (see domain mirroring); keeping up with it is another. Auto-categorization helps you discover and categorize new sites based on anonymized usage data. We already run this on our central instance, so with domain mirroring enabled you get the benefits without the work. However, running it on your own instance as well can provide quicker and more relevant results to your users.
Auto-categorization uses both textual and visual machine learning on site data obtained using headless Chrome browsers run as Kubernetes jobs. This keeps sites isolated from one another while also ensuring their resource usage is constrained. Visual machine learning is used to detect adult content on sites and is included in the classifier. The textual machine learning model must be trained according to your use case. Additionally, images from sites can be sent to Project Arachnid to scan for CSAM content.
Warning: While the categorizer makes every effort to avoid exposing you, as the human classifier, to harmful content, there is always a risk. View data obtained from crawled sites at your own risk.
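To picture the isolation model, here is a hypothetical sketch of the kind of Job the crawler worker creates per crawl. This is not the chart's actual Job spec; only the crawler-job naming is taken from the pod listings later in this guide, and the image name and resource figures are placeholders:

```yaml
# Hypothetical sketch only: the chart creates the real Job spec for you.
# It illustrates how per-crawl jobs keep sites isolated and resource-capped.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: crawler-job-   # real crawl pods are named crawler-job-<id>-<suffix>
spec:
  backoffLimit: 0              # assumption: a failed crawl is not blindly retried
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: crawler
          image: example/headless-chrome-crawler:latest # hypothetical image name
          resources:
            requests:          # guaranteed share of the node
              cpu: 500m
              memory: 1Gi
            limits:            # hard cap so one crawl can't starve the others
              cpu: "1"
              memory: 2Gi
```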
Training of the textual machine learning model is automated entirely from the admin app. To begin, we’ll have to label some sites to use for training. If you followed the getting started guide, your domain database probably looks something like this:
View/Edit google.com and nytimes.com to add them to the training set for their respective categories:
We will want a real example for the Adult Sites category. Add a well-known domain and add it to the training set:
Go to the Domains -> Browse page and filter by “Used for training”. Your training set should now look something like this:
Feel free to add more domains or categories. Before we train the model, we need to enable the auto-categorization deployments by setting the following in our values.yaml:
```yaml
providers:
  storage:
    model-test:
      platform: redis
      internal: true
      bucketHint: ss_apps
      secret: |
      expiry: '' # no expiry
    resources-test:
      platform: redis
      internal: true
      bucketHint: ss_apps
      secret: |
      expiry: 24h
categorizer:
  adminApp:
    storage:
      enabled: true
      provider: model-test
    crawlerStorage:
      enabled: true
      provider: resources-test
  autoCat:
    enabled: true
    crawlerWorker:
      enabled: true
      jobs:
        storage:
          enabled: true
          provider: resources-test
```
This example stores the machine learning model and crawl resources in redis, which is temporary storage. In production, you should use real object storage as categorizer.adminApp.storage.provider so your trained machine learning model is persistent. See the providers section in the full values.yaml file for examples. However, you can use redis as crawler storage if you have no requirement to keep the crawl data longer than needed.
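For instance, a persistent provider for the model might be declared alongside the redis ones and referenced from categorizer.adminApp.storage.provider. The shape below is illustrative only; the platform name, bucket name, and secret contents are assumptions, so check the providers section of the full values.yaml for the exact schema your object storage platform needs:

```yaml
providers:
  storage:
    model-prod:
      platform: s3          # assumption: an S3-compatible platform; see the full values.yaml
      bucketHint: ss_models # illustrative bucket name
      secret: |
        # your object storage credentials go here
categorizer:
  adminApp:
    storage:
      enabled: true
      provider: model-prod  # the trained model now outlives redis
```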
In production, you may also wish to run the crawlerWorker in a separate cluster to the rest of your categorizer. This is for a few reasons:

- Your main cluster may be a private cluster, which will have only 1 egress IP, so running the crawlerWorker in a public cluster is ideal.
- The crawlerWorker communicates with the adminApp only through redis, which is used as a work queue. So you could, for example, edit redis.internal.serviceType, serviceAnnotations, and serviceAllowedSourceRanges in the cluster hosting redis to make it available to the second cluster, then use redis.external in the second cluster to connect to it. Or, create your own redis instance and connect both to it using redis.external, as sketched below.
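A sketch of the two-cluster wiring. The key names come from the options above, but the values are placeholders (service type, annotations, and source ranges depend on your cloud), and the exact schema is in the full values.yaml:

```yaml
# Cluster A (hosts redis and the adminApp): expose redis beyond the cluster.
redis:
  internal:
    serviceType: LoadBalancer   # example service type
    serviceAnnotations: {}      # e.g. your cloud's internal-LB annotations
    serviceAllowedSourceRanges:
      - 203.0.113.0/24          # example: cluster B's egress range
```

```yaml
# Cluster B (hosts only the crawlerWorker), pointed at cluster A's redis.
redis:
  external: {} # connection details go here; see the full values.yaml for the keys
```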
We’ll proceed with the simple example for now, using temporary storage in a single cluster. Deploy the above, then run kubectl get pods. You should see the crawler worker now:
```
NAME                                        READY   STATUS    RESTARTS   AGE
safesurfer-crawler-worker-c67877594-xv5q6   1/1     Running   0          63s
```
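Throughout this guide, deploying or updating the release means applying your values.yaml with Helm (or re-imaging via ss-config, where noted). Assuming a release named safesurfer installed from a chart reference safesurfer/safesurfer in a safesurfer namespace (substitute your own names), that looks like:

```bash
# Assumption: release, chart reference, and namespace are all placeholders.
helm upgrade --install safesurfer safesurfer/safesurfer \
  -f values.yaml -n safesurfer
```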
The crawler worker creates new pods for each crawl in the cluster and namespace that it’s deployed in. To start training a new model, navigate to Domains -> ML Models in the admin app. Then, hit Train new model.
Then, select “Add & Train” at the bottom. A new model should appear. Hit the “View Progress” button:
If you run kubectl get pods, you should see that 3 crawler pods have been created:
```
NAME                                                     READY   STATUS    RESTARTS   AGE
crawler-job-46b1d3fe-d343-44d6-81dc-b6099f5eef92-j8xd2   1/1     Running   0          67s
crawler-job-745b704b-0232-4630-974b-961e4b4e6bfb-7gks9   1/1     Running   0          67s
crawler-job-def2793d-8e9e-4f6a-9163-8d0bfaed9f1e-qtgjc   1/1     Running   0          67s
```
Once they finish, the model will train and be saved. Refresh the ML models page and make the model active once it’s done:
Now let’s set up the rest of the pipeline.
In the cluster hosting the admin app, update the following:
```yaml
categorizer:
  adminApp:
    newDomainNotifierUser:
      enabled: true
      queueOnly: false
      user: New-Domain-Notifier
      accessKey: generate-a-password
  autoCat:
    classifier:
      enabled: true
      classifierStorage:
        provider: model-test
      crawlerStorage:
        provider: resources-test
      adultCategoryID: 1 # Change if your "Adult" category has a different ID
```
Then update the release. After running kubectl get pods, you should eventually see:
```
NAME                                    READY   STATUS    RESTARTS   AGE
safesurfer-classifier-df9b5fcf6-5l525   1/1     Running   0          4m14s
```
Now that we have a classifier, we just need to tell the DNS to send unknown domains to the API of our admin app. In the cluster hosting the DNS, or your ss-config values, update the following:
```yaml
dns:
  newDomainNotifier:
    enabled: true
    categorizer:
      user: New-Domain-Notifier
      key: generate-a-password # The password you generated for the newDomainNotifierUser in the previous code block
    addConfig:
      queue: false
      crawling:
        enabled: true
```
Warning: So that we see results instantly, we’re setting queue to false. This can open your postgres database to DDoS, deliberate or not, since all new domains will be added to it.
Then upgrade the Chart or re-image the server. If using k8s, you will see something like the following after running kubectl get pods:
```
NAME                                              READY   STATUS    RESTARTS   AGE
safesurfer-new-domain-notifier-67fb55499f-r6nr2   1/1     Running   0          80s
```
The DNS will also roll out again. Once done, any unknown domain should be sent to the categorizer for classification. To test it, let’s try doing a dig for a suspicious domain:
```bash
dig @$DNS_IP xvideos.com
```
After a minute, you should see that the domain has now been added:
A crawl should have been created for the domain. Refresh until it shows Success:
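You can also follow the crawl from the cluster side; as shown earlier, each crawl runs in its own crawler-job pod:

```bash
# Watch crawl pods get created and complete (Ctrl-C to stop)
kubectl get pods --watch | grep crawler-job
```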
After the crawl is successful, the classification will start. This may take a few minutes. You can wait for it by refreshing the Domains -> Suggestions screen. Once done, you should get a suggestion like this:
Note that News is being suggested for the account screens because they aren’t explicit and we haven’t given the model a better alternative category. Suggestions will improve with more training data. Feel free to confirm/deny the suggestions.
As mentioned earlier, it isn’t recommended to try to auto-categorize every new site that comes in. To limit the number of domains added, you can enable the new domain queue and update the config of the new domain notifier to use it. In the cluster hosting the admin app, update the following keys:
```yaml
categorizer:
  adminApp:
    newDomainNotifierUser:
      queueOnly: true
  autoCat:
    addFromDNS:
      enabled: true
      categorizerAccessKey: generate-a-password
```
After upgrading the deployment, you should see something like the following:
```
NAME                                           READY   STATUS    RESTARTS   AGE
safesurfer-new-domain-queue-56ddc7f84d-lbftg   1/1     Running   0          111s
```
The new domain queue will store and prioritize queued domains. Now in the cluster hosting the DNS, or your ss-config values, update the following keys:
```yaml
dns:
  newDomainNotifier:
    addConfig:
      queue: true
```
Upgrade the deployment or re-image the server. Once done, new domains should be added to the queue. The best way to test the new domain queue is real traffic, but for now we’ll look up another new domain and wait:
```bash
dig @$DNS_IP twitter.com
```
The new domain queue, in its default configuration, analyzes and adds 10 domains every 10 minutes. If you’re just using the test traffic, you should see it log something like this after a few minutes:
```
2023/03/28 02:03:32 [info] Analyzing 1 domains
2023/03/28 02:03:37 [info] Added 1 domains successfully
```
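To wait for this without polling, you can follow the queue’s logs (the deployment name is taken from the pod listing above):

```bash
kubectl logs -f deploy/safesurfer-new-domain-queue
```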
Then, a crawl and eventual classification should appear:
The suggestion here is obviously not correct, but there was no training data to suit this domain. The better the quality of the training data, the better the suggestions you will get.
By hitting “View” on the suggestion, you can see the data that was gathered from the site and used to make a classification, including a full-page screenshot:
You can view blurred versions of any of the images to handle sensitive content:
In its default form, the crawler doesn’t handle the international internet unless you incorporate international sites into the training data. You can remedy this by adding a translation provider to the crawler:
```yaml
providers:
  translation:
    gcpexample:
      platform: gcp
      maxLen: 5000
      secret: |
        credentials: |
          {
            "type": "service_account",
            "project_id": "example-project-id-123",
            "private_key_id": "example",
            "private_key": "-----BEGIN PRIVATE KEY-----\nexample\n-----END PRIVATE KEY-----\n",
            "client_email": "example@example-project-id-123.iam.gserviceaccount.com",
            "client_id": "123",
            "auth_uri": "https://accounts.google.com/o/oauth2/auth",
            "token_uri": "https://oauth2.googleapis.com/token",
            "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
            "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/example%40example-project-id-123.iam.gserviceaccount.com"
          }
      project: example-project-id-123
categorizer:
  autoCat:
    crawlerWorker:
      jobs:
        translation:
          enabled: true
          provider: gcpexample
dns:
  newDomainNotifier:
    addConfig:
      crawling:
        config:
          translate: true
          detectLanguage: true
          homeLanguage: en
          # OR the following to discard international pages
          # translate: false
          # detectLanguage: true
          # homeLanguage: en
```
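Once deployed, you can exercise the translation path the same way as the earlier tests, by looking up a non-English site (the domain here is an arbitrary example):

```bash
dig @$DNS_IP lemonde.fr
```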
To add CSAM checking, obtain a Project Arachnid API key and add it to the classifier:
```yaml
categorizer:
  autoCat:
    classifier:
      csamChecker:
        enabled: true
        # Create a new category (blocked by default) and add the ID here. It is recommended to turn off
        # "Suggestions Only" for autocat on this category to immediately block any domains with detected CSAM.
        categoryID: 0
        apiKey: my-api-key
```
The classifier will now check images against the Project Arachnid API.