[Submitted on 15 May 2023 (v1), last revised 18 May 2023 (this version, v2)]
Abstract: Recent research has suggested that there are clear differences in the
language used in the Dark Web compared to that of the Surface Web. As studies
on the Dark Web commonly require textual analysis of the domain, language
models specific to the Dark Web may provide valuable insights to researchers.
In this work, we introduce DarkBERT, a language model pretrained on Dark Web
data. We describe the steps taken to filter and compile the text data used to
train DarkBERT to combat the extreme lexical and structural diversity of the
Dark Web that may be detrimental to building a proper representation of the
domain. We evaluate DarkBERT and its vanilla counterpart along with other
widely used language models to validate the benefits that a Dark Web
domain-specific model offers in various use cases. Our evaluations show that DarkBERT
outperforms current language models and may serve as a valuable resource for
future research on the Dark Web.
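As a rough illustration of how a pretrained masked language model such as DarkBERT could be queried, the sketch below uses the Hugging Face transformers fill-mask pipeline. This is not taken from the paper: the checkpoint path is a placeholder assumption, and the example assumes a RoBERTa-style checkpoint whose mask token is "<mask>".

    # Minimal sketch (assumptions noted above), not the authors' evaluation code.
    from transformers import pipeline

    # Placeholder checkpoint path; substitute the actual DarkBERT checkpoint.
    fill_mask = pipeline("fill-mask", model="path/to/darkbert-checkpoint")

    # Assuming a RoBERTa-style tokenizer, the mask token is "<mask>".
    for prediction in fill_mask("The vendor accepts <mask> as payment."):
        print(prediction["token_str"], round(prediction["score"], 3))

A domain-specific model would be expected to rank Dark Web-typical completions higher than a model pretrained only on Surface Web text, which is the kind of comparison the paper's evaluations make.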
Submission history
From: Youngjin Jin
[v1] Mon, 15 May 2023 12:23:10 UTC (15,872 KB)
[v2] Thu, 18 May 2023 05:02:29 UTC (15,872 KB)