SemHash: Fast implementation of semantic text de-duplication to improve data cleaning efficiency

General Introduction

SemHash is a lightweight, flexible tool for de-duplicating datasets by semantic similarity. It combines the fast embedding generation of Model2Vec with the efficient approximate nearest neighbor (ANN) similarity search of Vicinity. SemHash supports both single-dataset de-duplication (e.g., cleaning up a training set) and multi-dataset de-duplication (e.g., making sure there is no overlap between the test and training sets). It works on simple datasets, such as lists of texts, as well as more complex ones, such as multi-column QA datasets. It also includes functionality for inspecting the de-duplication results, making it easier to understand and optimize the data cleaning process.

 

Function List

  • Fast embedding generation: quickly and accurately generate embeddings with Model2Vec.
  • Efficient similarity search: efficient ANN similarity search with Vicinity.
  • Single-dataset de-duplication: clean up a single dataset by removing duplicate records.
  • Multi-dataset de-duplication: ensure there is no overlap between multiple datasets to prevent data leakage.
  • Multi-column dataset de-duplication: de-duplicate complex datasets, such as QA datasets.
  • De-duplication result inspection: detailed inspection of de-duplication results to help optimize the data cleaning process.

 

Usage Guide

Installation process

  1. Open a terminal or command line tool.
  2. Enter the following command to install SemHash:
   pip install semhash
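
To verify the installation, you can try importing the package; this is just a quick sanity check and assumes a standard Python environment on your PATH:

   python -c "from semhash import SemHash; print('SemHash imported successfully')"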

Usage

Single dataset de-duplication

  1. Load the dataset:
   from datasets import load_dataset
   from semhash import SemHash

   texts = load_dataset("ag_news", split="train")["text"]
  2. Initialize a SemHash instance:
   semhash = SemHash.from_records(records=texts)
  3. De-duplicate the dataset:
   deduplicated_texts = semhash.self_deduplicate().deduplicated

Multiple dataset de-duplication

  1. Load two datasets:
   train_texts = load_dataset("ag_news", split="train")["text"]
   test_texts = load_dataset("ag_news", split="test")["text"]
  2. Initialize a SemHash instance on the training texts:
   semhash = SemHash.from_records(records=train_texts)
  3. De-duplicate the test dataset against the training dataset:
   deduplicated_test_texts = semhash.deduplicate(records=test_texts, threshold=0.9).deduplicated

Multi-column dataset de-duplication

  1. Load a multi-column dataset:
   dataset = load_dataset("squad_v2", split="train")
   records = [dict(row) for row in dataset]
  2. Initialize a SemHash instance on the relevant columns:
   semhash = SemHash.from_records(records=records, columns=["question", "context"])
  3. De-duplicate the dataset:
   deduplicated_records = semhash.self_deduplicate().deduplicated

De-duplication result inspection

  1. Inspect which records were flagged as duplicates and why:
   result = semhash.self_deduplicate(threshold=0.99)
   for duplicate in result.duplicates:
       print("RECORD:")
       print(duplicate.record)
       if duplicate.exact:
           print("Exact match!")
       else:
           print("DUPLICATES:")
           for corpus_duplicate in duplicate.duplicates:
               print(corpus_duplicate)
       print("-" * 25)

With the above steps, you can quickly get started with SemHash for semantic de-duplication of datasets and improve the efficiency of data cleaning.


 

Detailed description of semhash

We are very excited to announce the release of semhash, our semantic de-duplication and dataset multi-tool (more features coming soon).

Summary

A recent area of interest, especially when training Large Language Models (LLMs), is that while having a lot of data is nice, having a smaller amount of higher-quality data is often better. A good example of this can be found in the fineweb blog post, where the authors started with a very large web-crawl dump dataset and applied many quality steps to it, including de-duplication and a series of quality filters.

At Minish, we are interested in unlocking new possibilities by building very fast models. As you may know, we create some of the best and smallest fast models in the world, such as potion-base-8m. One of our areas of interest is approximate de-duplication: we want to remove semantically very similar documents from a corpus. Previous text de-duplication algorithms, such as MinHash or SimHash, operate on character or word n-grams, and can therefore only find similarity between sequences that look alike on the surface, ignoring semantic similarity.

While de-duplication sounds like something that only benefits the training of large language models, it is also really useful for checking overlap in small datasets: even approximate overlap between the training and test sets can lead to overestimating performance, and near-duplicates within the training set can waste computational resources, inflate feature-importance estimates, and cause other problems.

In addition, de-duplication can be used to get an overview of larger datasets: checking for near-duplicates with semhash takes only (milli)seconds and lets you see which items in your dataset look similar. If the flagged pairs make sense: great! If there are no duplicates: also great! Anything is better than training on incorrect data.

What do I use de-duplication for?

Here are some cool use cases to give you an idea of when it makes sense to do de-duplication:

Classification

As mentioned above, it is important that there is no information overlap between the training and test sets. If there is overlap, you are probably overestimating performance, because the model no longer needs to generalize to do well. Removing duplicates from the training set itself can also be very useful: many near-duplicates of the same record in the training set can cause the model to overestimate the importance of that record's features, and in any case wastes computational resources and inflates the apparent model fit.
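
As a rough sketch of that workflow, reusing the ag_news splits from the examples above (the classifier itself is omitted, and in a real pipeline you would carry the labels along with the texts, e.g. by passing dict records):

from datasets import load_dataset
from semhash import SemHash

train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]

# Index the training texts, then drop near-duplicates within the training set itself.
semhash = SemHash.from_records(records=train_texts)
clean_train = semhash.self_deduplicate(threshold=0.9).deduplicated

# Drop any test record that closely matches something in the training set, so the
# test score measures generalization rather than memorization.
clean_test = semhash.deduplicate(records=test_texts, threshold=0.9).deduplicated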

RAG system

Duplicates in a RAG system sound rare until you consider that most RAG systems are built on chunks: while exact duplicate documents may be rare, duplicate chunks within or across documents are much more common. Duplicate chunks in your knowledge base increase storage costs, increase the risk of retrieving irrelevant chunks, and force you to implement a diversification strategy earlier than necessary.
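
For example, a minimal sketch of de-duplicating chunks before they reach the knowledge base might look like this (the chunks list is a made-up placeholder; in practice it would come from your own chunking step):

from semhash import SemHash

# Hypothetical chunks produced by a RAG chunking step.
chunks = [
    "Returns are accepted within 30 days of purchase.",
    "We accept returns within 30 days of the purchase date.",
    "Shipping is free for orders over $50.",
]

# Index the chunks and drop near-duplicates before they are embedded and stored.
semhash = SemHash.from_records(records=chunks)
unique_chunks = semhash.self_deduplicate(threshold=0.9).deduplicated
print(f"kept {len(unique_chunks)} of {len(chunks)} chunks")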

Interpreting your corpus

By running semhash with a low threshold, you can quickly see which documents are similar to which, and which documents have no close neighbors at all. This gives you a good idea of what to focus on, what is missing from your data, and how your documents relate to each other.
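
A sketch of such an exploration run, assuming the result attributes shown later in this post (duplicate_ratio, duplicates) behave as described there:

from datasets import load_dataset
from semhash import SemHash

texts = load_dataset("ag_news", split="train")["text"]
semhash = SemHash.from_records(records=texts)

# A deliberately low threshold surfaces loosely related documents, not just near-copies.
result = semhash.self_deduplicate(threshold=0.7)
print(f"fraction of records with a close neighbour: {result.duplicate_ratio:.1%}")

# Look at a handful of flagged records to get a feel for the corpus.
for i, duplicate in enumerate(result.duplicates):
    if i >= 5:
        break
    print("RECORD:", duplicate.record)
    for corpus_duplicate in duplicate.duplicates:
        print("  SIMILAR:", corpus_duplicate)
    print("-" * 25)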

How does it work?

At its core, semhash takes a collection of strings or dictionaries as input. You first build an index from a set of reference documents and then use it to de-duplicate an incoming set: any incoming document that is similar to a document in the reference set is removed and stored separately, together with its near-duplicates from the reference set.

from datasets import load_dataset
from semhash import SemHash

dataset = load_dataset("ag_news")
train = dataset["train"]
test = dataset["test"]

# This creates an index over your training set. All records are stored in full.
semhash = SemHash.from_records(records=train, columns=["text"])

# This de-duplicates `test` against `train`: any record in `test` that closely
# matches a record in `train` is removed.
result = semhash.deduplicate(records=test, threshold=0.9)

# Records without duplicates in the reference set
result.deduplicated

# Records flagged as duplicates
result.duplicates

During fitting, all documents are first encoded by an encoder. The default encoder is potion-base-8m, a model2vec model. The encoded documents are then stored in a vicinity vector store backed by usearch. For the incoming documents, we then compute embeddings with the same encoder and retrieve their nearest neighbors from the vector store. Any incoming document whose nearest-neighbor similarity is above the threshold is removed.
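
If you want to trade a little speed for quality, you can load a different Model2Vec encoder yourself and hand it to semhash. The model= keyword below follows the library's documented pattern but is an assumption here, so check the current API before relying on it:

from datasets import load_dataset
from model2vec import StaticModel
from semhash import SemHash

texts = load_dataset("ag_news", split="train")["text"]

# Load a larger Model2Vec encoder instead of the default potion-base-8m.
model = StaticModel.from_pretrained("minishlab/potion-base-32M")

# Assumption: the encoder is passed via a `model` keyword argument.
semhash = SemHash.from_records(records=texts, model=model)
result = semhash.self_deduplicate(threshold=0.9)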

Since all of these components are very fast, de-duplicating even very large datasets takes only minutes. For example, de-duplicating the entire Squad-2.0 dataset, which has about 130,000 samples, takes only 7 seconds, including vectorization, fitting the index, and the de-duplication itself. Smaller datasets take a fraction of that time, and even datasets containing millions of documents take only a few minutes. For a comprehensive comparison, see our benchmarks.
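
If you want a rough timing on your own hardware (the exact numbers will of course differ from the ones above), a simple end-to-end measurement could look like this:

import time

from datasets import load_dataset
from semhash import SemHash

# End-to-end timing: vectorization, fitting the index, and the de-duplication itself.
records = [dict(row) for row in load_dataset("rajpurkar/squad_v2", split="train")]

start = time.perf_counter()
semhash = SemHash.from_records(records=records, columns=["question", "context"])
result = semhash.self_deduplicate(threshold=0.9)
elapsed = time.perf_counter() - start

print(f"de-duplicated {len(records)} records in {elapsed:.1f}s "
      f"({result.duplicate_ratio:.1%} flagged as duplicates)")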

Interpretability

semhash can also be used to investigate your dataset. Using self_deduplicate, you can de-duplicate the training set against itself; we'll use that as a starting point:

from datasets import load_dataset
from semhash import SemHash
dataset = load_dataset("ag_news")
train = dataset["train"]
test = dataset["test"]
# This will create an index on your training set. All records are stored in full.
semhash = SemHash.from_records(records=train, columns=["text"])
result = semhash.self_deduplicate(threshold=0.9)

Let's dive into what you can do with result. First, you can simply get all the de-duplicated records:

result.deduplicated

These records are identical to the ones you put in, so you can use them in the rest of your ML pipeline. semhash doesn't change your data, it just reduces the size of the dataset.

You can easily see the ratio of duplicate records:

result.duplicate_ratio

or the ratio of exact duplicates:

result.exact_duplicate_ratio

You can also see which items were marked as duplicates, and why. Each duplicate document is stored together with the example(s) in the index that caused it to be marked as a duplicate, and exact duplicates are flagged as such. The following code example demonstrates basic usage.

for duplicated_record in result.duplicates:
    print(duplicated_record.record)
    if duplicated_record.exact:
        print("Exact match")
        continue
    for index_duplicate in duplicated_record.duplicates:
        print(index_duplicate)
    print("-" * 25)

For convenience, we also provide a helper function that shows the least similar records from your set of duplicates:

result.get_least_similar_from_duplicates(1)

If this record still looks like a genuine duplicate of its neighbors, then your de-duplication threshold makes sense. If not, you can re-threshold the result set, which recomputes which records count as duplicates under a new, stricter threshold. This is shown below:

print(result.duplicate_ratio)
result.rethreshold(0.95)
print(result.duplicate_ratio)

A general strategy, then, is to start with a relatively low threshold and raise it until the records returned by result.get_least_similar_from_duplicates start to make sense. In our experiments the default threshold of 0.9 worked well, but be sure to check it against your specific use case.
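
One way to put that strategy into code is a simple sweep that starts low and tightens the threshold step by step. This sketch assumes, as the snippet above suggests, that rethreshold updates the result in place and that the threshold is only raised after the fact:

from datasets import load_dataset
from semhash import SemHash

texts = load_dataset("ag_news", split="train")["text"]
semhash = SemHash.from_records(records=texts)

# Start low, then tighten the threshold and watch how the duplicate ratio changes.
result = semhash.self_deduplicate(threshold=0.8)
for threshold in (0.85, 0.9, 0.95):
    result.rethreshold(threshold)
    print(f"threshold={threshold}: duplicate_ratio={result.duplicate_ratio:.2%}")
    # Inspect the least similar record still counted as a duplicate at this threshold.
    print(result.get_least_similar_from_duplicates(1))
    print("-" * 25)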

Multi-column data

semhash also supports multi-column datasets, so you can de-duplicate datasets whose text is spread across several columns. For example, in a question-answering dataset you don't want to drop records merely because the questions are similar or merely because the contexts are similar; a record should only count as a duplicate if both fields are similar enough.

This is a difficult problem to solve, but semhash can handle it as well.

The following code snippet demonstrates how it works:

from datasets import load_dataset
from semhash import SemHash
dataset = load_dataset("rajpurkar/squad_v2")
train = dataset["train"]
# This will create an index on your training set. All records are stored in full.
semhash = SemHash.from_records(records=train, columns=["context", "question"])
result = semhash.self_deduplicate(threshold=0.9)

This computes similarity per field and only marks a record as a duplicate when both fields are similar.
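
To sanity-check the multi-column behaviour, you can peek at one flagged record. This assumes each flagged record keeps its original dictionary form, with the columns accessible by name:

# Print one flagged question/context pair (assumes dict-shaped records).
for duplicate in result.duplicates:
    print("QUESTION:", duplicate.record["question"])
    print("CONTEXT :", duplicate.record["context"][:200])
    break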
