General Introduction
SemHash is a lightweight and flexible tool for dataset de-duplication by semantic similarity. It combines the fast embedding generation of Model2Vec with the efficient ANN (Approximate Nearest Neighbor) similarity search of Vicinity. SemHash supports both single-dataset de-duplication (e.g., cleaning up a training set) and multi-dataset de-duplication (e.g., making sure that there is no overlap between the test and training sets). It is suitable for simple datasets, such as text lists, as well as more complex datasets, such as multi-column QA datasets. In addition, it lets you inspect the de-duplication results, making it easier to understand and optimize the data cleaning process.
Function List
- Fast Embedding Generation: Use Model2Vec to generate embeddings quickly and accurately.
- Efficient Similarity Search: Efficient ANN similarity search with Vicinity.
- Single dataset de-duplication: clean up a single dataset to remove duplicate data.
- Multiple dataset de-duplication: Ensure that there is no overlap between multiple datasets to prevent data leakage.
- Multi-column dataset de-duplication: supports de-duplication of complex datasets, such as QA datasets.
- De-duplication Result Checking: Provide detailed checking function of de-duplication result to help optimize the data cleaning process.
Using Help
Installation process
- Open a terminal or command line tool.
- Enter the following command to install SemHash:
pip install semhash
Usage
Single dataset de-duplication
- Load the dataset:
from datasets import load_dataset
from semhash import SemHash
texts = load_dataset("ag_news", split="train")["text"]
- Initialize the SemHash instance:
semhash = SemHash.from_records(records=texts)
- De-duplicate the dataset:
deduplicated_texts = semhash.self_deduplicate().deduplicated
Multiple dataset de-duplication
- Load two datasets:
train_texts = load_dataset("ag_news", split="train")["text"]
test_texts = load_dataset("ag_news", split="test")["text"]
- Initialize the SemHash instance:
semhash = SemHash.from_records(records=train_texts)
- De-duplicate the test dataset against the training dataset:
deduplicated_test_texts = semhash.deduplicate(records=test_texts, threshold=0.9).deduplicated
Multi-column dataset de-duplication
- Load a multi-column dataset:
dataset = load_dataset("squad_v2", split="train")
records = [dict(row) for row in dataset]
- Initialize the SemHash instance:
semhash = SemHash.from_records(records=records, columns=["question", "context"])
- De-duplicate the dataset:
deduplicated_records = semhash.self_deduplicate().deduplicated
Checking de-duplication results
- View the records that were flagged as duplicates:
result = semhash.self_deduplicate(threshold=0.99)
for duplicate in result.duplicates:
    print("RECORD:")
    print(duplicate.record)
    # Exact duplicates are flagged separately from near-duplicates.
    if duplicate.exact:
        print("Exact match!")
    print("DUPLICATES:")
    # Each entry is a record from the index that caused the match.
    for corpus_duplicate in duplicate.duplicates:
        print(corpus_duplicate)
    print("-" * 25)
With the above steps, you can quickly get started with SemHash for semantic de-duplication of datasets and improve the efficiency of data cleaning.
Detailed description of semhash
We are very excited to announce the release of semhash, our semantic de-duplication and dataset multi-tool (other features coming soon).
Summary
A recent area of interest, especially in training Large Language Models (LLMs), is the observation that while having a lot of data is nice, having somewhat less data of higher quality is even better. A good example of this can be found in the fineweb blog post, where the authors started with a very large set of generic web crawl dumps and performed many quality checks on that dataset, including de-duplication.
At Minish, we are interested in unlocking new possibilities by building very fast models. As you may know, we created potion-base-8m, the best small, fast model in the world. One of our areas of interest is approximate de-duplication: we want to remove semantically very similar documents from a corpus. Previous text de-duplication algorithms, such as minhash or simhash, operate on character or word n-grams, and thus can only find similarities between sequences that are orthographically similar, ignoring semantic similarity.
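To make the limitation of n-gram methods concrete, here is a minimal sketch that compares strings by the Jaccard overlap of their character trigrams. This is not minhash or simhash itself, only the kind of surface-level overlap those algorithms approximate; the function names and example sentences are illustrative assumptions.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A near-verbatim copy shares almost all of its trigrams with the original.
near_copy = jaccard("The cat sat on the mat.", "The cat sat on the mat!")

# A paraphrase with the same meaning shares almost none.
paraphrase = jaccard("The film was fantastic.", "I really loved that movie.")

print(f"near copy:  {near_copy:.2f}")
print(f"paraphrase: {paraphrase:.2f}")
```

The near copy scores high while the paraphrase scores close to zero, even though a human reader would call the second pair similar in meaning. That gap is what embedding-based de-duplication aims to close.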
While de-duplication sounds like something that can only benefit the training of large language models, it is also really useful for checking overlap in small datasets: even close overlap between the training and test sets can lead to overestimation of performance, and the presence of close duplicates in the training set can lead to wasted computational resources, overestimation of feature importance, and potentially other problems.
In addition, de-duplication techniques can be used to give you an overview of larger datasets: checking for near-duplicates with semhash takes only (milli)seconds and allows you to see which items in your dataset look similar. If these make sense: great! If there are no duplicates... also great! Anything is better than training on incorrect data.
How do I use de-duplication?
Here are some cool use cases to give you an idea of when it makes sense to do de-duplication:
Classification
As mentioned above, it is important that there is no overlap of information between the training and test sets. Overlap usually means that you are overestimating performance, because the model no longer needs to generalize to perform well. However, removing duplicates from the training set can also be very useful. Having many duplicates of the same record in the training set can cause the model to overestimate the importance of that record's features, and in any case leads to wasted computational resources and an overestimation of model fit.
RAG systems
Duplicate items in a RAG system sound rare until you consider that most RAG systems are built using chunks: while exact duplicates of whole documents may be rare, duplicate chunks between or within documents are much more common. Having duplicate chunks in your knowledge base increases storage costs, increases the risk of retrieving irrelevant chunks, and forces you to implement a diversification strategy earlier than necessary.
Interpreting your corpus
By running semhash with a low threshold, you can quickly see which documents are similar to each other and which are not. This gives you a good idea of what to focus on, what's missing from your data, and how your documents relate to each other.
How does it work?
At its core, semhash takes a collection of strings or dictionaries as input. You first initialize the model with a set of reference documents and then use this set to de-duplicate an incoming set. Any incoming document that is similar to a document in the reference set is removed and stored separately, together with its near-duplicates from the reference set.
from datasets import load_dataset
from semhash import SemHash
dataset = load_dataset("ag_news")
train = dataset["train"]
test = dataset["test"]
# This will create an index on your training set. All records are stored in full.
semhash = SemHash.from_records(records=train, columns=["text"])
# This de-duplicates `test` against `train`. Any items in test that
# also appear (approximately) in train are removed from test.
result = semhash.deduplicate(test, threshold=0.9)
# The test set without duplicates
result.deduplicated
# The removed duplicates
result.duplicates
During fitting, all documents are first encoded by an encoder. The default encoder is potion-base-8m, a model2vec model. These documents are then stored in a vicinity vector store, backed by usearch. Then, for the set of incoming documents, we first encode them using the specified encoder and then retrieve the nearest neighbors from the vector store. Each incoming document that has a nearest neighbor with a similarity above the threshold is removed.
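The encode, retrieve, and threshold steps described above can be sketched in a few lines. This is a simplified illustration, not semhash's implementation: the "embeddings" are hand-made three-dimensional vectors rather than model2vec output, and the nearest-neighbor lookup is brute force rather than an approximate index. All names below are assumptions made for the example.

```python
import math

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(reference: dict[str, Vector], incoming: dict[str, Vector], threshold: float = 0.9):
    """Split `incoming` into kept records and (record, match, score) duplicates."""
    kept, duplicates = [], []
    for text, vec in incoming.items():
        # Brute-force nearest neighbor; a real ANN index does this approximately.
        match, score = max(
            ((ref_text, cosine(vec, ref_vec)) for ref_text, ref_vec in reference.items()),
            key=lambda pair: pair[1],
        )
        if score > threshold:
            duplicates.append((text, match, score))
        else:
            kept.append(text)
    return kept, duplicates


# Hand-made toy "embeddings" (purely illustrative, not model output).
reference = {
    "Stocks rally on Wall Street": [0.9, 0.1, 0.0],
    "Local team wins the cup": [0.0, 0.2, 0.9],
}
incoming = {
    "Wall Street stocks surge": [0.88, 0.15, 0.02],
    "New phone released today": [0.1, 0.95, 0.1],
}

kept, duplicates = deduplicate(reference, incoming, threshold=0.9)
print("kept:", kept)
print("duplicates:", [(text, match) for text, match, _ in duplicates])
```

The first incoming record points in nearly the same direction as a reference vector and is flagged; the second has no close neighbor and is kept.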
Since all these components are very fast, de-duplication of even very large datasets takes only a few minutes. For example, de-duplicating the entire Squad-2.0 dataset, which has 130,000 samples, takes only 7 seconds. This includes vectorization, fitting the index, and the actual de-duplication. Smaller datasets take a fraction of that time, and even datasets containing millions of documents take only a few minutes. For a comprehensive comparison, see our benchmarks.
Interpretability
semhash can also be used to investigate your dataset. With self_deduplicate, you can de-duplicate the training set against itself, and we'll use that as a starting point:
from datasets import load_dataset
from semhash import SemHash
dataset = load_dataset("ag_news")
train = dataset["train"]
test = dataset["test"]
# This will create an index on your training set. All records are stored in full.
semhash = SemHash.from_records(records=train, columns=["text"])
result = semhash.self_deduplicate(threshold=0.9)
Let's dive into what you can do with result. First, you can just get all the de-duplicated records:
result.deduplicated
These records are identical to the ones you put in, allowing you to use semhash inside other ML pipelines. semhash doesn't change your data, it just reduces its size.
You can easily see the percentage of duplicate records:
result.duplicate_ratio
or the percentage of exact duplicates:
result.exact_duplicate_ratio
You can also see which items are marked as duplicates, and why. Each duplicate document is stored together with the examples from the index that caused it to be marked as a duplicate. Exact duplicates are flagged as such. The following code example demonstrates basic usage.
for duplicated_record in result.duplicates:
    print(duplicated_record.record)
    # Exact duplicates need no further inspection.
    if duplicated_record.exact:
        print("Exact match")
        continue
    # Otherwise, show the indexed records that triggered the match.
    for index_duplicate in duplicated_record.duplicates:
        print(index_duplicate)
    print("-" * 25)
For ease of use, we also provide a helper function that shows the least similar records within your set of duplicates:
result.get_least_similar_from_duplicates(1)
If this record still makes sense as a duplicate of its matches, then your de-duplication strategy makes sense! If not, you can choose to re-threshold the result set. Setting a new, stricter threshold removes the weaker matches from the set of duplicates. This is shown below:
print(result.duplicate_ratio)
result.rethreshold(0.95)
print(result.duplicate_ratio)
Thus, a general strategy might be to start with a relatively low threshold and raise it until the results returned by result.get_least_similar_from_duplicates start to make sense. A threshold of 0.9 (which is the default) worked well in our experiments, but be sure to check your specific use case.
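The effect of re-thresholding on the duplicate ratio can be illustrated with a small toy model. The ToyDuplicate class and both functions below are hypothetical stand-ins written for this example; they are not semhash's internal data structures, and the ratio is assumed here to be duplicates divided by all records.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDuplicate:
    """Hypothetical stand-in for one flagged record and its scored matches."""
    record: str
    matches: list[tuple[str, float]] = field(default_factory=list)


def rethreshold(duplicates, deduplicated, threshold):
    """Keep only matches scoring at least `threshold`; release records left with none."""
    still_flagged = []
    released = list(deduplicated)
    for dup in duplicates:
        strong = [(text, score) for text, score in dup.matches if score >= threshold]
        if strong:
            still_flagged.append(ToyDuplicate(dup.record, strong))
        else:
            released.append(dup.record)  # no longer counted as a duplicate
    return still_flagged, released


def duplicate_ratio(duplicates, deduplicated):
    """Fraction of all records that are currently flagged as duplicates."""
    total = len(duplicates) + len(deduplicated)
    return len(duplicates) / total if total else 0.0


deduplicated = ["a", "b", "c"]
duplicates = [
    ToyDuplicate("a (copy)", [("a", 0.99)]),
    ToyDuplicate("b, roughly", [("b", 0.91)]),
]

before = duplicate_ratio(duplicates, deduplicated)
duplicates, deduplicated = rethreshold(duplicates, deduplicated, threshold=0.95)
after = duplicate_ratio(duplicates, deduplicated)
print(before, after)
```

Raising the threshold from 0.9 to 0.95 releases the weakly matched record back into the kept set, so the ratio drops. Inspecting which records get released at each step is a quick way to judge whether a threshold is too permissive.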
Multi-column data
semhash also supports multi-column datasets, allowing you to de-duplicate datasets that contain text in multiple columns. For example, in a Q&A dataset, you don't just want to de-duplicate similar questions or similar contexts; you want to count as duplicates only those items that are similar enough in both fields.
This is a difficult problem to solve, but semhash can handle it as well.
The following code snippet demonstrates how it works:
from datasets import load_dataset
from semhash import SemHash
dataset = load_dataset("rajpurkar/squad_v2")
train = dataset["train"]
# This will create an index on your training set. All records are stored in full.
semhash = SemHash.from_records(records=train, columns=["context", "question"])
result = semhash.self_deduplicate(threshold=0.9)
This calculates the similarity and only returns records where both fields are similar.
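One way to read the "both fields must be similar" rule is as a per-column check, sketched below. This is an assumption made for illustration: how semhash combines columns internally is not described here, and word-overlap similarity stands in for embedding similarity so the example runs without any model. The function names and records are hypothetical.

```python
def token_jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, used only as a stand-in for embedding similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_duplicate(rec_a: dict, rec_b: dict, columns: list[str], threshold: float) -> bool:
    """Treat two records as duplicates only if every listed column is similar enough."""
    return all(token_jaccard(rec_a[col], rec_b[col]) >= threshold for col in columns)


base = {
    "question": "who wrote hamlet",
    "context": "hamlet is a play by shakespeare",
}
similar_in_both = {
    "question": "who wrote hamlet",
    "context": "hamlet is a play by william shakespeare",
}
same_question_only = {
    "question": "who wrote hamlet",
    "context": "macbeth premiered in 1606",
}

columns = ["question", "context"]
both_match = is_duplicate(base, similar_in_both, columns, threshold=0.8)
question_only_match = is_duplicate(base, same_question_only, columns, threshold=0.8)
print(both_match, question_only_match)
```

The record that matches in both columns is flagged, while the one that shares only the question is not. This is the behavior you want for a QA dataset, where the same question asked about a different context is a distinct example.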