
Efficient chunking of complex text structures in documents with 50 lines of regular expressions

Han Xiao, CEO of Jina AI, shared a code snippet on GitHub containing the core chunking implementation used in the Jina tokenizer. The regular expression takes just over 50 lines, yet it chunks text of widely varying structure efficiently, and it holds up surprisingly well in practice.

 



Try it online: https://jina.ai/tokenizer/

 

// Updated: Aug. 15, 2024
// Run: node testRegex.js testText.txt
// Used in https://jina.ai/tokenizer
const fs = require('fs');
const util = require('util');
// Define variables for magic numbers
const MAX_HEADING_LENGTH = 7;
const MAX_HEADING_CONTENT_LENGTH = 200;
const MAX_HEADING_UNDERLINE_LENGTH = 200;
const MAX_HTML_HEADING_ATTRIBUTES_LENGTH = 100;
const MAX_LIST_ITEM_LENGTH = 200;
const MAX_NESTED_LIST_ITEMS = 6;
const MAX_LIST_INDENT_SPACES = 7;
const MAX_BLOCKQUOTE_LINE_LENGTH = 200;
const MAX_BLOCKQUOTE_LINES = 15;
const MAX_CODE_BLOCK_LENGTH = 1500;
const MAX_CODE_LANGUAGE_LENGTH = 20;
const MAX_INDENTED_CODE_LINES = 20;
const MAX_TABLE_CELL_LENGTH = 200;
const MAX_TABLE_ROWS = 20;
const MAX_HTML_TABLE_LENGTH = 2000;
const MIN_HORIZONTAL_RULE_LENGTH = 3;
const MAX_SENTENCE_LENGTH = 400;
const MAX_QUOTED_TEXT_LENGTH = 300;
const MAX_PARENTHETICAL_CONTENT_LENGTH = 200;
const MAX_NESTED_PARENTHESES = 5;
// Referenced by the pattern below, but its definition is missing from this listing; 100 is an assumed value
const MAX_MATH_INLINE_LENGTH = 100;
const MAX_MATH_BLOCK_LENGTH = 500;
const MAX_PARAGRAPH_LENGTH = 1000;
const MAX_STANDALONE_LINE_LENGTH = 800;
const MAX_HTML_TAG_ATTRIBUTES_LENGTH = 100;
const MAX_HTML_TAG_CONTENT_LENGTH = 1000;
const LOOKAHEAD_RANGE = 100; // Number of characters to look ahead for a sentence boundary
// Define the regex pattern. Its alternatives cover, in order:
// Headings
// Citations
// List items
// Block quotes
// Code blocks
// Tables
// Horizontal rules
// Standalone lines or phrases
// Sentences or phrases
// Quoted text, parenthetical phrases, or bracketed content
// Paragraphs
// HTML-like tags and their content
// LaTeX-style math expressions
// Fallback for any remaining content
const chunkRegex = new RegExp(
"(" +
// 1. Headings (Setext-style, Markdown, and HTML-style, with length constraints)
`(?:^(?:[#*=-]{1,${MAX_HEADING_LENGTH}}|\\w[^\\r\\n]{0,${MAX_HEADING_CONTENT_LENGTH}}\\r?\\n[-=]{2,${MAX_HEADING_UNDERLINE_LENGTH}}|<h[1-6][^>]{0,${MAX_HTML_HEADING_ATTRIBUTES_LENGTH}}>)[^\\r\\n]{1,${MAX_HEADING_CONTENT_LENGTH}}(?:</h[1-6]>)?(?:\\r?\\n|$))` +
"|" +
// New pattern for citations
`(?:\\[[0-9]+\\][^\\r\\n]{1,${MAX_STANDALONE_LINE_LENGTH}})` +
"|" +
// 2. List items (bulleted, numbered, lettered, or task lists, including nested, up to three levels, with length constraints)
`(?:(?:^|\\r?\\n)[ \\t]{0,3}(?:[-*+•]|\\d{1,3}\\.\\w\\.|\\[[ xX]\\])[ \\t]+(?:(?:\\b[^\\r\\n]{1,${MAX_LIST_ITEM_LENGTH}}\\b(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))|(?:\\b[^\\r\\n]{1,${MAX_LIST_ITEM_LENGTH}}\\b(?=[\\r\\n]|$))|(?:\\b[^\\r\\n]{1,${MAX_LIST_ITEM_LENGTH}}\\b(?=[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?:.{1,${LOOKAHEAD_RANGE}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))?))` +
`(?:(?:\\r?\\n[ \\t]{2,5}(?:[-*+•]|\\d{1,3}\\.\\w\\.|\\[[ xX]\\])[ \\t]+(?:(?:\\b[^\\r\\n]{1,${MAX_LIST_ITEM_LENGTH}}\\b(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))|(?:\\b[^\\r\\n]{1,${MAX_LIST_ITEM_LENGTH}}\\b(?=[\\r\\n]|$))|(?:\\b[^\\r\\n]{1,${MAX_LIST_ITEM_LENGTH}}\\b(?=[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?:.{1,${LOOKAHEAD_RANGE}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))?)))` +
`{0,${MAX_NESTED_LIST_ITEMS}}(?:\\r?\\n[ \\t]{4,${MAX_LIST_INDENT_SPACES}}(?:[-*+•]|\\d{1,3}\\.\\w\\.|\\[[ xX]\\])[ \\t]+(?:(?:\\b[^\\r\\n]{1,${MAX_LIST_ITEM_LENGTH}}\\b(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))|(?:\\b[^\\r\\n]{1,${MAX_LIST_ITEM_LENGTH}}\\b(?=[\\r\\n]|$))|(?:\\b[^\\r\\n]{1,${MAX_LIST_ITEM_LENGTH}}\\b(?=[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?:.{1,${LOOKAHEAD_RANGE}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))?)))` +
`{0,${MAX_NESTED_LIST_ITEMS}})?)` +
"|" +
// 3. Block quotes (including nested quotes and citations, up to three levels, with length constraints)
`(?:(?:^>(?:>|\\s{2,}){0,2}(?:(?:\\b[^\\r\\n]{0,${MAX_BLOCKQUOTE_LINE_LENGTH}}\\b(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))|(?:\\b[^\\r\\n]{0,${MAX_BLOCKQUOTE_LINE_LENGTH}}\\b(?=[\\r\\n]|$))|(?:\\b[^\\r\\n]{0,${MAX_BLOCKQUOTE_LINE_LENGTH}}\\b(?=[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?:.{1,${LOOKAHEAD_RANGE}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))?))\\r?\\n?){1,${MAX_BLOCKQUOTE_LINES}})` +
"|" +
// 4. Code blocks (fenced, indented, or HTML pre/code tags, with length constraints)
`(?:(?:^|\\r?\\n)(?:\`\`\`|~~~)(?:\\w{0,${MAX_CODE_LANGUAGE_LENGTH}})?\\r?\\n[\\s\\S]{0,${MAX_CODE_BLOCK_LENGTH}}?(?:\`\`\`|~~~)\\r?\\n?` +
`|(?:(?:^|\\r?\\n)(?: {4}|\\t)[^\\r\\n]{0,${MAX_LIST_ITEM_LENGTH}}(?:\\r?\\n(?: {4}|\\t)[^\\r\\n]{0,${MAX_LIST_ITEM_LENGTH}}){0,${MAX_INDENTED_CODE_LINES}}\\r?\\n?)` +
`|(?:<pre>(?:<code>)?[\\s\\S]{0,${MAX_CODE_BLOCK_LENGTH}}?(?:</code>)?</pre>))` +
"|" +
// 5. Tables (Markdown, grid tables, and HTML tables, with length constraints)
`(?:(?:^|\\r?\\n)(?:\\|[^\\r\\n]{0,${MAX_TABLE_CELL_LENGTH}}\\|(?:\\r?\\n\\|[-:]{1,${MAX_TABLE_CELL_LENGTH}}\\|){0,1}(?:\\r?\\n\\|[^\\r\\n]{0,${MAX_TABLE_CELL_LENGTH}}\\|){0,${MAX_TABLE_ROWS}}` +
`|<table>[\\s\\S]{0,${MAX_HTML_TABLE_LENGTH}}?</table>))` +
"|" +
// 6. Horizontal rules (Markdown and HTML hr tags)
`(?:^(?:[-*_]){${MIN_HORIZONTAL_RULE_LENGTH},}\\s*$|<hr\\s*/?>)` +
"|" +
// 10. Standalone lines or phrases (including single-line blocks and HTML elements, with length constraints)
`(?:^(?:<[a-zA-Z][^>]{0,${MAX_HTML_TAG_ATTRIBUTES_LENGTH}}>)?(?:(?:[^\\r\\n]{1,${MAX_STANDALONE_LINE_LENGTH}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))|(?:[^\\r\\n]{1,${MAX_STANDALONE_LINE_LENGTH}}(?=[\\r\\n]|$))|(?:[^\\r\\n]{1,${MAX_STANDALONE_LINE_LENGTH}}(?=[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?:.{1,${LOOKAHEAD_RANGE}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))?))(?:</[a-zA-Z]+>)?(?:\\r?\\n|$))` +
"|" +
// 7. Sentences or phrases ending with punctuation (including ellipsis and Unicode punctuation)
`(?:(?:[^\\r\\n]{1,${MAX_SENTENCE_LENGTH}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))|(?:[^\\r\\n]{1,${MAX_SENTENCE_LENGTH}}(?=[\\r\\n]|$))|(?:[^\\r\\n]{1,${MAX_SENTENCE_LENGTH}}(?=[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?:.{1,${LOOKAHEAD_RANGE}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))?))` +
"|" +
// 8. Quoted text, parenthetical phrases, or bracketed content (with length constraints)
"(?:" +
`(?<!\\w)\"\"\"[^\"]{0,${MAX_QUOTED_TEXT_LENGTH}}\"\"\"(?!\\w)` +
`|(?<!\\w)(?:['\"\`'"])[^\\r\\n]{0,${MAX_QUOTED_TEXT_LENGTH}}\\1(?!\\w)` +
`|\\([^\\r\\n()]{0,${MAX_PARENTHETICAL_CONTENT_LENGTH}}(?:\\([^\\r\\n()]{0,${MAX_PARENTHETICAL_CONTENT_LENGTH}}\\)[^\\r\\n()]{0,${MAX_PARENTHETICAL_CONTENT_LENGTH}}){0,${MAX_NESTED_PARENTHESES}}\\)` +
`|\\[[^\\r\\n\\[\\]]{0,${MAX_PARENTHETICAL_CONTENT_LENGTH}}(?:\\[[^\\r\\n\\[\\]]{0,${MAX_PARENTHETICAL_CONTENT_LENGTH}}\\][^\\r\\n\\[\\]]{0,${MAX_PARENTHETICAL_CONTENT_LENGTH}}){0,${MAX_NESTED_PARENTHESES}}\\]` +
`|\\$[^\\r\\n$]{0,${MAX_MATH_INLINE_LENGTH}}\\$` +
`|\`[^\`\\r\\n]{0,${MAX_MATH_INLINE_LENGTH}}\`` +
")" +
"|" +
// 9. Paragraphs (with length constraints)
`(?:(?:^|\\r?\\n\\r?\\n)(?:<p>)?(?:(?:[^\\r\\n]{1,${MAX_PARAGRAPH_LENGTH}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))|(?:[^\\r\\n]{1,${MAX_PARAGRAPH_LENGTH}}(?=[\\r\\n]|$))|(?:[^\\r\\n]{1,${MAX_PARAGRAPH_LENGTH}}(?=[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?:.{1,${LOOKAHEAD_RANGE}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))?))(?:</p>)?(?=\\r?\\n\\r?\\n|$))` +
"|" +
// 11. HTML-like tags and their content (including self-closing tags and attributes, with length constraints)
`(?:<[a-zA-Z][^>]{0,${MAX_HTML_TAG_ATTRIBUTES_LENGTH}}(?:>[\\s\\S]{0,${MAX_HTML_TAG_CONTENT_LENGTH}}?</[a-zA-Z]+>|\\s*/>))` +
"|" +
// 12. LaTeX-style math expressions (inline and block, with length constraints)
`(?:(?:\\$\\$[\\s\\S]{0,${MAX_MATH_BLOCK_LENGTH}}?\\$\\$)|(?:\\$[^\\$\\r\\n]{0,${MAX_MATH_INLINE_LENGTH}}\\$))` +
"|" +
// 14. Fallback for any remaining content (with length constraints)
`(?:(?:[^\\r\\n]{1,${MAX_STANDALONE_LINE_LENGTH}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))|(?:[^\\r\\n]{1,${MAX_STANDALONE_LINE_LENGTH}}(?=[\\r\\n]|$))|(?:[^\\r\\n]{1,${MAX_STANDALONE_LINE_LENGTH}}(?=[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?:.{1,${LOOKAHEAD_RANGE}}(?:[.!?…]|\\.{3}|[\\u2026\\u2047-\\u2049]|[\\p{Emoji_Presentation}\\p{Extended_Pictographic}])(?=\\s|$))?))` +
")",
"gmu"
);
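// Read the test text from the file given as the first command-line argument
const testText = fs.readFileSync(process.argv[2], 'utf8');
// Function to format bytes to a human-readable string
function formatBytes(bytes) {
    if (bytes < 1024) return bytes + " bytes";
    else if (bytes < 1048576) return (bytes / 1024).toFixed(2) + " KB";
    else if (bytes < 1073741824) return (bytes / 1048576).toFixed(2) + " MB";
    else return (bytes / 1073741824).toFixed(2) + " GB";
}
// NOTE: the matching and measurement lines are missing from this listing. The
// block down to the chunk printout is a reconstruction inferred from the
// variables used further below (matches, executionTime, memoryUsed).
const startTime = process.hrtime.bigint();
const startMemory = process.memoryUsage().heapUsed;
// Apply the regex
const matches = testText.match(chunkRegex);
const executionTime = Number(process.hrtime.bigint() - startTime) / 1e9; // seconds
const memoryUsed = process.memoryUsage().heapUsed - startMemory;
// Output results
console.log(`Number of chunks: ${matches ? matches.length : 0}`);
console.log(`Execution time: ${executionTime.toFixed(3)} s`);
console.log(`Memory used: ${formatBytes(memoryUsed)}`);
// Output the first few chunks (the limit of 10 is an assumption)
if (matches) {
    matches.slice(0, 10).forEach((match) => {
        console.log(util.inspect(match, {maxStringLength: 50}));
    });
} else {
    console.log('No chunks found.');
}
// Output regex flags
console.log(`\nRegex flags: ${chunkRegex.flags}`);
// Check for potential issues
if (executionTime > 5) {
    console.warn('\nWarning: Execution time exceeded 5 seconds. The regex might be too complex or the input too large.');
}
if (memoryUsed > 100 * 1024 * 1024) {
    console.warn('\nWarning: Memory usage exceeded 100 MB. Consider processing the input in smaller chunks.');
}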

 

The regular expression in this code accounts for a wide range of text structures: headings, list items, block quotes, code blocks, tables, horizontal rules, standalone lines or phrases, sentences or phrases ending in punctuation, quoted and parenthesized text, paragraphs, HTML tag content, LaTeX math expressions, and more. Regular expressions do not understand the context or semantics of text, but carefully designed patterns let this one approximate structure-aware chunking.
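The idea of ordered, length-bounded alternatives can be sketched at a much smaller scale. The example below is only an illustration and is not the Jina pattern: `MAX_LINE`, `miniChunkRegex`, and `miniChunk` are hypothetical names, and the pattern covers just headings, bulleted list items, sentences, and a fallback.

```javascript
// Hypothetical, scaled-down limit (not a value from the script above).
const MAX_LINE = 80;

// Alternatives are tried left to right at each position, so specific
// structures (headings, list items) come before the generic fallback.
const miniChunkRegex = new RegExp(
  "(" +
    // Markdown heading: 1-6 '#' characters, a space, then the rest of the line
    `(?:^#{1,6} [^\\r\\n]{1,${MAX_LINE}}(?:\\r?\\n|$))` +
    "|" +
    // Bulleted list item
    `(?:^[-*+] [^\\r\\n]{1,${MAX_LINE}}(?:\\r?\\n|$))` +
    "|" +
    // Sentence ending in punctuation that is followed by whitespace or line end
    `(?:[^\\r\\n]{1,${MAX_LINE}}?[.!?](?=\\s|$))` +
    "|" +
    // Fallback: whatever is left on the line
    `(?:[^\\r\\n]{1,${MAX_LINE}})` +
  ")",
  "gm"
);

// Return the trimmed, non-empty chunks found in the text.
function miniChunk(text) {
  return (text.match(miniChunkRegex) || [])
    .map((chunk) => chunk.trim())
    .filter(Boolean);
}

const sample =
  "# Title\n- first item\n- second item\nOne sentence. Another one!\nno punctuation here";
console.log(miniChunk(sample));
```

Each line of the sample is claimed by the first alternative that fits it, which is why the order of the alternatives matters as much as their content.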

The regular expression relies on backtracking, which is essential for more meaningful segmentation: a length-bounded greedy match backs off to the nearest sentence-ending punctuation, so a chunk does not break in the middle of a sentence. For deeply nested lists, block quotes, or parentheses, however, backtracking becomes expensive. To handle these cases, the patterns can be refined to support multiple levels of nesting while capping the depth at a practical level, such as three, which keeps performance predictable and avoids catastrophic backtracking.
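Both points can be seen in isolation with two small patterns. This is a sketch under assumptions rather than an excerpt from the script: `MAX_LEN`, `MAX_INNER_GROUPS`, `sentenceRegex`, and `parenRegex` are hypothetical names, and the limits are chosen only to keep the example short.

```javascript
// Hypothetical limits chosen for illustration.
const MAX_LEN = 40;
const MAX_INNER_GROUPS = 2;

// Greedy and bounded: consume up to MAX_LEN characters, then backtrack to the
// last sentence-ending punctuation followed by whitespace or end of input.
const sentenceRegex = new RegExp(`[^\\r\\n]{1,${MAX_LEN}}[.!?](?=\\s|$)`, "g");

// Bounded nesting: an outer pair of parentheses may contain at most
// MAX_INNER_GROUPS inner pairs, and the inner pairs may not nest further.
const parenRegex = new RegExp(
  `\\([^()\\r\\n]{0,${MAX_LEN}}` +
    `(?:\\([^()\\r\\n]{0,${MAX_LEN}}\\)[^()\\r\\n]{0,${MAX_LEN}}){0,${MAX_INNER_GROUPS}}` +
  `\\)`,
  "g"
);

const prose =
  "Short one. A second sentence here. And a third sentence that is long.";
const sentences = prose.match(sentenceRegex) || [];
console.log(sentences);

const nestedText = "f(a (b) c) and (x (y (z)))";
const parenGroups = nestedText.match(parenRegex) || [];
console.log(parenGroups);
```

The first pattern packs as many whole sentences as fit within the length limit into each chunk. The second never tries to follow a third level of parentheses, so its cost stays bounded even on deeply nested input, at the price of matching only the inner part of such a structure.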

This code is not complete, but refining the details along the same lines leaves room for further improvement. Jina also provides a free cloud service where developers can try the tokenizer API.

 
