Text Indexing for Faster Gapped Pattern Matching

Abstrak

We revisit the following version of the Gapped String Indexing problem, where the goal is to preprocess a text <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>T</mi><mo>[</mo><mn>1</mn><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>n</mi><mo>]</mo></mrow></semantics></math></inline-formula> to enable efficient reporting of all <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mi>occ</mi></semantics></math></inline-formula> occurrences of a gapped pattern <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>P</mi><mo>=</mo><msub><mi>P</mi><mn>1</mn></msub><mrow><mo>[</mo><mi>α</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>β</mi><mo>]</mo></mrow><msub><mi>P</mi><mn>2</mn></msub></mrow></semantics></math></inline-formula> in T. An occurrence of P in T is defined as a pair <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mo>(</mo><mi>i</mi><mo>,</mo><mi>j</mi><mo>)</mo></mrow></semantics></math></inline-formula> where substrings <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>T</mi><mo>[</mo><mi>i</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>i</mi><mo>+</mo><mo>|</mo><msub><mi>P</mi><mn>1</mn></msub><mo>|</mo><mo>)</mo></mrow></semantics></math></inline-formula> and <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>T</mi><mo>[</mo><mi>j</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>j</mi><mo>+</mo><mo>|</mo><msub><mi>P</mi><mn>2</mn></msub><mo>|</mo><mo>)</mo></mrow></semantics></math></inline-formula> match <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>P</mi><mn>1</mn></msub></semantics></math></inline-formula> and <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>P</mi><mn>2</mn></msub></semantics></math></inline-formula>, respectively, with a gap <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>j</mi><mo>−</mo><mo>(</mo><mi>i</mi><mo>+</mo><mo>|</mo><msub><mi>P</mi><mn>1</mn></msub><mo>|</mo><mo>)</mo></mrow></semantics></math></inline-formula> lying within the interval <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mo>[</mo><mi>α</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>β</mi><mo>]</mo></mrow></semantics></math></inline-formula>. This problem has significant applications in computational biology and text mining. A hardness result on this problem suggests that any index with polylogarithmic query time must occupy near quadratic space. In a recent study [STACS 2024], Bille et al. presented a sub-quadratic space index using space <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mover accent="true"><mi mathvariant="script">O</mi><mo>˜</mo></mover><mrow><mo>(</mo><msup><mi>n</mi><mrow><mn>2</mn><mo>−</mo><mi>δ</mi><mo>/</mo><mn>3</mn></mrow></msup><mo>)</mo></mrow></mrow></semantics></math></inline-formula>, where <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mn>0</mn><mo>≤</mo><mi>δ</mi><mo>≤</mo><mn>1</mn></mrow></semantics></math></inline-formula> is a parameter fixed at the time of index construction. Its query time is <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mover accent="true"><mi mathvariant="script">O</mi><mo>˜</mo></mover><mrow><mo>(</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>1</mn></msub><mrow><mo>|</mo><mo>+</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>2</mn></msub><mrow><mo>|</mo><mo>+</mo><msup><mi>n</mi><mi>δ</mi></msup><mo>·</mo><mrow><mo>(</mo><mn>1</mn><mo>+</mo><mi>occ</mi><mo>)</mo></mrow><mo>)</mo></mrow></mrow></semantics></math></inline-formula>, which is sub-linear per occurrence when <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>δ</mi><mo><</mo><mn>1</mn></mrow></semantics></math></inline-formula>. We show how to achieve a gap-sensitive query time of <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mover accent="true"><mi mathvariant="script">O</mi><mo>˜</mo></mover><mrow><mo>(</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>1</mn></msub><mrow><mo>|</mo><mo>+</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>2</mn></msub><mrow><mo>|</mo><mo>+</mo><msup><mi>n</mi><mi>δ</mi></msup><mo>·</mo><mrow><mo>(</mo><mn>1</mn><mo>+</mo><msup><mi>occ</mi><mrow><mn>1</mn><mo>−</mo><mi>δ</mi></mrow></msup><mo>)</mo></mrow><mo>+</mo><msub><mo>∑</mo><mrow><mi>g</mi><mo>∈</mo><mo>[</mo><mi>α</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>β</mi><mo>]</mo></mrow></msub><msub><mi>occ</mi><mi>g</mi></msub><mo>·</mo><msup><mi>g</mi><mi>δ</mi></msup><mo>)</mo></mrow></mrow></semantics></math></inline-formula> using the same space, where <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>occ</mi><mi>g</mi></msub></semantics></math></inline-formula> denotes the number of occurrences with gap g. This is faster when there are many occurrences with small gaps.

Topik & Kata Kunci

Industrial engineering. Management engineering Electronic computers. Computer science

Penulis (3)

M

Md Helal Hossen

D

Daniel Gibney

S

Sharma V. Thankachan

Format Sitasi

APA MLA BibTeX

Hossen, M.H., Gibney, D., Thankachan, S.V. (2024). Text Indexing for Faster Gapped Pattern Matching. https://doi.org/10.3390/a17120537

Text Indexing for Faster Gapped Pattern Matching

Abstrak

Topik & Kata Kunci

Penulis (3)

Format Sitasi

Akses Cepat