DOAJ Open Access 2024

Text Indexing for Faster Gapped Pattern Matching

Md Helal Hossen Daniel Gibney Sharma V. Thankachan

Abstrak

We revisit the following version of the Gapped String Indexing problem, where the goal is to preprocess a text <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>T</mi><mo>[</mo><mn>1</mn><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>n</mi><mo>]</mo></mrow></semantics></math></inline-formula> to enable efficient reporting of all <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mi>occ</mi></semantics></math></inline-formula> occurrences of a gapped pattern <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>P</mi><mo>=</mo><msub><mi>P</mi><mn>1</mn></msub><mrow><mo>[</mo><mi>α</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>β</mi><mo>]</mo></mrow><msub><mi>P</mi><mn>2</mn></msub></mrow></semantics></math></inline-formula> in <i>T</i>. An occurrence of <i>P</i> in <i>T</i> is defined as a pair <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mo>(</mo><mi>i</mi><mo>,</mo><mi>j</mi><mo>)</mo></mrow></semantics></math></inline-formula> where substrings <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>T</mi><mo>[</mo><mi>i</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>i</mi><mo>+</mo><mo>|</mo><msub><mi>P</mi><mn>1</mn></msub><mo>|</mo><mo>)</mo></mrow></semantics></math></inline-formula> and <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>T</mi><mo>[</mo><mi>j</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>j</mi><mo>+</mo><mo>|</mo><msub><mi>P</mi><mn>2</mn></msub><mo>|</mo><mo>)</mo></mrow></semantics></math></inline-formula> match <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>P</mi><mn>1</mn></msub></semantics></math></inline-formula> and <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>P</mi><mn>2</mn></msub></semantics></math></inline-formula>, respectively, with a gap <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>j</mi><mo>−</mo><mo>(</mo><mi>i</mi><mo>+</mo><mo>|</mo><msub><mi>P</mi><mn>1</mn></msub><mo>|</mo><mo>)</mo></mrow></semantics></math></inline-formula> lying within the interval <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mo>[</mo><mi>α</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>β</mi><mo>]</mo></mrow></semantics></math></inline-formula>. This problem has significant applications in computational biology and text mining. A hardness result on this problem suggests that any index with polylogarithmic query time must occupy near quadratic space. In a recent study [STACS 2024], Bille et al. presented a sub-quadratic space index using space <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mover accent="true"><mi mathvariant="script">O</mi><mo>˜</mo></mover><mrow><mo>(</mo><msup><mi>n</mi><mrow><mn>2</mn><mo>−</mo><mi>δ</mi><mo>/</mo><mn>3</mn></mrow></msup><mo>)</mo></mrow></mrow></semantics></math></inline-formula>, where <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mn>0</mn><mo>≤</mo><mi>δ</mi><mo>≤</mo><mn>1</mn></mrow></semantics></math></inline-formula> is a parameter fixed at the time of index construction. Its query time is <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mover accent="true"><mi mathvariant="script">O</mi><mo>˜</mo></mover><mrow><mo>(</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>1</mn></msub><mrow><mo>|</mo><mo>+</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>2</mn></msub><mrow><mo>|</mo><mo>+</mo><msup><mi>n</mi><mi>δ</mi></msup><mo>·</mo><mrow><mo>(</mo><mn>1</mn><mo>+</mo><mi>occ</mi><mo>)</mo></mrow><mo>)</mo></mrow></mrow></semantics></math></inline-formula>, which is sub-linear per occurrence when <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>δ</mi><mo><</mo><mn>1</mn></mrow></semantics></math></inline-formula>. We show how to achieve a gap-sensitive query time of <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mover accent="true"><mi mathvariant="script">O</mi><mo>˜</mo></mover><mrow><mo>(</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>1</mn></msub><mrow><mo>|</mo><mo>+</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>2</mn></msub><mrow><mo>|</mo><mo>+</mo><msup><mi>n</mi><mi>δ</mi></msup><mo>·</mo><mrow><mo>(</mo><mn>1</mn><mo>+</mo><msup><mi>occ</mi><mrow><mn>1</mn><mo>−</mo><mi>δ</mi></mrow></msup><mo>)</mo></mrow><mo>+</mo><msub><mo>∑</mo><mrow><mi>g</mi><mo>∈</mo><mo>[</mo><mi>α</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>β</mi><mo>]</mo></mrow></msub><msub><mi>occ</mi><mi>g</mi></msub><mo>·</mo><msup><mi>g</mi><mi>δ</mi></msup><mo>)</mo></mrow></mrow></semantics></math></inline-formula> using the same space, where <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>occ</mi><mi>g</mi></msub></semantics></math></inline-formula> denotes the number of occurrences with gap <i>g</i>. This is faster when there are many occurrences with small gaps.

Penulis (3)

M

Md Helal Hossen

D

Daniel Gibney

S

Sharma V. Thankachan

Format Sitasi

Hossen, M.H., Gibney, D., Thankachan, S.V. (2024). Text Indexing for Faster Gapped Pattern Matching. https://doi.org/10.3390/a17120537

Akses Cepat

PDF tidak tersedia langsung

Cek di sumber asli →
Lihat di Sumber doi.org/10.3390/a17120537
Informasi Jurnal
Tahun Terbit
2024
Sumber Database
DOAJ
DOI
10.3390/a17120537
Akses
Open Access ✓