Text Indexing for Faster Gapped Pattern Matching
Abstrak
We revisit the following version of the Gapped String Indexing problem, where the goal is to preprocess a text <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>T</mi><mo>[</mo><mn>1</mn><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>n</mi><mo>]</mo></mrow></semantics></math></inline-formula> to enable efficient reporting of all <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mi>occ</mi></semantics></math></inline-formula> occurrences of a gapped pattern <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>P</mi><mo>=</mo><msub><mi>P</mi><mn>1</mn></msub><mrow><mo>[</mo><mi>α</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>β</mi><mo>]</mo></mrow><msub><mi>P</mi><mn>2</mn></msub></mrow></semantics></math></inline-formula> in <i>T</i>. An occurrence of <i>P</i> in <i>T</i> is defined as a pair <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mo>(</mo><mi>i</mi><mo>,</mo><mi>j</mi><mo>)</mo></mrow></semantics></math></inline-formula> where substrings <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>T</mi><mo>[</mo><mi>i</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>i</mi><mo>+</mo><mo>|</mo><msub><mi>P</mi><mn>1</mn></msub><mo>|</mo><mo>)</mo></mrow></semantics></math></inline-formula> and <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>T</mi><mo>[</mo><mi>j</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>j</mi><mo>+</mo><mo>|</mo><msub><mi>P</mi><mn>2</mn></msub><mo>|</mo><mo>)</mo></mrow></semantics></math></inline-formula> match <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>P</mi><mn>1</mn></msub></semantics></math></inline-formula> and <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>P</mi><mn>2</mn></msub></semantics></math></inline-formula>, respectively, with a gap <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>j</mi><mo>−</mo><mo>(</mo><mi>i</mi><mo>+</mo><mo>|</mo><msub><mi>P</mi><mn>1</mn></msub><mo>|</mo><mo>)</mo></mrow></semantics></math></inline-formula> lying within the interval <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mo>[</mo><mi>α</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>β</mi><mo>]</mo></mrow></semantics></math></inline-formula>. This problem has significant applications in computational biology and text mining. A hardness result on this problem suggests that any index with polylogarithmic query time must occupy near quadratic space. In a recent study [STACS 2024], Bille et al. presented a sub-quadratic space index using space <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mover accent="true"><mi mathvariant="script">O</mi><mo>˜</mo></mover><mrow><mo>(</mo><msup><mi>n</mi><mrow><mn>2</mn><mo>−</mo><mi>δ</mi><mo>/</mo><mn>3</mn></mrow></msup><mo>)</mo></mrow></mrow></semantics></math></inline-formula>, where <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mn>0</mn><mo>≤</mo><mi>δ</mi><mo>≤</mo><mn>1</mn></mrow></semantics></math></inline-formula> is a parameter fixed at the time of index construction. Its query time is <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mover accent="true"><mi mathvariant="script">O</mi><mo>˜</mo></mover><mrow><mo>(</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>1</mn></msub><mrow><mo>|</mo><mo>+</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>2</mn></msub><mrow><mo>|</mo><mo>+</mo><msup><mi>n</mi><mi>δ</mi></msup><mo>·</mo><mrow><mo>(</mo><mn>1</mn><mo>+</mo><mi>occ</mi><mo>)</mo></mrow><mo>)</mo></mrow></mrow></semantics></math></inline-formula>, which is sub-linear per occurrence when <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mi>δ</mi><mo><</mo><mn>1</mn></mrow></semantics></math></inline-formula>. We show how to achieve a gap-sensitive query time of <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><mrow><mover accent="true"><mi mathvariant="script">O</mi><mo>˜</mo></mover><mrow><mo>(</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>1</mn></msub><mrow><mo>|</mo><mo>+</mo><mo>|</mo></mrow><msub><mi>P</mi><mn>2</mn></msub><mrow><mo>|</mo><mo>+</mo><msup><mi>n</mi><mi>δ</mi></msup><mo>·</mo><mrow><mo>(</mo><mn>1</mn><mo>+</mo><msup><mi>occ</mi><mrow><mn>1</mn><mo>−</mo><mi>δ</mi></mrow></msup><mo>)</mo></mrow><mo>+</mo><msub><mo>∑</mo><mrow><mi>g</mi><mo>∈</mo><mo>[</mo><mi>α</mi><mrow><mo>.</mo><mspace width="0.166667em"></mspace><mo>.</mo></mrow><mi>β</mi><mo>]</mo></mrow></msub><msub><mi>occ</mi><mi>g</mi></msub><mo>·</mo><msup><mi>g</mi><mi>δ</mi></msup><mo>)</mo></mrow></mrow></semantics></math></inline-formula> using the same space, where <inline-formula><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><semantics><msub><mi>occ</mi><mi>g</mi></msub></semantics></math></inline-formula> denotes the number of occurrences with gap <i>g</i>. This is faster when there are many occurrences with small gaps.
Topik & Kata Kunci
Penulis (3)
Md Helal Hossen
Daniel Gibney
Sharma V. Thankachan
Akses Cepat
- Tahun Terbit
- 2024
- Sumber Database
- DOAJ
- DOI
- 10.3390/a17120537
- Akses
- Open Access ✓