Finding Median and Center Strings for a Probability Distribution on a Set of Strings Under Levenshtein Distance Based on Integer Linear Programming

  • Conference paper
Biomedical Engineering Systems and Technologies (BIOSTEC 2016)

Abstract

For a data set composed of numbers or numerical vectors, the mean is the most fundamental measure of the center of the data. For a data set of strings, however, a mean cannot be defined, so median and center strings are frequently used as measures of the center instead. Unlike calculating the mean of numerical data, constructing median and center strings of string data is not easy, and no algorithm is known that is guaranteed to construct exact solutions of center strings. In this study, we first generalize the definitions of median and center strings of string data to those of a probability distribution on the set of all strings composed of letters in a given alphabet. This generalization corresponds to generalizing the mean of numerical data to the expected value of a probability distribution on a set of numbers or numerical vectors. Next, we develop methods for constructing exact solutions of median and center strings for a probability distribution on a set of strings by applying integer linear programming. These methods are made faster by using the triangle inequality on the Levenshtein distance in the case where the set of strings is a metric space under the Levenshtein distance. Furthermore, we develop methods for constructing approximate solutions of median and center strings very rapidly when the probability of a subset composed of similar strings is close to one. Lastly, we perform simulation experiments to examine the usefulness of the proposed methods in practical applications.
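To make the two notions in the abstract concrete, the following is a minimal Python sketch and not the integer linear programming method of the paper. It assumes that a median string minimizes the expected Levenshtein distance under the distribution and that a center string minimizes the maximum distance to any string of positive probability; both readings are assumptions based on the usual definitions. The search is restricted to a finite candidate list (a "set median" and "set center"), whereas the paper searches over all strings on the alphabet. The function names `levenshtein`, `set_median`, and `set_center` are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance via the Wagner-Fischer dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def set_median(dist: Dict[str, float],
               candidates: Iterable[str]) -> Tuple[str, float]:
    """Candidate with the smallest expected distance under dist (assumed definition)."""
    scored = ((s, sum(p * levenshtein(s, t) for t, p in dist.items()))
              for s in candidates)
    return min(scored, key=lambda pair: pair[1])


def set_center(dist: Dict[str, float],
               candidates: Iterable[str]) -> Tuple[str, int]:
    """Candidate with the smallest maximum distance to the support of dist (assumed definition)."""
    support = [t for t, p in dist.items() if p > 0]
    scored = ((s, max(levenshtein(s, t) for t in support))
              for s in candidates)
    return min(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Illustrative distribution; the strings and probabilities are made up.
    example = {"kitten": 0.2, "sitten": 0.5, "sitting": 0.3}
    print(set_median(example, example.keys()))
    print(set_center(example, example.keys()))
```

Restricting candidates to the support keeps the sketch short, but the true median or center string may lie outside the support, which is why an exact method over all strings is harder.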

Acknowledgements

This work was partially supported by Grants-in-Aid #24500361 and #26610037 from MEXT, Japan.

Corresponding author

Correspondence to Morihiro Hayashida.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hayashida, M., Koyano, H. (2017). Finding Median and Center Strings for a Probability Distribution on a Set of Strings Under Levenshtein Distance Based on Integer Linear Programming. In: Fred, A., Gamboa, H. (eds) Biomedical Engineering Systems and Technologies. BIOSTEC 2016. Communications in Computer and Information Science, vol 690. Springer, Cham. https://doi.org/10.1007/978-3-319-54717-6_7

  • DOI: https://doi.org/10.1007/978-3-319-54717-6_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54716-9

  • Online ISBN: 978-3-319-54717-6

  • eBook Packages: Computer Science, Computer Science (R0)
