Abstract
We propose a method for estimating the hidden units of numbers written in tables. We focus on Wikipedia tables and propose an algorithm that estimates which units are appropriate for a given cell that contains a number but no unit words. We estimate such hidden units using surrounding contexts, such as a cell in the first row. To improve performance, we propose the table topic model, which models tables and their surrounding sentences simultaneously.
Notes
1. The total number of tables found in the corpus was 255,039.
2. We observed that 39.1 % of randomly selected number cells were number-only cells (i.e., cells without any unit).
3. Note that in this paper each \(x_i\) is assumed to be a vector whose value is 1 in the i-th dimension, where i is the ID of a context word for the cell.
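The vector representation described in this note could be sketched as follows. This is a minimal illustration, not the paper's implementation; the vocabulary and context words shown are hypothetical:

```python
def one_hot(word_id, dim):
    """Return a vector with 1 in the word_id-th dimension and 0 elsewhere."""
    v = [0.0] * dim
    v[word_id] = 1.0
    return v

# Hypothetical context-word vocabulary mapping words to IDs.
vocab = {"price": 0, "year": 1, "yen": 2}

# Hypothetical context words observed around one number-only cell.
context_words = ["price", "yen"]

# One vector x_i per context word of the cell.
xs = [one_hot(vocab[w], len(vocab)) for w in context_words]
```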
4. If we see the number “1987”, we think of it as a number that indicates a year.
5. For example, if the unit word is “yen”, the surrounding words are likely to contain the word “price”.
6. We observed in preliminary experiments that using all “same row” cells worsened the accuracy, so we do not use those cells.
7. In our data set, 266 (93.7 %) out of 284 tables (i.e., the tables that contain one or more hand-annotated cells) were row-wise.
8. It is inspired by Pólya tree models for modeling continuous values.
9. We also use some additional symbols, such as those for signs, but omit them here for the sake of simplicity.
10. We set \(N=2\) currently.
11. We use some rules to parse the number string, so different expressions such as “95,300” can also be handled.
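Taken together, the digit handling in the last few notes might be sketched as below: normalize the number string and keep its first \(N\) leading digits together with the digit count. The normalization rule shown (dropping all non-digit characters) is an illustrative assumption, not necessarily the paper's exact rule set:

```python
import re

def leading_digits(s, n=2):
    """Normalize a number string and return its first n digits plus the digit count."""
    digits = re.sub(r"[^0-9]", "", s)  # drop commas, signs, and other non-digit symbols
    if not digits:
        return None
    return digits[:n], len(digits)

# Different surface expressions map to the same representation.
assert leading_digits("95,300") == leading_digits("95300")
```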
12. We divided the corpus in such a way that cells from the same table do not appear in both the training and test subsets. The accuracy is calculated by summing up the correct/incorrect predictions over all cells, i.e., the accuracy is micro-averaged.
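A minimal sketch of this evaluation setup, assuming each cell carries a hypothetical `table_id` field: whole tables are assigned to folds so that no table is shared between subsets, and accuracy is micro-averaged over cells:

```python
from collections import defaultdict

def table_wise_folds(cells, k):
    """Assign whole tables to k folds so cells of one table never cross subsets."""
    by_table = defaultdict(list)
    for cell in cells:
        by_table[cell["table_id"]].append(cell)
    folds = [[] for _ in range(k)]
    for i, (_, table_cells) in enumerate(sorted(by_table.items())):
        folds[i % k].extend(table_cells)
    return folds

def micro_accuracy(predictions, golds):
    """Sum correct predictions over all cells, then divide by the cell count."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)
```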
13. Each Gibbs sampling run performed 500 iterations. The distribution of the sampled topic IDs in the final 200 iterations was used as the input features for the logistic regression (i.e., for the column of each cell in the test data, we added each observed topic ID as a feature with its relative frequency as its weight).
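The feature construction in this note could be sketched as follows: count the topic IDs sampled over the final iterations and turn them into relative-frequency-weighted features. The feature-name format is an assumption for illustration:

```python
from collections import Counter

def topic_features(sampled_topic_ids):
    """Map each observed topic ID to its relative frequency in the samples."""
    counts = Counter(sampled_topic_ids)
    total = len(sampled_topic_ids)
    return {f"topic={t}": c / total for t, c in counts.items()}

# e.g. topic 3 sampled 150 times and topic 7 sampled 50 times over 200 iterations
feats = topic_features([3] * 150 + [7] * 50)
```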
Acknowledgement
This work was supported by JSPS KAKENHI Grant Numbers JP15K00309, JP15K00425, JP15K16077.
Copyright information
© 2016 Springer International Publishing AG
Cite this paper
Yoshida, M., Matsumoto, K., Kita, K. (2016). Table Topic Models for Hidden Unit Estimation. In: Ma, S., et al. Information Retrieval Technology. AIRS 2016. Lecture Notes in Computer Science, vol. 9994. Springer, Cham. https://doi.org/10.1007/978-3-319-48051-0_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48050-3
Online ISBN: 978-3-319-48051-0