Abstract
We propose a method for estimating the hidden units of numbers written in tables. We focus on Wikipedia tables and propose an algorithm that estimates which units are appropriate for a given cell that contains a number but no unit words. We estimate such hidden units using surrounding contexts, such as a cell in the first row. To improve performance, we propose the table topic model, which models tables and their surrounding sentences simultaneously.
Notes
1. The total number of tables found in the corpus was 255,039.
2. We observed that 39.1 % of randomly selected number cells were number-only cells (i.e., cells without any unit).
3. Note that in this paper each \(x_i\) is assumed to be a vector whose value is 1 in the i-th dimension, where i is the ID of a context word for the cell.
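The vector representation described in this note could be sketched as follows. This is a minimal illustration, not the paper's implementation; the vocabulary and context words shown are hypothetical:

```python
def one_hot(word_id, dim):
    """Return a vector with 1 in the word_id-th dimension and 0 elsewhere."""
    v = [0.0] * dim
    v[word_id] = 1.0
    return v

# Hypothetical context-word vocabulary mapping words to IDs.
vocab = {"price": 0, "year": 1, "yen": 2}

# Hypothetical context words observed around one number-only cell.
context_words = ["price", "yen"]

# One vector x_i per context word of the cell.
xs = [one_hot(vocab[w], len(vocab)) for w in context_words]
```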
4. If we see the number “1987”, we think of it as a number that indicates a year.
5. For example, if the unit word is “yen”, the surrounding words are likely to contain the word “price”.
6. We observed in preliminary experiments that using all “same row” cells worsened the accuracy, so we do not use those cells.
7. In our data set, 266 (93.7 %) out of 284 tables (i.e., the tables that contain one or more hand-annotated cells) were row-wise.
8. It is inspired by Pólya tree models for modeling continuous values.
9. We also use some additional symbols, such as those for signs, but omit them here for the sake of simplicity.
10. We set \(N=2\) currently.
11. We use some rules to parse the number string, so different expressions such as “95,300” can also be handled.
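Taken together, the digit handling in the last few notes might be sketched as below: normalize the number string and keep its first \(N\) leading digits together with the digit count. The normalization rule shown (dropping all non-digit characters) is an illustrative assumption, not necessarily the paper's exact rule set:

```python
import re

def leading_digits(s, n=2):
    """Normalize a number string and return its first n digits plus the digit count."""
    digits = re.sub(r"[^0-9]", "", s)  # drop commas, signs, and other non-digit symbols
    if not digits:
        return None
    return digits[:n], len(digits)

# Different surface expressions map to the same representation.
assert leading_digits("95,300") == leading_digits("95300")
```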
12. We divided the corpus in such a way that cells from the same table do not appear in both the training and test subsets. The accuracy is calculated by summing up the correct/incorrect predictions over all cells, i.e., the accuracy is micro-averaged.
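A minimal sketch of this evaluation setup, assuming each cell carries a hypothetical `table_id` field: whole tables are assigned to folds so that no table is shared between subsets, and accuracy is micro-averaged over cells:

```python
from collections import defaultdict

def table_wise_folds(cells, k):
    """Assign whole tables to k folds so cells of one table never cross subsets."""
    by_table = defaultdict(list)
    for cell in cells:
        by_table[cell["table_id"]].append(cell)
    folds = [[] for _ in range(k)]
    for i, (_, table_cells) in enumerate(sorted(by_table.items())):
        folds[i % k].extend(table_cells)
    return folds

def micro_accuracy(predictions, golds):
    """Sum correct predictions over all cells, then divide by the cell count."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)
```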
13. Each Gibbs sampling run performed 500 iterations. The distribution of the sampled topic IDs in the final 200 iterations was used as the input features for the logistic regression (i.e., for the column of each cell in the test data, we added each observed topic ID as a feature with its relative frequency as its weight).
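The feature construction in this note could be sketched as follows: count the topic IDs sampled over the final iterations and turn them into relative-frequency-weighted features. The feature-name format is an assumption for illustration:

```python
from collections import Counter

def topic_features(sampled_topic_ids):
    """Map each observed topic ID to its relative frequency in the samples."""
    counts = Counter(sampled_topic_ids)
    total = len(sampled_topic_ids)
    return {f"topic={t}": c / total for t, c in counts.items()}

# e.g. topic 3 sampled 150 times and topic 7 sampled 50 times over 200 iterations
feats = topic_features([3] * 150 + [7] * 50)
```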
Acknowledgement
This work was supported by JSPS KAKENHI Grant Numbers JP15K00309, JP15K00425, JP15K16077.
Copyright information
© 2016 Springer International Publishing AG
Cite this paper
Yoshida, M., Matsumoto, K., Kita, K. (2016). Table Topic Models for Hidden Unit Estimation. In: Ma, S., et al. Information Retrieval Technology. AIRS 2016. Lecture Notes in Computer Science, vol. 9994. Springer, Cham. https://doi.org/10.1007/978-3-319-48051-0_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48050-3
Online ISBN: 978-3-319-48051-0