Audio-visual scene understanding utilizing text information for a cooking support robot | IEEE Conference Publication