Textract: Extract text from a large variety of file formats
1 point by ch_sm on April 9, 2021 | 3 comments


I'm assuming that this was the intended link:

https://aws.amazon.com/textract/


Huh. Must have made a mistake posting the original link. Anyway, this is what I meant: https://textract.readthedocs.io


This one’s interesting, because it seems to support more formats than Apache Tika and even includes speech recognition and OCR, all conveniently rolled into one package.
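
For anyone wondering what "rolled into one package" looks like in practice, here is a minimal sketch, assuming the Python `textract` package from the readthedocs link is installed along with whatever system tools a given format needs. The wrapper name `extract_text` and its `process` parameter are made up for illustration and are not part of the library; the only library call used is `textract.process`, which picks a parser from the file extension and returns bytes.

```python
from typing import Callable, Optional


def extract_text(path: str, process: Optional[Callable[[str], bytes]] = None) -> str:
    """Return the text content of `path` as a str.

    `process` is a hypothetical hook (not part of textract) so the
    function can be exercised without the library installed.
    """
    if process is None:
        # Imported lazily so this module loads even without textract.
        import textract
        process = textract.process
    raw = process(path)  # bytes, per the textract docs
    return raw.decode("utf-8")


if __name__ == "__main__":
    import sys

    # Print the first 200 characters of each file given on the command line.
    for name in sys.argv[1:]:
        print(extract_text(name)[:200])
```

The same call is meant to cover PDFs, Office documents, images and audio files, which is the convenience the comment above is pointing at; how well each format works depends on the external tools installed.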



