Imagine you're tasked with creating a tool for internal analytics at LinkedIn. Outline a pipeline that processes images and PDFs of resumes, converting them into searchable text data.
Your data pipeline should achieve the following outcomes:
1. Establish a data mart enabling machine learning models to access text data for natural language processing tasks.
2. Develop a data product that company analysts can use to monitor specific keywords.
3. Implement a search API that enables recruiters to find candidates based on keyword searches.
Assume the following:
- The image-to-text conversion models are reliable and ready for deployment.
- The data does not require real-time processing but should have a quick turnaround.
- Privacy and security considerations are being handled by another team, so they need not be part of your design.
Please state any other assumptions you make at the start of your answer.