We are looking for a skilled and motivated Mid-level Java Developer to join our dynamic data acquisition team. In this role, you will be responsible for building and maintaining robust web scrapers that form the backbone of our data-driven insights.
You will work on extracting both structured and unstructured data from a wide array of web sources, ensuring the efficiency, accuracy, and reliability of our data pipelines. This position requires a strong foundation in Java development, a good understanding of web technologies, and a passion for solving the unique challenges presented by web scraping at scale.
What you'll do
- Develop and Maintain Scrapers: Build, deploy, and maintain efficient and reliable web scrapers using Java and its core libraries to extract data from diverse websites and online sources.
- Automate and Schedule: Design and implement scripts to automate repetitive scraping tasks, scheduling jobs using tools like cron or enterprise schedulers (e.g., Airflow) to ensure timely data collection.
- Data Storage and Management: Store and manage scraped data effectively in various databases, including SQL and NoSQL solutions, as well as cloud-based storage platforms.
- Overcome Scraping Hurdles: Employ various tools and techniques to successfully navigate and bypass common web scraping obstacles such as CAPTCHAs, dynamic content loading, and IP blocking.
- Optimize for Performance: Ensure scrapers are optimized for performance and scalability, capable of handling large-scale data extraction tasks without compromising system stability.
- Data Processing and Cleansing: Transform raw scraped data into clean, structured formats like CSV and JSON. Implement data validation and cleansing processes to guarantee data quality and integrity.
- Ensure Compliance: Adhere to web scraping best practices and ensure all data acquisition activities are in compliance with legal and ethical standards, including website terms of service.
- Collaborate Effectively: Work closely with data analysts, product managers, and other developers to understand data requirements and deliver high-quality, actionable data.
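To give a concrete flavor of the "Data Processing and Cleansing" responsibility above, here is a minimal, dependency-free sketch of a validation-and-CSV step. The field names (`name`, `price`) and the validation rules (drop incomplete rows, strip currency symbols) are illustrative assumptions, not a prescribed implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cleansing step for scraped product records: trim whitespace,
// normalize prices, drop rows missing required fields, and escape CSV output.
public class RecordCleanser {

    // Returns a CSV line for the record, or null if a required field is blank.
    public static String toCsvLine(String name, String price) {
        if (name == null || price == null) return null;
        String n = name.trim();
        String p = price.trim().replaceAll("[^0-9.]", ""); // keep digits and dot
        if (n.isEmpty() || p.isEmpty()) return null;       // validation: drop incomplete rows
        return escape(n) + "," + escape(p);
    }

    // Minimal CSV escaping: quote fields containing commas or quotes.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        List<String[]> raw = new ArrayList<>();
        raw.add(new String[]{"  Widget, large ", " $19.99 "});
        raw.add(new String[]{"Gadget", ""}); // incomplete: dropped by validation
        raw.add(new String[]{"Bolt", "0.25 USD"});
        for (String[] r : raw) {
            String line = toCsvLine(r[0], r[1]);
            if (line != null) System.out.println(line);
        }
    }
}
```

In a real pipeline this step would sit between the scraper and the storage layer, with the cleaned CSV or JSON handed off to the databases mentioned above.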
What experience do you need
- A Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related technical field.
- 3+ years of professional experience in software engineering with a strong focus on Java development and proven experience writing Java code to extract data from websites, ensuring efficiency, accuracy, and adherence to best practices.
- 2+ years of experience with web technologies, including a solid understanding of JavaScript, HTML, CSS, and XML for effective entity extraction, and hands-on experience designing, querying, and managing data in both SQL and NoSQL databases.
- 2+ years of experience with core Java web scraping libraries such as Jsoup for HTML parsing and browser automation tools like Selenium or HtmlUnit for dynamic, JavaScript-rendered content, plus experience working with data formats like JSON and CSV and applying data cleaning and validation techniques.
- English proficiency of B2 or higher.
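The entity-extraction skills described above would typically lean on Jsoup's lenient HTML parser; as a dependency-free illustration, here is a sketch using the JDK's built-in DOM parser on a well-formed XHTML snippet. The `span class="price"` markup is a made-up example, not a real target site:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class EntityExtractor {

    // Extracts the text of every <span class="price"> element from well-formed
    // XHTML. Real-world HTML is rarely well-formed, which is why a lenient
    // parser like Jsoup is the usual choice in production scrapers.
    public static List<String> extractPrices(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        List<String> prices = new ArrayList<>();
        NodeList spans = doc.getElementsByTagName("span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element e = (Element) spans.item(i);
            if ("price".equals(e.getAttribute("class"))) {
                prices.add(e.getTextContent().trim());
            }
        }
        return prices;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div><span class=\"price\">19.99</span></div>"
                + "<div><span class=\"name\">Widget</span></div>"
                + "<div><span class=\"price\">0.25</span></div>"
                + "</body></html>";
        System.out.println(extractPrices(page));
    }
}
```

With Jsoup, the same extraction collapses to roughly `doc.select("span.price")`, and Selenium or HtmlUnit would be layered in only when the content is rendered by JavaScript.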
What could set you apart
- Experience with cloud platforms such as AWS, Google Cloud, or Azure for deploying and managing scraping infrastructure.
- A foundational understanding of network traffic analysis.
- Familiarity with the full Software Development Life Cycle (SDLC), including testing and quality assurance.
- Proficiency with version control systems, particularly Git, for collaborative development.
- Experience with CI/CD pipelines and associated tools.
- A keen understanding of the importance of respecting website terms of service and practicing ethical scraping.
Primary Location:
CRI-Sabana
Function:
Tech Dev and Client Services
Schedule:
Full time