Senior data engineer

Ancora

Posted 11h ago

Description

Ancora is building AI-native accounting software that replaces traditional accounting management systems. We're not improving the status quo; we're replacing it entirely: from software that waits for human input to an autonomous agent that performs accounting under professional supervision.

We're an Italian startup based in Milan. Our model combines a technology platform that automates the operational work accounting firms do every day (bookkeeping, tax filings, document management, compliance deadlines) with a roll-up strategy: acquiring and consolidating these firms (studi commercialisti) across Italy.

The vision is clear: free accounting professionals from repetitive tasks so they can focus on what actually requires human judgment, such as strategic consulting, client relationships, and growing their practice. We're building the infrastructure that makes this possible at scale.

Where we are today. We're venture-backed by some of Italy's best investors, with our first studio acquisitions underway. The product is greenfield: zero legacy code, modern stack, built from scratch. Our engineering team is designing the entire architecture now. The decisions we make today will shape the system for years.

What makes Ancora different. We own the problem end-to-end: we build the technology, we acquire the firms, we operate the service. This means we control the feedback loop between what accountants need and what we build. We're not selling software to reluctant buyers; we're building it for firms we operate. The technology has to work because our business depends on it. Italian accounting is deeply regulated, complex, and largely untouched by modern software. The domain complexity is real, and that's what makes it interesting.

About the Role

As our Data Engineer, you'll design and build the data infrastructure that powers our system: from ingestion pipelines that handle heterogeneous document formats, to the RAG and knowledge graph architecture that enables intelligent retrieval and reasoning.

The technical challenges. You'll tackle temporal versioning at scale, tracking how authoritative documents evolve over time with complex effective dates, retroactive changes, and transitional provisions. You'll parse natural-language amendments, extracting structured diffs from modifications like "replace X with Y in paragraph 3" and reconstructing consolidated versions programmatically. You'll build multi-layer knowledge graphs connecting source documents to their interpretations, amendments, and operational mappings, preserving semantic authority levels across the graph. You'll normalize heterogeneous sources, ingesting dozens of formats (PDFs, HTML, XML, scanned documents) with no standardized structure into a unified, queryable corpus. And you'll design context-dependent retrieval systems where the correct answer depends not just on the query, but on multi-dimensional context: time, jurisdiction, entity profile.

What You'll Do

Build the data ingestion and normalization infrastructure. Design multi-format ingestion pipelines (PDFs with OCR, HTML, XML, scanned documents) from heterogeneous sources. Transform documents into a unified schema while preserving semantic distinctions: authority levels, document types, versioning metadata. Handle edge cases: implicit cross-references, evolving formats, natural-language amendments, missing metadata. Build validation pipelines to catch ingestion errors and monitor source freshness.

Own the RAG and knowledge graph architecture. Design hierarchical RAG systems with chunk-level embeddings, document-level summaries, and cross-document relationship modeling. Construct knowledge graphs connecting source documents to their interpretations, amendments, and operational mappings. Extract relationships from complex text (references, hierarchies, temporal dependencies) using NLP and LLM-based approaches. Implement temporal versioning with complex effective date logic and retroactive change tracking.

Ensure data quality and enable downstream AI. Build tooling for data quality audits, anomaly detection, and validation at scale. Provide retrieval APIs (RAG + graph queries) for the reasoning engine to consume. Design systems where retrieval accuracy directly determines AI agent correctness.

What We're Looking For

Must Have

- 3+ years of experience building production data pipelines (ETL/ELT).
- Strong proficiency in Python, our primary language for data work.
- Experience with data extraction from messy sources: PDFs, HTML scraping, document parsing.
- Hands-on experience with data orchestration tools (Airflow, Prefect, Dagster, or similar).
- Solid understanding of data modeling and schema design.
- Experience with SQL and NoSQL databases (Postgres, MongoDB, or similar).
- Ability to write robust, testable, maintainable code.
- Comfort working with ambiguity and iterating on solutions.

Nice to Have

- Cloud environments (AWS preferred) and infrastructure as code (Terraform).
- RAG systems: vector databases (Pinecone, Weaviate, Qdrant), embedding models, retrieval strategies.
- Graph databases (Neo4j, Neptune) and graph query languages (Cypher, Gremlin).
- OCR pipelines (Tesseract, cloud OCR services).
- NLP for information extraction: entity recognition, relationship extraction, LLM-based parsing.
- Regulated industries (legal, finance, healthcare) where data quality is critical.
- MLOps practices: feature stores, data versioning (DVC), model monitoring.

Mindset

- Detail-oriented. You understand that edge cases matter when dealing with complex documents.
- Pragmatic. You balance "perfect" with "good enough to ship" and iterate.
- Curious. You want to understand the domain, not just move data around.
- Collaborative. You work closely with ML engineers, backend engineers, and domain experts.

Why This Role is Interesting

- Foundational impact. Your work directly determines whether AI agents can operate autonomously in a regulated domain. No data quality = no intelligence.
- Unique technical challenges. Build temporal knowledge graphs for constantly evolving authoritative documents, parse natural-language amendments into structured diffs, and design context-dependent RAG systems: problems with no off-the-shelf solutions.
- Greenfield with modern practices. Zero legacy code: design the data architecture from scratch and build with best-in-class tools such as vector DBs, graph databases, and modern orchestration.
- Full-stack ownership. Own everything from raw PDFs to the knowledge infrastructure that powers AI reasoning, working directly with the founding team.

What We Offer

- Office First. Collaboration is easier and more effective in person at our Milan HQ. You can work from home up to 30% of the time and enjoy great company during our three core days in the office.
- Compensation. €50,000 – €75,000 gross annual salary, based on experience.
- Equity. Meaningful stock options package, based on experience and scope. Equity is offered with a strike price of 0 and strong upside potential given the stage of the company.
- Benefits. Meal vouchers and fringe benefits.
- Team. Direct collaboration with domain experts and the founding team.
- Autonomy. Strong voice in technical decisions, from data architecture to tooling choices.
- Growth. Opportunity to build a critical piece of infrastructure from zero as the team scales.

How to Apply

Send your CV with a brief note on why this role interests you.

We're an equal opportunity employer. We value diversity and encourage applications from people of all backgrounds.

