NLP Data Scientist

Location: Remote

*** Mention DataYoshi when applying ***

NLP @ Glean

New York City / Remote OK

ABOUT GLEAN (currently in stealth mode)

Glean is injecting intelligence into expense management. Founded by fintech veteran and seasoned CFO Howard Katzenberg, Glean is an AI-powered spend intelligence solution that saves SMBs money by analyzing expense drivers and finding line-item level insights overlooked by most accounts payable solutions, which are focused on speeding up payment cycles rather than optimizing vendor spend.

Over time, our mission is to become the all-in-one spend management solution for SMBs that uses AI to: analyze spend, manage approval and payment workflows, identify anomalous spend & savings opportunities, benchmark spend performance vs. peers, find & negotiate savings with vendors, and forecast future expense trends.

Glean will be every finance team's best friend!

Join us!


We are building the next billion-dollar business in enterprise SaaS. We have an amazing 12-person team, with experts in product, engineering, data science, and machine learning. We launched our MVP in May 2020 and have onboarded nearly two dozen clients since then. Now, we are building more of our core IP as we head towards our Series A fundraise later this fall.

We are searching for an NLP specialist to join our machine learning team; you will own the NLP models we build to extract data from invoices, map line items to their canonical forms, and reconcile the invoice data to the general ledger.

Here's how it works: After we ingest PDFs from clients, we perform OCR, extract the text using text classification models, validate the extraction with a human in the loop, and then map all the data into our canonical taxonomy.

Once the data is mapped to our taxonomy, we have a clean digital invoice-based data asset. We generate insights using this data asset, surfacing line-item level insights to our clients via our web app.

Your role will be critical to generating this intelligence for our clients.

Here are some of the major items you will own at Glean:

    • Code to clean and prepare the data extracted from invoices, including handling nulls, addressing exceptions, imputing data (when possible), and enriching data using third-party APIs
    • Code to validate the quality of extractions performed by the ML models + human in the loop
    • In-house models to perform OCR and text classification
    • In-house NLP models to map the extracted data to our canonical taxonomy of vendors and line items (e.g., using word embeddings, etc.) at scale
    • In-house NLP models to perform two-way reconciliation between invoice data and the general ledger

You will work closely with the data engineers and data scientists to productionize the data science and machine learning code that is developed, and you will work closely with the engineering team to make sure the data science and machine learning pipelines are performant in production.

You'll be accountable for Glean's:

    • Data science vision – ownership over and definition of the data science pipeline, working closely with the executive team, and acting as a thought leader across the company
    • Technical execution – define, articulate, and execute the data science strategy from launch to scale, managing the day-to-day execution, and implementing best practices
    • Fast prototyping culture – developing data science and machine learning applications fast, prototyping quickly to solve thorny business problems without getting bogged down in theory (ship early and often)
    • Getting your hands dirty – we're a small team and we have a lot of work ahead of us. You should be excited to roll up your sleeves and help the team in any way you can
    • Build, build, build – jumping in and writing data science code, from architecture to fixing bugs, delivering NLP models and insights within the first three months of hire. We'll rely on you and your team to handle day-to-day execution including:
        • Cleaning the extracted data, validating the accuracy of the extraction, and addressing issues as they occur
        • Shipping NLP applications to perform mapping of extracted data at scale
        • Developing and deploying models using PyTorch or TensorFlow
        • Writing code using PySpark
        • Performing feature engineering on the cleansed and mapped data
        • Generating insights and narratives
        • Wrangling, standardizing, enhancing, implementing, and monitoring data repositories
        • Creating workflows to ingest, enrich, and make data available across the Glean platform


You'll be a perfect fit for the Glean Team if…

    • You want to join an early-stage startup or you are eager to be challenged at your first startup
    • You like the tension between craft and shipping. You have a strong ability to quickly and effectively evaluate technical tradeoffs and translate them into short/long term business decisions
    • You've built highly scalable data pipelines and have deep expertise in setting technical vision, architecting, building, and maintaining performant systems, from launch to scale
    • You are passionate about building / leading data science teams and rapidly developing data pipelines at various stages of growth
    • You pride yourself on communicating complex concepts, including the ability to distill intricate workflows and systems into clear processes and decisions with measurable company-wide impact
    • You ask “why” a lot and use critical thinking and data to back up your intuitions. You hate when a customer struggles through your product experience
    • You have managed a budget before and have seen first-hand the challenges in managing vendor spend


    • 3+ years of experience in Python, Data Science, Machine Learning, and NLP
    • 2+ years of experience with Spark, coding in either PySpark or Scala
    • Experience building and deploying deep learning models using TensorFlow and/or PyTorch
    • Well versed in NLP
    • Experience with Unsupervised Learning is a bonus
