AI-Powered Employee Data Quality Optimization & Classification

Combining Semi-Supervised Clustering With LLM-Based Semantic Interpretation in a UX-First Approach
Our client, a Belgian HR specialist, offers a wide range of services related to human resources and social administration for businesses and self-employed individuals. Through an online platform, employers and employees have access to various services such as payroll administration, wellbeing advice, and more. With an extensive active membership base and monthly visitors in the hundreds of thousands, the platform experiences significant traffic.

Platform users expect a frictionless digital journey where they can interact at any time, looking for an integrated, one-stop-shop service experience. As an advisor and service provider, the client is dedicated to delivering accurate and tailored support to its customers. In this context, having reliable and accurate customer data is crucial for extracting relevant insights. This necessity brings high demands in terms of data quality—no small feat for a fast-growing organization with legacy systems built over the years.

At Faktion, we strongly believe in a data-centric approach to AI, and in using AI to optimize data quality. This viewpoint is also embraced by our client’s leadership, which has made data quality and data governance a strategic priority. Hence, when looking for a partner that could help them take data quality to the next level, our client found a natural ally in Faktion.

The Situation "As Is": Unstructured Data in Free-Text Fields

Managing a large-scale HR services platform requires substantial effort in user management while also ensuring compliance with data protection and privacy regulations. A large portion of this data comes from older legacy systems, making some records incomplete or incorrect. According to the principle of “garbage in, garbage out,” poor data yields poor insights.

As an advisor, having the correct information about customers is essential to adapting communication and services. For instance, imagine a prevention and wellbeing advisor helping a business owner ensure all legal safety standards are met. Naturally, the advisor needs to know who the employees are and what their job titles or professional activities entail, as safety standards differ by role. In an ideal scenario, multiple activities or job titles link to an overarching reference title, which in turn is tied to specific safety regulations.

Historically, however, new employees’ job titles were added as free-text fields. The decision to allow free-text input was originally meant to simplify the user experience: offering maximum flexibility for individuals when adding their roles, rather than forcing them to scroll through a very long list of job titles. Over time, though, those free-text entries have created challenges:

  • Overly Detailed Descriptions: Some descriptions are too granular, making it difficult to identify the appropriate reference title.

  • Duplicates: Free-text entries often lead to variations (e.g., “metser” vs. “metselaar”) that reference the same job role.

  • Overly Generic Terms: Users sometimes enter very broad descriptions, making it hard to decide which specific reference title applies.

Thus, what was once a deliberate UX choice to offer free-text flexibility is now limiting the ability to leverage that data for various customer-focused initiatives. An up-to-date classification system is the logical next step, and modern AI can handle much of that effort automatically.

Do Many Hands Really Make Light Work?

Essentially, all free-text fields need classification into specific job title categories so that the client can provide more tailored services. Several data-cleanup actions had been taken over the years, but most proved non-scalable because they were done manually: time-consuming, error-prone, repetitive, and ultimately draining resources in a way that hindered operational efficiency.

The client also needed accurate insights for benchmarking, and it suspected that current technology could offer a more advanced solution for data management. Hence, the goal was to automate classification of free-text job titles into standardized reference categories, enhancing data quality and consistency without requiring additional workforce or compromising customer satisfaction.

Faktion’s “Intelligent Data Quality Optimization” (IDQO) toolbox—an AI-driven suite of solutions for automating data tasks—provides a robust and scalable way to eliminate manual processes.

A Phased Approach

For the project’s first phase, the client prioritized employee classification in a specific high-need industry: construction.

  1. Reference List Compilation:
    We used Large Language Models (LLMs) to build an extensive reference list of job titles specifically for the construction sector.

  2. Categorizing Existing Data:
    All existing construction-related employee data was then classified under the appropriate reference job title.

  3. Improving Data at the Source:
    Rather than simply doing periodic updates of employer and employee data, a proactive UX solution was developed to address data quality at the point of entry. New employees are immediately categorized into the reference list upon registration—based on relevant context like employer activity and salary data—within the platform.

  4. Scalability:
    The solution is designed as an AutoML system so that the client’s data teams can easily adapt and repeat the process for other industries.

Three Building Blocks for a Scalable Solution

Overall, the AI-driven solution comprises three major building blocks:

  1. Methodology to Compile a Reference List
    This involves clustering job title descriptions and leveraging both an LLM and human experts to refine and label clusters accurately.

  2. Design and Training of a Classification System
    This system predicts the appropriate reference job title for any given free-text field.

  3. Search API for Integration
    The classification model is made available through an API, enabling both the automation of historical data classification and on-the-fly suggestions for new entries.

Building Block 1: Compiling the Job Title Reference List

Initially, we applied an unsupervised clustering approach by running a hierarchical clustering model on the cleaned free-text job descriptions, grouping them by similarity. A single round of manual refinement was then performed in collaboration with the client’s domain experts.

However, relying on unsupervised clustering alone proved insufficient:

  • Lack of Meaningful Labels: Pure clustering algorithms do not generate semantic labels; manual human interpretation is still required to assign descriptive titles to each group.

  • Inaccurate Groupings: Generic terms like “Installateur,” “Plaatser,” or “Monteur” were sometimes lumped together in ways that obscured more specific titles (e.g., “Venstermonteur” or “Trappenplaatser”).

To address these shortcomings, we leveraged LLMs for semantic interpretation, aiming to separate generic words from more specific subcategories. Here is the high-level workflow:

  1. Preprocessing: Clean the raw data (including free-text job descriptions).

  2. Unsupervised Clustering: Group the processed job descriptions.

  3. LLM Labeling: Feed the clusters to a Large Language Model to suggest appropriate labels and identify potential merges of similar clusters.

  4. Expert Validation & Feedback: Domain experts validate and correct the labels proposed by the LLM.

  5. Semi-Supervised Clustering: Incorporate feedback into the clustering process, refining group labels further.

  6. Repeat & Finalize: Continue iterating until an optimal reference list is formed.
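The clustering step above can be sketched in miniature. The real pipeline clusters embedding vectors of the cleaned descriptions hierarchically; in this simplified sketch, plain string similarity stands in for embedding distance, and the job titles and threshold are purely illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for embedding distance."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_titles(titles, threshold=0.6):
    """Greedy agglomerative grouping: each title joins the first cluster
    whose representative is similar enough, otherwise starts a new one."""
    clusters = []
    for title in titles:
        for cluster in clusters:
            if similarity(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters

# Illustrative Dutch job titles, including near-duplicates.
print(cluster_titles(["metser", "metselaar", "dakwerker", "dakdekker", "elektricien"]))
```

Near-duplicates such as “metser” and “metselaar” end up in the same cluster, while unrelated titles stay separate. Exactly these raw clusters then go to the LLM for labeling and to the experts for validation.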

Building Block 2: A Classification Model With Zero-Shot Learning

The next step involved building a classification model capable of mapping each free-text job description to the correct label from the reference list. Sometimes a perfect one-to-one match exists; when it does not, the classification model must infer the closest match.

A key challenge was the absence of a labeled dataset: the client lacked preexisting links between free-text entries and standardized reference titles. To surmount this, we employed a zero-shot learning approach with SetFit.

  1. Zero-Shot Framework (SetFit):
    The SetFit framework can generate synthetic training examples based on user-defined labels. It then trains a model on these synthetic samples, effectively “bootstrapping” a classifier where no initial labeled data is available.

  2. Initial Model & Preliminary Labeled Dataset:
    Once trained, the SetFit model labels the existing free-text job descriptions, creating a preliminary labeled dataset.

  3. Human-in-the-Loop with Argilla:
    We used Argilla, an annotation platform, to collect expert feedback on the initial predictions. Experts could quickly correct the assigned labels, providing valuable guidance on classification.

  4. Refinement:
    The corrected labels are fed back into the SetFit model for retraining, incrementally improving performance. This feedback loop can be repeated until results reach an acceptable accuracy threshold.
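The bootstrapping loop above can be illustrated with a framework-agnostic miniature. The actual project used SetFit and Argilla; here, the function names, Dutch templates, and example corrections are all illustrative, and a dictionary merge stands in for retraining:

```python
def make_synthetic_examples(labels, templates):
    """Generate (text, label) training pairs from templates: the idea
    behind zero-shot bootstrapping when no labeled data exists."""
    return [(tpl.format(label), label) for label in labels for tpl in templates]

def merge_corrections(predicted, corrections):
    """Overwrite model predictions with expert corrections collected in
    the annotation round, producing the next training set."""
    return {text: corrections.get(text, label) for text, label in predicted.items()}

labels = ["Metselaar", "Dakwerker"]
templates = ["Deze functie is {}.", "De werknemer werkt als {}."]
synthetic_train = make_synthetic_examples(labels, templates)  # 4 pairs

# Step 2: the bootstrapped model labels real entries (stubbed here).
predicted = {"metser": "Dakwerker", "dakdekker": "Dakwerker"}
# Step 3: an expert corrects one label during review.
corrections = {"metser": "Metselaar"}
# Step 4: corrected data becomes the next round's training set.
print(merge_corrections(predicted, corrections))
```

Each pass through this loop shrinks the share of entries that still need expert review, which is what makes the human-in-the-loop effort affordable.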

Building Block 3: API Integration

Finally, a Search API was created to classify both historical data and new entries in real time. Upon data entry:

  1. Heuristic Matching:
    The system checks if the entry precisely matches an existing reference title or a previously labeled entry. If so, it assigns the known label directly.

  2. SetFit Classification:
If no direct match is found, the input is sent to the SetFit model, which outputs a predicted reference job title along with a confidence score. A high-confidence prediction is stored automatically.

  3. Ensemble Method (Low Confidence Handling):
    If the SetFit model’s confidence is low, the input goes to an ensemble of different models:

    • A BERT-family classifier

    • An OpenAI model

    • A semantic search module

    These models vote on the best classification. Majority voting decides the final label, ensuring higher accuracy in uncertain cases.

Because the same API powers both historical data classification and new entry suggestions, it seamlessly keeps all data consistent and up to date.
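The tiered flow can be sketched as follows. The models are stand-in callables and the confidence threshold is an assumed value; the sketch only shows the routing logic, not the real SetFit or ensemble members:

```python
from collections import Counter

def classify(entry, known_labels, primary_predict, ensemble, threshold=0.8):
    """Tiered classification: exact match first, then the primary model,
    then a majority vote of fallback models (all models are stand-ins)."""
    # 1. Heuristic matching: reuse a label we already know.
    if entry in known_labels:
        return known_labels[entry]
    # 2. Primary model: accept high-confidence predictions directly.
    label, confidence = primary_predict(entry)
    if confidence >= threshold:
        return label
    # 3. Low confidence: let the ensemble vote; majority wins.
    votes = Counter(model(entry) for model in ensemble)
    return votes.most_common(1)[0][0]

# Illustrative stubs standing in for the trained models.
known = {"metselaar": "Metselaar"}
primary = lambda entry: ("Dakwerker", 0.4)  # low-confidence prediction
ensemble = [lambda e: "Metselaar", lambda e: "Metselaar", lambda e: "Dakwerker"]

print(classify("metser", known, primary, ensemble))
```

With a low-confidence primary prediction, the ensemble’s two-to-one vote decides the final label; a known entry or a high-confidence prediction never reaches the voting step.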

The Benefits of Automation

In the first phase, we delivered:

  1. A Reference List of Construction Job Titles:
    Comprehensive and refined, covering the key roles in the construction industry.

  2. A SetFit Classification Model:
    Capable of automating the backlog of existing free-text data and accurately predicting future entries.

  3. Search API:
    For real-time job title suggestions and classification, integrated directly into the client’s platform.

This approach significantly improves insights for construction-sector customers, ensuring more tailored services. All steps are well documented, paving the way for future rollouts in other industries.

Moreover, the entire solution involved close collaboration among various stakeholders—domain experts, data specialists, IT/architecture specialists, and management—to guarantee a holistic perspective on requirements. This emphasis on knowledge transfer means that the client’s teams can create and train new data pipelines without extensive ML engineering skills.

Next, the approach will be further operationalized and extended to additional sectors. This will continue to elevate data quality across the board. Ultimately, the client will have a robust and well-structured reference list spanning all industries they support—fully integrated into the platform.

Key outcomes include:

  • Consistent Data Quality for more reliable analytics and decision-making.

  • Increased Operational Efficiency by automating repetitive tasks and minimizing manual overhead.

  • Enhanced Services and Customer Experience thanks to clean, well-labeled data that yields sharper insights and more personalized offerings.

By embracing a data-centric approach and leveraging AI-driven classification, the client lays a strong foundation for high-quality data and scalable AI initiatives in the future—transforming data from a liability into a driver of innovation and strategic growth.

Get in touch!

