Engineering a High-Performance Bid Orchestrator: Scaling Government Data Extraction

APR 27, 2026

The Mission

Government portals like GeM (Government e-Marketplace) are built with legacy architectures that prioritize security over developer experience. I found the manual process of tracking “Drone” bids—which involves navigating non-persistent AJAX tables and manually digging through multi-page PDFs for hidden technical specifications—to be a significant operational bottleneck.

I engineered a Bid Orchestrator to automate this entire lifecycle. The goal was simple: move the heavy lifting to a background worker, extract every nested link within a bid document, and serve the data through a clean Infrastructure Telemetry dashboard.

Hard Problem #1: Resolving Nested PDF Dependencies

The biggest technical friction wasn’t just scraping the HTML; it was the data buried inside the documents. Most government bids contain links to secondary technical specification PDFs. Standard text extraction libraries fail here because these links are often embedded as PDF “Annotations” rather than plain text.

I discarded lightweight parsers in favor of pdfjs-dist and built custom extraction logic that specifically targets the Link subtype in the PDF's internal annotations.

const fs = require('fs');
// Node-side pdf.js build; the exact entry point varies with the pdfjs-dist version installed.
const pdfjs = require('pdfjs-dist/legacy/build/pdf.js');

const extractLinksFromPdf = async (filePath) => {
    const links = new Set();
    const dataBuffer = new Uint8Array(fs.readFileSync(filePath));
    const loadingTask = pdfjs.getDocument({ data: dataBuffer, disableFontFace: true });
    
    const pdf = await loadingTask.promise;
    for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        
        // Extracting both visible text and hidden hyperlink objects
        const [textContent, annotations] = await Promise.all([
            page.getTextContent(),
            page.getAnnotations()
        ]);

        const text = textContent.items.map(item => item.str).join(' ');
        const annotationLinks = annotations
            .filter(ann => ann.subtype === 'Link' && ann.url)
            .map(ann => ann.url);

        // State Synchronization: Consolidating text-based URLs and Object-based links
        const urlRegex = /((?:https?:\/\/|www\.)[^\s<"']+(?:\.pdf))/gi;
        const textLinks = text.match(urlRegex) || [];

        [...textLinks, ...annotationLinks].forEach(url => {
            const cleanUrl = url.trim().replace(/[.,;)$]+$/, '');
            if (cleanUrl.toLowerCase().endsWith('.pdf')) links.add(cleanUrl);
        });
    }
    return [...links];
};

Hard Problem #2: The State Synchronization Engine

Scaling Playwright to handle hundreds of pages is a recipe for memory leaks and IP bans if done synchronously in the web process. I moved the scraping logic into a Scoped Resource Lifecycle Management system using BullMQ and Redis.
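
What "scoped" means in practice: each job spins up its own browser and tears it down in a finally block, so a crashed page can never leak into the next run. A minimal sketch of that lifecycle (the portal URL and selector below are placeholders, and the real implementation also handles pagination and result persistence):

const { chromium } = require('playwright');

// Placeholder for the portal's bid-search page.
const GEM_BIDS_URL = process.env.GEM_BIDS_URL;

const scrapeAndStoreBids = async (searchTerm) => {
    const browser = await chromium.launch({ headless: true });
    try {
        const context = await browser.newContext();
        const page = await context.newPage();
        await page.goto(GEM_BIDS_URL, { waitUntil: 'domcontentloaded' });
        await page.fill('#searchBid', searchTerm); // selector is illustrative
        // ... walk the AJAX result table and persist each row via better-sqlite3 ...
    } finally {
        // Always release the browser, even if the portal times out mid-scrape.
        await browser.close();
    }
};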

The friction here was ensuring the frontend knew the status of a job without constant page reloads. I implemented a State Synchronization Engine using a search_jobs table in SQLite. When a user requests a search term that doesn’t exist, the system creates a “Pending” state, hands the task to a BullMQ worker, and the Admin Infrastructure Telemetry dashboard tracks the worker’s progress in real time.
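
The request side is the other half of the engine. A rough sketch of that handshake, reusing the shared db handle and Redis connection; the route, column names, and getCachedBids helper are illustrative:

const { Queue } = require('bullmq');

const scrapeQueue = new Queue('scrapeQueue', { connection });

app.get('/api/bids', async (req, res) => {
    const { searchTerm } = req.query;
    const job = db.prepare(`SELECT status FROM search_jobs WHERE term = ?`).get(searchTerm);

    if (!job) {
        // First request for this term: record the pending state, then hand off to the worker.
        db.prepare(`INSERT INTO search_jobs (term, status) VALUES (?, 'pending')`).run(searchTerm);
        await scrapeQueue.add('scrape', { searchTerm });
        return res.json({ status: 'pending' });
    }

    // Completed searches are served from the SQLite cache; anything else just
    // reports its current state so the dashboard can poll without a reload.
    if (job.status === 'completed') {
        return res.json({ status: 'completed', bids: getCachedBids(searchTerm) });
    }
    return res.json({ status: job.status });
});

On the other side of the queue, the worker picks up the job and walks the same state machine: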

const { Worker } = require('bullmq');
// `connection` (Redis options) and `db` (the better-sqlite3 handle) are created once at startup.

const worker = new Worker('scrapeQueue', async job => {
    const { searchTerm } = job.data;
    
    // Atomically move state to Active
    db.prepare(`UPDATE search_jobs SET status = 'active' WHERE term = ?`).run(searchTerm);

    try {
        await scrapeAndStoreBids(searchTerm);
        db.prepare(`UPDATE search_jobs SET status = 'completed' WHERE term = ?`).run(searchTerm);
    } catch (err) {
        db.prepare(`UPDATE search_jobs SET status = 'failed' WHERE term = ?`).run(searchTerm);
        throw err; // Re-throw so BullMQ marks the job failed and retries it per its backoff policy
    }
}, { connection, concurrency: 1 });

The Safety Section: Secure Resource Access

Government servers often utilize self-signed certificates or misconfigured SSL chains. Hardcoding a bypass is dangerous, so I scoped the insecure HTTPS agent only to the PDF downloader middleware.
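
Roughly what that scoping looks like with Node's built-in https module; the helper lives alongside the downloader and the permissive agent never leaves this file (the function name here is mine):

const https = require('https');
const fs = require('fs');

// Confined to the PDF downloader: every other request in the app keeps full
// certificate validation.
const insecureAgent = new https.Agent({ rejectUnauthorized: false });

const downloadPdf = (url, destination) => new Promise((resolve, reject) => {
    https.get(url, { agent: insecureAgent }, res => {
        if (res.statusCode !== 200) {
            res.resume(); // drain the response before bailing out
            return reject(new Error(`Download failed: HTTP ${res.statusCode}`));
        }
        const file = fs.createWriteStream(destination);
        res.pipe(file);
        file.on('finish', () => resolve(destination));
        file.on('error', reject);
    }).on('error', reject);
});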

For data egress, I implemented a temporary download link system. Instead of exposing the file system structure, the app generates a UUID token mapped to a specific user and bid number. This token expires in 30 minutes, ensuring that bid documents aren’t indexed by crawlers or accessed by unauthorized users.
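
The issuing side is only a few lines. A sketch, assuming a download_links table with token, user_id, bid_number, filename, and expires_at columns (the column names are illustrative):

const crypto = require('crypto');

// Hypothetical helper: mints a 30-minute token bound to one user and one bid document.
const createDownloadLink = (userId, bidNumber, filename) => {
    const token = crypto.randomUUID();
    db.prepare(`
        INSERT INTO download_links (token, user_id, bid_number, filename, expires_at)
        VALUES (?, ?, ?, ?, datetime('now', '+30 minutes'))
    `).run(token, userId, bidNumber, filename);
    return `/download/${token}`;
};

The redemption route then checks the token and its expiry before streaming the file back: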

const path = require('path');

app.get('/download/:token', (req, res) => {
    const { token } = req.params;
    const linkData = db.prepare(`
        SELECT * FROM download_links 
        WHERE token = ? AND expires_at > CURRENT_TIMESTAMP
    `).get(token);
    
    if (!linkData) return res.status(403).send('Unauthorized or Expired');
    
    const filePath = path.join(process.cwd(), 'storage', 'bids', linkData.bid_number, linkData.filename);
    res.download(filePath); // Node.js stream-based file delivery
});

The Strategic Conclusion

By offloading the scraping to a background worker and using better-sqlite3 for local caching, I reduced the bid retrieval time from 120 seconds (manual) to sub-500 milliseconds for cached data. I sacrificed the simplicity of a single-file script to gain a resilient, multi-process architecture that can recover from network failures automatically.

Next on my board is integrating a Tesseract OCR layer into the PDF parser. Many government bids are still uploaded as flat scanned images, making text-based extraction impossible. Adding an OCR middleware will complete the data extraction pipeline for even the most legacy bid formats.
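
None of this exists yet, but the shape is clear. A sketch using the tesseract.js v5 worker API, assuming the scanned pages have already been rasterized to PNGs (for example with pdftoppm or a canvas-backed pdf.js render):

const { createWorker } = require('tesseract.js');

// Fallback path: when a page yields no extractable text, OCR the rasterized image instead.
const ocrPageImage = async (imagePath) => {
    const worker = await createWorker('eng');
    try {
        const { data } = await worker.recognize(imagePath);
        return data.text;
    } finally {
        await worker.terminate();
    }
};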