Thoughts on Using AI to Write Scrapers for Report Generation

Once I actually sat down to build a scraper and hook it up to AI for report processing, I found the execution is full of small practical details. Full automation sounds great in theory; in practice it is a hassle. The first hurdle is the scraping itself: either pay for a service (proxies or a scraping API) as a shortcut, or spend a lot of your own time fighting anti-scraping mechanisms. Time is money.
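
To make the do-it-yourself route a bit more concrete, here is a minimal sketch of the kind of retry loop that tends to come up when fighting anti-scraping measures: rotate through proxies and back off between attempts. Everything here is an assumption for illustration: `fetch_with_retries`, its parameters, and the injected `fetch` callable are hypothetical names, and a real scraper would plug in an actual HTTP client and catch narrower errors.

```python
import random
import time
from typing import Callable, Optional, Sequence


def fetch_with_retries(
    url: str,
    fetch: Callable[[str, Optional[str]], str],
    proxies: Sequence[str] = (),
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url, proxy) until it succeeds or attempts run out.

    `fetch` is a hypothetical callable supplied by the caller; it should
    return the page body or raise an exception when the request is blocked.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        # Rotate through the proxy list; None means a direct connection.
        proxy = proxies[attempt % len(proxies)] if proxies else None
        try:
            return fetch(url, proxy)
        except Exception as exc:  # assumption: a real scraper would be more specific
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff plus a little jitter before the next try.
                sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(
        f"giving up on {url} after {max_attempts} attempts"
    ) from last_error
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic separate from the HTTP client, so it can be exercised without touching the network.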

AI Limitations

Next come AI's limitations. AI badly lacks context: it can read and understand the text, but it doesn't know what you actually want from it. Unless you supply precise documents as a knowledge base (KB) along with explicit instructions, it can't produce meaningful “analysis”. At best you get a summary, which every AI service already offers as a standard feature, and a summary may not even be what you were after.
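
One way to picture "KB plus explicit instructions" is as a prompt assembled from three parts: the task, the background documents, and the scraped text. The sketch below is only an illustration under that assumption; `build_analysis_prompt` and the section labels are hypothetical and not tied to any particular AI service.

```python
from typing import Sequence


def build_analysis_prompt(
    instruction: str,
    kb_excerpts: Sequence[str],
    scraped_text: str,
) -> str:
    """Combine an explicit task, KB excerpts, and scraped content into one prompt."""
    # Number the KB excerpts so the model's answer can point back to them.
    kb_block = "\n\n".join(
        f"[KB {i}] {text.strip()}" for i, text in enumerate(kb_excerpts, start=1)
    )
    return (
        "Task:\n" + instruction.strip() + "\n\n"
        "Background knowledge:\n" + (kb_block or "(none provided)") + "\n\n"
        "Scraped content:\n" + scraped_text.strip() + "\n\n"
        "Do not just summarize. Relate the scraped content to the background "
        "knowledge and answer the task directly."
    )
```

For example, `build_analysis_prompt("Compare this month's pricing changes against our product line.", kb_docs, page_text)` gives the model both the goal and the reference material, where `kb_docs` and `page_text` are whatever the pipeline has collected.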

Model Selection

Model selection turns out to be the simplest part. Just use the latest model and you are most of the way there; at most, wire up a few providers' APIs as backups. Whichever version produces the output, a human still has to evaluate the result in the end.
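
Wiring up backup providers can be as simple as trying them in order. This is a rough sketch, assuming each provider is wrapped in a plain function that takes a prompt and returns text; `generate_with_fallback` and the `Provider` alias are hypothetical names, not any vendor's SDK.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, call) pair; `call` wraps one vendor's API (assumed).
Provider = Tuple[str, Callable[[str], str]]


def generate_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, output) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # assumption: any failure means "try the next one"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the output makes it easy to note which model produced a given report when a person reviews it later.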

The Real Cost

Finally, a rough cost estimate. The external services and APIs are a cost you can't avoid. The good news is that these fees may come to only around $100. The real cost is labor: routine maintenance of the whole pipeline, plus adjusting the report's sources, process, and output, takes at least 6-9 hours of human effort per month, and the initial development takes considerable time as well. I wonder how consulting firms handle this these days.
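
As a back-of-the-envelope check on why labor dominates, the figures above can be put into a small calculation. Two things here are assumptions and not from my actual numbers: the $100 is treated as a monthly fee, and the $50 hourly rate is purely hypothetical. Development time is left out.

```python
def monthly_cost(service_fees_usd: float, maintenance_hours: float, hourly_rate_usd: float) -> float:
    """Recurring monthly cost: external services plus maintenance labor."""
    return service_fees_usd + maintenance_hours * hourly_rate_usd


HOURLY_RATE_USD = 50   # hypothetical rate, for illustration only
SERVICE_FEES_USD = 100  # assumed to be a monthly figure

low = monthly_cost(SERVICE_FEES_USD, 6, HOURLY_RATE_USD)   # 100 + 6 * 50 = 400
high = monthly_cost(SERVICE_FEES_USD, 9, HOURLY_RATE_USD)  # 100 + 9 * 50 = 550
print(f"estimated monthly cost: ${low} to ${high}")
```

Under these assumptions, labor is three to four and a half times the service fees, before counting any development time.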