Why Extracting Text from PDFs Still Feels Like a Hack, and the Legacy Design That Keeps AI Stuck

by Paolo Perrone · 6 min read · 2025/07/01

Too Long; Didn't Read

PDF was designed in the 1990s to make sure documents look the same everywhere: a fine solution for the print era, but a headache for modern machine processing. Its print-first design produces files with no semantic structure.

Developers working with LLMs run into document analysis quickly. And every few months there is a new wave of hype (or frustration) around the PDF problem. In those moments, it is not unusual to see software engineers wondering how a single file format became such a large obstacle. But the problem is not new.

Long before LLMs entered the picture, an entire SaaS ecosystem was built around taming the mess of PDFs. And for good reason: it is a format notoriously resistant to structured, programmatic access.

When software becomes as ubiquitous as Adobe Acrobat and the PDF format, it starts to feel like part of the landscape. It is easy to forget that behind that ubiquity were real design decisions, constraints, and tradeoffs made by real engineers solving real problems. Problems that, over time, evolved and became the roots of today's pain.

Yes, PDFs are broken. But they are broken by design. Or rather, they are a sensible solution that has outlived its era.

So let's rewind. This article takes a few steps back to examine the roots of the PDF format: how it came to be, the conditions it was responding to, and how solutions devised in the 1990s shaped what followed. The goal: to ask not only "why is this so hard?" but also "how did we get here?"

The shift of the '80s, from paper to pixels.

The revolution had begun. Personal computers were spreading quickly, and paper-based work was giving way. Software such as VisiCalc, WordStar, WordPerfect, and Microsoft Word introduced a new way of writing, editing, and sharing.

By the late '80s, PC software suites had all but taken over the office. Workers could revise documents minutes before they were due. They ran "what-if" scenarios from their desks, drafted documents at their desks, and swapped paper plans for digital blueprints.

More than that, documents had become a new kind of workspace. They were no longer just the finished product, but the place where the work was done.

The 1990s and the birth of PDF.

In the early 1990s, the rise of PC-based word processing and electronic file sharing created many problems alongside the new possibilities. Every computer had its own fonts, printer drivers, and layout quirks. A report that looked fine on one computer could print as a mess on another. Sharing files was unreliable.

To solve this, in 1991 Adobe co-founder John Warnock and his team started a project called "Camelot" to create a truly portable format. The result was PDF, a file that bundles fonts, graphics, and page layout in one place. This "digital paper" ensured that documents looked the same everywhere, whether on Windows, a Mac, or anything else.

By embedding all text, images, and fonts in a single file, PDF let users share documents without worrying about the recipient's setup, and what appeared on screen printed exactly the same everywhere. Adobe made Acrobat Reader free in 1994, and within a few years PDF became the go-to format for everything from product manuals and corporate reports to government forms and academic papers.

By the early 2000s, "export as PDF" was a one-click feature in most authoring tools, and entire industries relied on it for distribution, archiving, and compliance.

The PDF Design Trap


PDF's greatest strength (its promise of pixel-perfect fidelity) came with a hidden cost: content locked into a strict, print-first structure.

Underneath each neatly rendered page is essentially a digital snapshot, built to satisfy a printer. Headings, tables, paragraphs: none of them carry any semantic meaning. To a computer, they are just coordinates and text boxes scattered across a page.
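To make "coordinates and text boxes" concrete, here is a minimal sketch in Python. It assumes a hand-written, heavily simplified page content stream (real streams are usually compressed and use many more operators) and handles only two real PDF operators: `Tm`, which sets the text position, and `Tj`, which shows a string. The sample text and the `positioned_text` helper are made up for illustration.

```python
import re

# Hand-written, simplified page content stream (an assumption for illustration).
CONTENT_STREAM = """
BT
/F1 18 Tf
1 0 0 1 72 720 Tm
(Quarterly Report) Tj
/F1 11 Tf
1 0 0 1 72 690 Tm
(Revenue grew in the third quarter.) Tj
1 0 0 1 72 660 Tm
(Region) Tj
1 0 0 1 200 660 Tm
(Revenue) Tj
ET
"""

# Tm takes six numbers (the last two are x and y); Tj shows a literal string.
TOKEN = re.compile(
    r"(?P<tm>(?:-?\d+(?:\.\d+)?\s+){6}Tm)|(?P<tj>\((?P<text>[^)]*)\)\s*Tj)"
)


def positioned_text(stream: str) -> list[tuple[float, float, str]]:
    """Return (x, y, text) for every Tj, using the most recent Tm position."""
    x = y = 0.0
    runs = []
    for match in TOKEN.finditer(stream):
        if match.group("tm"):
            numbers = match.group("tm").split()[:6]
            x, y = float(numbers[4]), float(numbers[5])
        else:
            runs.append((x, y, match.group("text")))
    return runs


for x, y, text in positioned_text(CONTENT_STREAM):
    print(f"x={x:6.1f} y={y:6.1f} {text!r}")
```

Nothing in the output says that the first run is a heading or that the last two runs are cells of the same table row. Any such structure has to be inferred from font size and position.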


At first, this did not matter. But as documents moved from desktops to web browsers, mobile screens, and automated pipelines, the cracks began to show. Want to extract clean data? Reflow text on a phone? Parse the structure of a document? Suddenly, what was obvious to humans was opaque to machines.
Ideal vs. canvas: why PDF feels uniquely hostile

PDF and other modernization efforts

Adobe was not blind to the problem. Tagged PDF (introduced in 2001 and later formalized in PDF/UA for accessibility) provides an HTML-like structure. It has not reached universal adoption, but it is common in government documents and widely used in large enterprise workflows. Other milestones, such as PDF/A for long-term archiving, XMP metadata support, and the 2008 handover of the specification to ISO, show sustained efforts to modernize the format. Still, adoption is uneven; tagging remains optional for most users.
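As a rough illustration of how optional tagging is, a tagged PDF declares itself with a `/MarkInfo` dictionary whose `/Marked` entry is `true` in the document catalog. The Python sketch below is a crude heuristic, not a validator: it scans raw bytes, so it only works when the catalog is stored uncompressed, and the `looks_tagged` helper and sample bytes are invented for this example.

```python
import re

# Matches "/MarkInfo << ... /Marked true" in uncompressed PDF bytes.
MARKED = re.compile(rb"/MarkInfo\s*<<[^>]*/Marked\s+true")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if an uncompressed catalog declares the file as tagged."""
    return MARKED.search(pdf_bytes) is not None


# Hand-written catalog objects, standing in for real files.
TAGGED = b"1 0 obj\n<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>\nendobj"
UNTAGGED = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj"

print(looks_tagged(TAGGED))
print(looks_tagged(UNTAGGED))
```

A PDF library that parses the catalog properly is the safer choice for real files; this only shows that tagging is a flag a producer can simply leave out.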


An entire SaaS ecosystem has grown up to fill this gap. It ranges from heavyweights such as DocuSign, to web-based PDF editors such as DocHub, to open-source libraries such as Poppler, which many people use simply to pull text out of PDFs.

On top of that, the major cloud providers have each thrown a full AI stack at it: AWS with Textract, Google with Document AI, and Microsoft with Azure AI Document Intelligence. Vendors piled in, products piled up, and a great deal of money followed. Adobe, rightly or wrongly, stayed in the game.

The rise of AI-native PDF handling

When ChatGPT took off, the "PDF problem" took off with it. Companies rushed to feed their data to LLMs, only to hit a wall: most of that knowledge was locked inside PDFs.

At first the goal was simple: just get clean text for Retrieval-Augmented Generation (RAG). But this quickly proved much harder than expected. Without layout awareness, text from columns was interleaved, tables were flattened into meaningless strings, images were dropped, and important context was lost.
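The column problem is easy to reproduce. The sketch below, in Python, uses a hypothetical two-column page described as positioned fragments (the coordinates, the `Fragment` type, and the fixed `gutter_x` are all assumptions). Sorting top to bottom and then left to right, which is what naive extraction amounts to, interleaves the two columns; splitting at the gutter first restores the reading order.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the text run, in points
    y: float   # baseline, measured from the bottom of the page
    text: str


# Hypothetical two-column page: left column at x=72, right column at x=320.
PAGE = [
    Fragment(72, 700, "PDF stores text"),
    Fragment(320, 700, "Tables suffer"),
    Fragment(72, 686, "as positioned runs."),
    Fragment(320, 686, "the same fate."),
]


def naive_order(fragments):
    """Top to bottom, then left to right: ignores columns entirely."""
    return [f.text for f in sorted(fragments, key=lambda f: (-f.y, f.x))]


def column_aware_order(fragments, gutter_x):
    """Read everything left of the gutter first, then everything right of it."""
    left = [f for f in fragments if f.x < gutter_x]
    right = [f for f in fragments if f.x >= gutter_x]
    return naive_order(left) + naive_order(right)


print(" ".join(naive_order(PAGE)))
print(" ".join(column_aware_order(PAGE, gutter_x=300)))
```

In practice the gutter is not known in advance and varies from page to page, which is part of why layout detection ends up being a modeling problem rather than a sort.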


Modern Document AI trains models to understand the visual and logical layout of a document: detecting titles, paragraphs, tables, and images. With that, the AI can connect related information, ignore repeated headers and footers, and recover the overall structure.
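Layout models learn this from data, but the header and footer case can be approximated with a much simpler frequency heuristic, sketched here in Python. This is a stand-in for illustration, not how those models work: the `strip_repeated_lines` helper, the `min_share` threshold, and the sample pages are all assumptions.

```python
from collections import Counter


def strip_repeated_lines(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    """Drop lines that appear on at least min_share of the pages."""
    counts = Counter()
    for lines in pages:
        counts.update(set(lines))  # count each distinct line once per page
    threshold = min_share * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [[line for line in lines if line not in repeated] for lines in pages]


# Hypothetical extracted text, one list of lines per page.
PAGES = [
    ["ACME Annual Report", "Revenue grew.", "Confidential"],
    ["ACME Annual Report", "Costs fell.", "Confidential"],
    ["ACME Annual Report", "Outlook is stable.", "Confidential"],
]

print(strip_repeated_lines(PAGES))
```

Exact matching misses footers that change on every page, such as "Page 3 of 12", which is one reason position-aware models do better than string counting.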


The AI stack has grown a whole new layer to deal with this. Reliable data extraction now requires specialized components:

  • Layout analysis to understand the document structure,
  • OCR to extract text from images and scanned documents,
  • VLM processing to handle the various elements with AI.
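One way to picture how these components fit together is a per-page router that picks the cheapest stage likely to work. The Python sketch below is purely hypothetical: the `Page` fields, the `min_chars` threshold, and the routing rules are assumptions made for illustration, not the design of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    embedded_text: str   # text recovered from the PDF's own text operators
    has_images: bool


def choose_stage(page: Page, min_chars: int = 40) -> str:
    """Pick a processing stage for one page (hypothetical rules)."""
    text_length = len(page.embedded_text.strip())
    if text_length < min_chars:
        return "ocr"      # almost no embedded text: likely a scan or a photo
    if page.has_images:
        return "vlm"      # text plus figures: hand off to a vision-language model
    return "layout"       # born-digital text: layout analysis may be enough


pages = [
    Page(1, "Revenue grew in the third quarter across all regions.", False),
    Page(2, "", True),
    Page(3, "Figure 2 shows revenue by region for the last four quarters.", True),
]
print([choose_stage(p) for p in pages])
```

Real pipelines often run several stages on the same page and merge the results; a hard switch like this is only the simplest possible arrangement.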

Custom AI pipeline layers required for document processing

The irony is striking: we are deploying some of the largest AI models ever built to undo a problem created 30 years ago by treating documents like photographs.

Although PDF was designed with care, its print-first DNA makes it resist every downstream processing tool. Other formats, whether scanned or photographed, present some of the same challenges, but with PDF the problem is built into the structure itself.

The long road

Decades of PDFs cannot be undone overnight, but we can stop adding to the pile. For new material, choose born-digital formats that carry semantics by default:

  • HTML5 for the web,
  • a Markdown-derived standard for technical docs,
  • or DOCX/OOXML where Office interoperability is a requirement.
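The payoff of built-in semantics shows up as soon as you try to read structure back out. As a small sketch, the Python below recovers a document outline from HTML using only the standard library's `html.parser`; the `OutlineParser` class and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect (level, title) pairs from h1-h6 elements."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None   # heading level currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.outline.append((self._level, "".join(self._buffer).strip()))
            self._level = None


HTML = """
<article>
  <h1>Quarterly Report</h1>
  <p>Revenue grew in the third quarter.</p>
  <h2>Regional results</h2>
  <table><tr><th>Region</th><th>Revenue</th></tr></table>
</article>
"""

parser = OutlineParser()
parser.feed(HTML)
print(parser.outline)
```

No model, no coordinates, no guessing: the headings are labeled as headings in the file itself, which is exactly what a print-first format leaves out.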


When a fixed-layout file is unavoidable, export it with full tags and good metadata; some authoring tools already do this by default. Government procurement rules that require PDF/UA conformance are showing good results. Similar pressure from customers, suppliers, and regulators could move tagging from "nice-to-have" to "table stakes".

Longer-term standards, such as the W3C's Portable Web Publications or EPUB 3, along with emerging containerized JSON-based formats, show that portability is possible without sacrificing structure. Supporting these in mainstream authoring tools (and educating users to adopt them) would spare the next generation from building vision models just to pull text out of a contract.

The history of PDF shows that its problems stem from design decisions made decades ago. The lesson is not to blame the engineers who solved the problem of 1991; it is to recognize that today's elegant solutions become tomorrow's hard constraints. We should put semantics first, back open, machine-readable standards, and make sure the next generation of document technology is built for machines as well as people.


For teams working with legacy formats, tools like Chunkr provide an API-based pipeline for converting complex documents into structured formats that are ready for LLM and RAG workflows, available as hosted endpoints or as a self-managed deployment.


Chunkr’s document section detection features in action


Looking to grow your audience as a Tech Professional?

Check out The Tech Audience Accelerator, the go-to newsletter for tech creators serious about growing their audience. You will get the frameworks, templates, and tactics behind my 30M+ impressions (and counting).



