Hacker News

Yes, Googlebot can read keywords in your site logo images using OCR.

Can somebody confirm that? I have never heard of this before; it seems cool.



I was surprised too. I don't see why Google would spend CPU time on this, and I can't think of a way to validate that claim.


> why Google would spend CPU time on this

CPU time is cheap, and it could be useful ranking information. Compared to an H1 tag, a graphical header is harder to create, so it shows a certain level of investment.

If the words in the graphic aren't repeated elsewhere in the page, then the page author may be naive in the ways of SEO. But that could be good, because the savvy SEO people are all trying to pull one over on the GoogleBot.


CPU time may be cheap, but Google has billions of pages and images to index. If they were able to extract text from images, they'd probably use it for their image search. I'm also not convinced that the header image is a good indicator of relevancy.


Tens or hundreds of billions of pages -- but many, many fewer images. Images repeat across a site, and change much less often than text. Header and navigational images are easy to pick out and also relatively easy to OCR. (These images are not trying to be inscrutable, like CAPTCHAs.)
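
To make the "many fewer images" point concrete, here is a minimal sketch of why repeated images keep the OCR workload small. It is an illustration only, not a description of how Google's crawler works: the `pages` structure, the byte strings, and the `unique_images` helper are all made up for the example. The idea is just that if images are keyed by a content hash, each distinct image needs to be OCRed once, however many pages reference it.

```python
import hashlib

def unique_images(pages):
    """Map content hash -> image bytes, so each distinct image is processed once."""
    seen = {}
    for page in pages:
        for image_bytes in page["images"]:
            digest = hashlib.sha256(image_bytes).hexdigest()
            # Keep only the first copy of each distinct image.
            seen.setdefault(digest, image_bytes)
    return seen

# Hypothetical crawl: three pages share one logo; one page has an extra banner.
logo = b"fake-logo-bytes"
banner = b"fake-banner-bytes"
pages = [
    {"url": "/", "images": [logo, banner]},
    {"url": "/about", "images": [logo]},
    {"url": "/contact", "images": [logo]},
]

references = sum(len(p["images"]) for p in pages)  # image references across pages
distinct = unique_images(pages)                    # images that would need OCR
print(references, len(distinct))
```

On a real site the gap between references and distinct images would be far larger, since the same header and navigation images appear on nearly every page.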

Further: Google has an intense interest in OCR, adopting the open-source 'Tesseract' project, spending millions on scanning first catalogs, then books and journals, and most recently announcing that they are OCRing bitmaps in PDFs:

http://googleblog.blogspot.com/2008/10/picture-of-thousand-w...

Finally: any reasoning based on Google being miserly with cycles is going to be wrong. (They invest a lot in efficiency, yes, but that's so that they can spend cycles freely to collect data.)

It's possible they ran an experiment and found text in embedded/header images was no better than inline text. (I doubt they would find such a thing, because generally indicators of sustained effort -- careful design, site longevity, good writing -- are also indicators of site quality.) But there's zero chance the CPU cost or scale deterred Google from testing the idea.


Just one additional reason to add to your own: the original intent of YouTube, as I recall, was to OCR video for search indexing, which is a lot more complicated and processor-intensive than even OCRing pictures. Google bought YouTube; presumably this tech, lying about somewhere in their archives, came with the acquisition.


I wouldn't be surprised if they tested it. I agree with you on that.



