2008-01-12
Two extractors couldn't work together
Keywords: ruby scrubyt
Hi, everyone,
I have been enjoying Scrubyt for several days, and it has worked well in most cases. However, a problem came up when I scraped URLs from Google and Yahoo in the same script. Here is my code:
require 'rubygems'
require 'scrubyt'

Scrubyt.logger = Scrubyt::Logger.new
query = 'ruby'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', query
  submit
  # retrieve the result links by XPath
  title "/html/body/div/div/div/a" do
    url "href", :type => :attribute
  end
end # end of the Google extractor

google_file = File.open("google.xml", "w")
google_data.to_xml.write(google_file, 1)
google_file.close

yahoo_data = Scrubyt::Extractor.define do
  fetch 'http://search.yahoo.com'
  fill_textfield 'p', query
  submit
  # retrieve the result links by XPath
  title "/html/body/div/div/div/div/div/div/div/ol/li/div/h3/a" do
    url "href", :type => :attribute
  end
end # end of the Yahoo extractor

yahoo_file = File.open("yahoo.xml", "w")
yahoo_data.to_xml.write(yahoo_file, 1)
yahoo_file.close
Running Environment: Ubuntu 7.04 + Netbeans 6.0 + Scrubyt
google.xml
<root>
<title>
<url>http://www.ruby-lang.org/</url>
</title>
<title>
<url>http://www.ruby-lang.org/en/20020101.html</url>
</title>
...
</root>
yahoo.xml
<root>
<title>
<url>http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AchtXNyoA;_ylu=X3oDMTE5MXY5dDllBHNlYwNzcgRwb3MDMQRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx/SIG=11ff2e34s/EXP=1200144362/**http%3a//www.ruby-lang.org/en</url>
</title>
<title>
<url>http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AdBtXNyoA;_ylu=X3oDMTE5cHJpN25qBHNlYwNzcgRwb3MDMgRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx/SIG=12aq03736/EXP=1200144362/**http%3a//en.wikipedia.org/wiki/Ruby_programming_language</url>
</title>
...
</root>
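The Yahoo links above are redirect URLs that carry the real target, percent-encoded, after a `/**` marker. To compare them with the plain URLs from another run, the target can be pulled out with the standard library. This is only a sketch: the helper name `yahoo_target` is made up for illustration, and it assumes the `/**` layout seen in the two sample URLs holds for every result.

```ruby
require 'uri'

# Hypothetical helper (not part of Scrubyt): return the target URL that a
# Yahoo redirect link carries after its "/**" marker. URLs without the
# marker are returned unchanged.
def yahoo_target(url)
  marker = url.index('/**')
  return url if marker.nil?

  encoded = url[(marker + 3)..-1]
  # Decodes %XX escapes; note that it also turns "+" into a space.
  URI.decode_www_form_component(encoded)
end

redirect = 'http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AchtXNyoA/**http%3a//www.ruby-lang.org/en'
puts yahoo_target(redirect)
puts yahoo_target('http://www.ruby-lang.org/')
```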
If I switch the order of the two extractors, that is, define the Yahoo extractor first, the results change:
google.xml
<root/>
yahoo.xml
<root>
<title>
<url>http://www.ruby-lang.org/en</url>
</title>
<title>
<url>http://en.wikipedia.org/wiki/Ruby_programming_language</url>
</title>
...
</root>
It seems that the later extractor is influenced by the earlier one. Since the XPath I used for Yahoo is longer than the one for Google, the result from Google is empty when the Yahoo extractor is defined first.
Why is that and how can I overcome this problem? Thanks in advance.
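One possible workaround, if the interference really comes from state shared inside a single Ruby process (an assumption, not something confirmed here), is to run each extractor in its own child process. The sketch below uses only `fork`, `IO.pipe` and `Marshal` from core Ruby. The helper name `run_isolated` and the global `$shared_state` are hypothetical stand-ins; no Scrubyt call appears in it, so it runs without the gem, on platforms that support `fork` such as the Ubuntu setup above.

```ruby
# Hypothetical helper: run the given block in a forked child process and
# return its (Marshal-able) result to the parent. Anything the block
# changes in memory stays in the child and is discarded when it exits.
def run_isolated
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(Marshal.dump(yield))
    writer.close
    exit!(0) # leave the child without running at_exit handlers
  end
  writer.close
  payload = reader.read # reads until the child closes its end of the pipe
  reader.close
  Process.wait(pid)
  Marshal.load(payload)
end

# Stand-in for whatever state might leak between two extractors.
$shared_state = []

first  = run_isolated { $shared_state << :google; $shared_state.dup }
second = run_isolated { $shared_state << :yahoo;  $shared_state.dup }

p first
p second
p $shared_state # the parent's copy is untouched by either child
```

In the script above, each block would hold one `Scrubyt::Extractor.define` call together with the lines that write its XML file, so that neither extractor ever sees what the other one left behind.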
Comment by Dustin:
Update Scrubyt 0.3.4 to 0.4.01