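Implementing data scraping (a web crawler) in Java
In day-to-day work we often need to obtain or generate data quickly but do not have a large dataset at hand, so this post walks through a small example of scraping data from a website.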
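Scraping target: an online bookstore (book.dangdang.com)
For each book I scrape the cover image URL and the title and store them in a database (MySQL), and I also save the images locally (on the D: drive).
Main dependency (the code also uses Jsoup to parse the HTML, so the org.jsoup:jsoup artifact is needed as well):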
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.14</version>
</dependency>
Code:
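import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.UUID;
// Goods and GoodsMapper belong to the project itself; import them from your own packages.

@Component // must be a Spring bean so that GoodsMapper can be injected
public class Spide {
    @Autowired
    private GoodsMapper goodsMapper;

    // Fetch one list page, then save every book's cover image and title
    public void getHTML(String url) {
        HttpGet httpGet = new HttpGet(url);
        // try-with-resources closes the client and the response even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Decodes with the charset from the Content-Type header (ISO-8859-1 if none is declared)
            String content = EntityUtils.toString(response.getEntity());
            Document document = Jsoup.parse(content);
            // Each book on the list page is an element with the class "tushu"
            Elements elements = document.select(".tushu");
            for (Element element : elements) {
                Element img = element.select(".cover > a > img").first();
                Element name = element.select(".name").first();
                if (img == null || name == null) {
                    continue; // skip entries without a cover image or a title
                }
                Goods goods = new Goods();
                String imageUrl = img.attr("src");
                getImage(imageUrl);
                goods.setPicture(imageUrl);
                goods.setName(name.text());
                goodsMapper.insert(goods);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Download the image and store it on the D: drive
    public void getImage(String imageUrl) {
        String imageDir = "D:/spideimages/";
        new File(imageDir).mkdirs(); // create the target directory if it does not exist yet
        String newFileName = UUID.randomUUID().toString().replace("-", "");
        String suffix = imageUrl.substring(imageUrl.lastIndexOf("."));
        HttpGet httpGet = new HttpGet(imageUrl);
        // All four resources are closed automatically, in reverse order
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet);
             InputStream is = response.getEntity().getContent();
             FileOutputStream fos = new FileOutputStream(imageDir + newFileName + suffix)) {
            byte[] b = new byte[1024];
            int len;
            while ((len = is.read(b)) != -1) {
                fos.write(b, 0, len);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}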
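One fragile spot in getImage is the suffix extraction. imageUrl.substring(imageUrl.lastIndexOf(".")) keeps any query string that follows the extension, and if the path has no extension the last dot belongs to the host name, which produces an invalid file name. A minimal sketch of a more defensive version, using only the JDK, could look like this. The class name ImageNames, the method names, the example.com URLs, and the ".jpg" fallback are assumptions made for illustration; they are not part of the original code.

```java
import java.net.URI;
import java.util.UUID;

// Hypothetical helper (not in the original post): builds a local file name for a downloaded image.
class ImageNames {

    // Returns the extension of the URL's path (including the dot), or ".jpg" as an assumed fallback.
    static String suffixOf(String imageUrl) {
        // URI.getPath() leaves out the query string and fragment, e.g. "?v=2"
        String path = URI.create(imageUrl).getPath();
        if (path == null) {
            return ".jpg";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // Only accept a dot inside the last path segment that is followed by at least one character
        if (dot > slash && dot < path.length() - 1) {
            return path.substring(dot);
        }
        return ".jpg";
    }

    // Same naming scheme as getImage: a random UUID without dashes, plus the suffix.
    static String randomFileName(String imageUrl) {
        return UUID.randomUUID().toString().replace("-", "") + suffixOf(imageUrl);
    }

    public static void main(String[] args) {
        System.out.println(suffixOf("https://img.example.com/covers/123.png?v=2"));
        System.out.println(suffixOf("https://img.example.com/covers/123"));
        System.out.println(randomFileName("https://img.example.com/a/b.jpg"));
    }
}
```

In getImage, the two lines that compute newFileName and suffix could then be replaced by a single call to ImageNames.randomFileName(imageUrl). Note that URI.create throws IllegalArgumentException for URLs containing illegal characters such as spaces, so real-world input may need extra handling.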
Test code:
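// spide is the Spide bean injected into the test class (for example with @Autowired)
@Test
public void test() {
    // Crawl list pages 1 through 15
    for (int i = 1; i <= 15; i++) {
        spide.getHTML("https://book.dangdang.com/list/newRelease_C01.03_P" + i + ".htm");
    }
}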
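The test requests the 15 list pages back to back. As a rough sketch, the page URLs can be built up front and visited with a short pause in between, so the target site is not hit in a tight loop. The class ListPages, its method names, and the pause length are assumptions for illustration; the fetch callback stands in for spide.getHTML, which is not available outside the Spring project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not in the original post): builds the list-page URLs and visits them with a pause.
class ListPages {
    // Same URL pattern as in the test above; %d is the page number
    static final String PATTERN = "https://book.dangdang.com/list/newRelease_C01.03_P%d.htm";

    // Builds the URLs for pages firstPage..lastPage (inclusive)
    static List<String> urls(int firstPage, int lastPage) {
        List<String> result = new ArrayList<>();
        for (int i = firstPage; i <= lastPage; i++) {
            result.add(String.format(PATTERN, i));
        }
        return result;
    }

    // Calls fetch for each URL and sleeps pauseMillis after each call
    static void crawlAll(List<String> urls, Consumer<String> fetch, long pauseMillis) {
        for (String url : urls) {
            fetch.accept(url);
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and stop early
                return;
            }
        }
    }

    public static void main(String[] args) {
        // In the real test the callback would be spide::getHTML; here the URLs are only printed
        crawlAll(urls(1, 3), System.out::println, 100);
    }
}
```

Inside the test method, the loop would then become something like ListPages.crawlAll(ListPages.urls(1, 15), spide::getHTML, 1000), where the one-second pause is an arbitrary choice.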
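Result: after the test runs, the book titles and image URLs should be in the MySQL table and the cover images should be saved under D:/spideimages/.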