Reference-Collector: From Google Scholar List to Downloaded Papers

Notes on a small toolbox to collect paper metadata and available PDFs from Google Scholar.


If you have ever tried to collect dozens (or hundreds) of papers from Google Scholar manually, you know how repetitive the process is: copy the title, search for the DOI, open the publisher link, check whether a PDF is available, and then repeat.

I recently published a toolbox to reduce this manual work:

Reference-Collector on GitHub

Reference-Collector icon


I. What is Reference-Collector?

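Reference-Collector is a pipeline that collects paper metadata from a Google Scholar profile or "cited by" page and downloads the PDFs that are available.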

In short: it helps convert an online Scholar list into a structured local reference workspace.

Reference-Collector pipeline


II. Why I built it

The main motivation is efficiency and traceability.

The tool is especially useful when you are starting a new literature review and want a fast first pass of what can be collected automatically.


III. Getting started

1. Install

python -m pip install -r requirements.txt

2. Run from a Scholar profile

python main.py --profile-url "google scholar profile page url" --workdir _results/demo

3. Run from a cited-by page

python main.py --cited-url "google scholar paper 'cited by' page url" --workdir _results/cited_demo

4. Metadata only (skip PDF download)

python main.py --profile-url "..." --no-download

5. Start from an existing sheet

python main.py --from-xlsx /path/to/metadata.xlsx --workdir _results/from_sheet
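When several profiles need to be collected, the commands above can be scripted. The following is a minimal sketch and not part of Reference-Collector itself: build_command and run_all are hypothetical helper names, the URL is a placeholder, and only the --profile-url, --workdir, and --no-download flags shown above are used.

```python
import shlex
import subprocess
from typing import List


def build_command(profile_url: str, workdir: str, download: bool = True) -> List[str]:
    """Build an argv list for main.py from the flags shown in this post."""
    argv = ["python", "main.py", "--profile-url", profile_url, "--workdir", workdir]
    if not download:
        # Metadata only: skip the PDF download step.
        argv.append("--no-download")
    return argv


def run_all(jobs: List[List[str]]) -> None:
    """Run each command in turn; stop at the first failure."""
    for argv in jobs:
        subprocess.run(argv, check=True)


# Placeholder URL (assumption): replace EXAMPLE_ID with a real Scholar profile id.
demo = build_command(
    "https://scholar.google.com/citations?user=EXAMPLE_ID",
    "_results/demo",
    download=False,
)

# Print the command in copy-pasteable shell form instead of running it.
print(shlex.join(demo))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with the ? and & characters that Scholar URLs usually contain.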

If Scholar presents a CAPTCHA, export cookies.txt from your browser session and pass it with --cookies-file.
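Before passing an exported cookies.txt to --cookies-file, it can help to confirm that the file parses at all. The sketch below assumes the export is in the Netscape cookies.txt format that most browser extensions produce, and checks it with Python's standard http.cookiejar module. The function name count_google_cookies is hypothetical, and this is not how Reference-Collector itself reads the file.

```python
import http.cookiejar


def count_google_cookies(path: str, domain_fragment: str = "google.") -> int:
    """Load a Netscape-format cookies.txt and count cookies for Google domains.

    Raises http.cookiejar.LoadError if the file is not in Netscape format.
    """
    jar = http.cookiejar.MozillaCookieJar(path)
    # Keep session and expired cookies so the count reflects the whole file.
    jar.load(ignore_discard=True, ignore_expires=True)
    return sum(1 for cookie in jar if domain_fragment in cookie.domain)
```

A result of zero suggests the export did not include the Scholar session, in which case it is probably worth exporting again while logged in on the Scholar page.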

Reference-Collector CLI screenshot

Reference-Collector UI screenshot


IV. What the outputs look like

The default outputs include:

This means you can finish automated collection first, then spend manual effort only on unresolved items.
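The triage step described above can be sketched in a few lines. This is an illustration only: the record layout and the pdf_path and title field names are assumptions made for the example, not the actual columns written by Reference-Collector, and the two sample records are invented.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def split_by_pdf(records: List[Record], pdf_key: str = "pdf_path") -> Tuple[List[Record], List[Record]]:
    """Split records into (resolved, unresolved) by whether a PDF path is set."""
    resolved: List[Record] = []
    unresolved: List[Record] = []
    for record in records:
        # An empty or missing value means no PDF was collected automatically.
        if record.get(pdf_key, "").strip():
            resolved.append(record)
        else:
            unresolved.append(record)
    return resolved, unresolved


# Invented sample records with hypothetical field names.
sample = [
    {"title": "Paper A", "pdf_path": "pdfs/paper_a.pdf"},
    {"title": "Paper B", "pdf_path": ""},
]
done, todo = split_by_pdf(sample)
print([r["title"] for r in todo])
```

Only the records in the second list would then need manual follow-up.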

Reference-Collector report example


V. Notes

If you are interested, check the project details and updates here:

https://github.com/Zhang-Xuewen/Reference-Collector

License

The project is released under the Apache License, Version 2.0. See LICENSE for details.

Copyright 2026 Xuewen Zhang

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
